Stream audio and receive real-time transcription results using the Universal-3 Pro Streaming model. It is the most accurate streaming model for voice agents that demand the highest quality, and it supports advanced prompting. Supported languages: English, Spanish, German, French, Portuguese, and Italian.
To use the EU server for Streaming STT, replace streaming.assemblyai.com with streaming.eu.assemblyai.com.
WSS wss://streaming.assemblyai.com/v3/ws

Authentication

Authenticate by passing your API key in the Authorization header when establishing the WebSocket connection. Alternatively, generate a temporary token and pass it via the token query parameter.
Authorization
string
Use your API key for authentication, or alternatively generate a temporary token and pass it via the token query parameter.

Query parameters

speech_model
string
The speech model to use. Allowed values: u3-rt-pro.
encoding
string
default:"pcm_s16le"
Encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
inactivity_timeout
integer
Optional time in seconds of inactivity before the session is terminated (integer, minimum 5, maximum 3600). If not set, no inactivity timeout is applied.
keyterms_prompt
string
A list of words and phrases to improve recognition accuracy for. See Keyterms Prompting for more details.
language_detection
boolean
default:"false"
Whether to return language_code and language_confidence in Turn messages. Universal-3 Pro Streaming natively code-switches between English, Spanish, German, French, Portuguese, and Italian with no configuration required. Allowed values: true, false.
max_turn_silence
integer
default:"1000"
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. See Configuring Turn Detection for configuration details.
min_turn_silence
integer
default:"100"
Silence duration in milliseconds before a speculative end-of-turn check. If terminal punctuation is found, the turn ends. Otherwise, a partial is emitted and the turn continues. See Configuring Turn Detection for configuration details.
prompt
string
Prompting is a beta feature. Custom transcription instructions for the model. When not provided, a default prompt optimized for native turn detection is used automatically. See the Prompting Guide for details.
sample_rate
integer
default:"16000"
Sample rate of the audio stream.
speaker_labels
boolean
default:"false"
Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution. Allowed values: true, false.
max_speakers
integer
The maximum number of speakers expected in the audio stream (integer, 1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when speaker_labels is enabled. See Streaming Diarization for more details.
token
string
API token for authentication (if using a temporary token).
vad_threshold
number
default:"0.3"
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
continuous_partials
boolean
default:"false"
Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled (default), only one early partial is emitted near turn start. When enabled, additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected.
include_partial_turns
boolean
default:"true"
Whether to emit partial transcripts during the turn. When enabled (default), partial transcripts are forwarded as speech is still in progress alongside final turns. When disabled, only final turns (with end_of_turn true) are sent. Defaults to false when redact_pii is enabled, to prevent unredacted partial transcripts from reaching the client; set explicitly to true to override.
interruption_delay
integer
default:"500"
Delay in milliseconds before the first partial is emitted. Useful for tuning voice agent barge-in responsiveness or for getting earlier partials for early LLM inference. Larger values give more confidence that an interruption is real; smaller values give a faster time to first partial.
domain
string
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr). If used with an unsupported language, the parameter is ignored and a warning is returned. Allowed values: medical-v1.
filter_profanity
boolean
default:"false"
Whether to filter profanity from the transcribed text. See Profanity Filtering for more details. Allowed values: true, false.
redact_pii
boolean
default:"false"
Whether to redact PII from the transcribed text using the Redact PII model. Only applies to final turns. See PII Redaction for more details. Allowed values: true, false.
redact_pii_policies
string
The list of PII Redaction policies to enable. Requires redact_pii to be true. See PII Redaction for more details.
redact_pii_sub
string
default:"hash"
The replacement logic for detected PII. Requires redact_pii to be true. See PII Redaction for more details. Allowed values: entity_name, hash.
llm_gateway
string
JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
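As a minimal sketch of how the query parameters above combine into a connection URL, the following Python helper appends them to the endpoint. The helper name build_streaming_url is hypothetical, and rendering booleans as lowercase true/false is an assumption based on the allowed values listed above. The API key is not part of the URL; it goes in the Authorization header.

```python
import json
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(base_url=BASE_URL, **params):
    """Return the WebSocket URL with the given query parameters appended.

    Hypothetical helper, not part of any SDK.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # omit unset parameters so the server defaults apply
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true"/"false".
            value = "true" if value else "false"
        elif isinstance(value, dict):
            # llm_gateway is documented as a JSON-stringified object.
            value = json.dumps(value)
        query[name] = value
    return f"{base_url}?{urlencode(query)}"


url = build_streaming_url(
    speech_model="u3-rt-pro",
    sample_rate=16000,
    encoding="pcm_s16le",
    speaker_labels=True,
    max_speakers=2,
)
print(url)
```

For the EU server, pass base_url="wss://streaming.eu.assemblyai.com/v3/ws" instead.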

Messages sent by the client

Audio Data Chunk

The client sends audio data chunks for transcription as raw binary (application/octet-stream), not JSON. The payload must be of type bytes and contain between 50 ms and 1000 ms of audio. When streaming from a pre-recorded file, pace the chunks at approximately real time (for example, sleep for the chunk's duration between sends); sending chunks in a tight loop can produce inconsistent Turn messages. See the Universal-3 Pro Streaming quickstart to get started.
\x10\x00\x20\x00\x30\x00\x40\x00\x30\x00\x20\x00\x10\x00\x00\x00\xf0\xff\xe0\xff\xd0\xff\xc0\xff
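The chunk duration limits translate into byte counts that depend on the sample rate and encoding. The sketch below assumes single-channel audio (an assumption; the reference does not state the channel count), with 2 bytes per sample for pcm_s16le and 1 byte per sample for pcm_mulaw. The function names are hypothetical.

```python
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_mulaw": 1}


def chunk_size_bytes(sample_rate, chunk_ms, encoding="pcm_s16le"):
    """Bytes needed for chunk_ms milliseconds of mono audio."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk duration must be between 50 and 1000 ms")
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def iter_chunks(audio, sample_rate, chunk_ms, encoding="pcm_s16le"):
    """Yield consecutive chunks of raw audio bytes.

    The last chunk may be shorter than chunk_ms if the audio length is
    not a multiple of the chunk size.
    """
    size = chunk_size_bytes(sample_rate, chunk_ms, encoding)
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]


# One second of silence at 16 kHz, 16-bit PCM, split into 100 ms chunks.
silence = bytes(16000 * 2)
chunks = list(iter_chunks(silence, sample_rate=16000, chunk_ms=100))
print(len(chunks), len(chunks[0]))
```

When replaying a file, sleeping for chunk_ms / 1000 seconds after each send approximates the real-time pacing described above.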

Update Streaming Configuration

Client message to update streaming configuration parameters during an active session.
type
string
required
Allowed values: UpdateConfiguration.
prompt
string
Prompting is a beta feature. Custom transcription instructions for the model. See the Prompting Guide for details.
keyterms_prompt
array
A list of words and phrases to boost recognition for. See Keyterms Prompting for more details.
min_turn_silence
integer
Silence duration in milliseconds before a speculative end-of-turn check. See Configuring Turn Detection for configuration details.
max_turn_silence
integer
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. See Configuring Turn Detection for configuration details.
continuous_partials
boolean
Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled (default), only one early partial is emitted near turn start. When enabled, additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected.
vad_threshold
number
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
interruption_delay
integer
Delay in milliseconds before the first partial is emitted. Useful for tuning voice agent barge-in responsiveness or for getting earlier partials for early LLM inference. Larger values give more confidence that an interruption is real; smaller values give a faster time to first partial.
{
  "type": "UpdateConfiguration",
  "prompt": "Transcribe product names accurately.",
  "keyterms_prompt": ["AssemblyAI", "Universal-3"],
  "min_turn_silence": 700,
  "max_turn_silence": 1600
}
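A client might build this message with a small helper that rejects fields the UpdateConfiguration message does not list. This is a sketch only: the helper name update_configuration is hypothetical, and the field set is copied from the parameters documented above.

```python
import json

# Fields listed for the UpdateConfiguration message in this reference.
UPDATABLE_FIELDS = {
    "prompt",
    "keyterms_prompt",
    "min_turn_silence",
    "max_turn_silence",
    "continuous_partials",
    "vad_threshold",
    "interruption_delay",
}


def update_configuration(**fields):
    """Serialize an UpdateConfiguration message as a JSON text frame."""
    unknown = set(fields) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not listed as updatable: {sorted(unknown)}")
    return json.dumps({"type": "UpdateConfiguration", **fields})


payload = update_configuration(min_turn_silence=700, max_turn_silence=1600)
print(payload)
```

The resulting string would be sent as a text frame on the open WebSocket connection, separately from the binary audio frames.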

Force Endpoint

Client message to manually force an endpoint in the transcription.
type
string
required
Allowed values: ForceEndpoint.
{
  "type": "ForceEndpoint"
}

Terminate Session (Client Initiated)

Client message to gracefully terminate the streaming session.
type
string
required
Allowed values: Terminate.
{
  "type": "Terminate"
}

Keep Alive

Client message to reset the inactivity timer. This is not necessary by default: sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. Send this message only if you have set inactivity_timeout and want to keep the session open during periods when no audio is being sent.
type
string
required
Allowed values: KeepAlive.
{
  "type": "KeepAlive"
}
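The ForceEndpoint, Terminate, and KeepAlive messages share the same single-field shape, so one helper can serialize all three. The sketch below also suggests a KeepAlive interval derived from inactivity_timeout. The helper names and the safety factor of 0.5 are assumptions for illustration, not documented behavior.

```python
import json

CONTROL_TYPES = {"ForceEndpoint", "Terminate", "KeepAlive"}


def control_message(message_type):
    """Serialize one of the field-less client control messages."""
    if message_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {message_type}")
    return json.dumps({"type": message_type})


def keepalive_interval(inactivity_timeout, safety_factor=0.5):
    """Seconds to wait between KeepAlive messages, or None if not needed.

    safety_factor is an arbitrary margin chosen for this sketch.
    """
    if inactivity_timeout is None:
        return None  # no inactivity timeout set, so KeepAlive is unnecessary
    if not 5 <= inactivity_timeout <= 3600:
        raise ValueError("inactivity_timeout must be between 5 and 3600")
    return inactivity_timeout * safety_factor


print(control_message("KeepAlive"))
print(keepalive_interval(30))
```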

Messages received from the server

Session Begins Confirmation

Server message indicating the streaming session has successfully started.
type
string
required
Identifies the type of the message. Allowed values: Begin.
id
string
required
Unique identifier for the streaming session.
expires_at
integer
required
Unix timestamp indicating when the session will expire.
{
  "type": "Begin",
  "id": "b8e7c1a2-4f3d-4e90-9a6b-1c2d3e4f5a6b",
  "expires_at": 1748390400
}
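To work with expires_at, a client can convert it to a timezone-aware datetime. The sketch below assumes the timestamp is in seconds, which is consistent with the sample value above; the function name session_expiry is hypothetical.

```python
from datetime import datetime, timezone


def session_expiry(begin_message):
    """Return the session expiry as a UTC datetime.

    Assumes expires_at is a Unix timestamp in seconds.
    """
    if begin_message.get("type") != "Begin":
        raise ValueError("not a Begin message")
    return datetime.fromtimestamp(begin_message["expires_at"], tz=timezone.utc)


begin = {
    "type": "Begin",
    "id": "b8e7c1a2-4f3d-4e90-9a6b-1c2d3e4f5a6b",
    "expires_at": 1748390400,
}
print(session_expiry(begin).isoformat())  # 2025-05-28T00:00:00+00:00
```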

Speech Started

Server message indicating that speech has been detected.
type
string
required
Identifies the type of the message. Allowed values: SpeechStarted.
timestamp
integer
required
The timestamp in milliseconds when speech was detected, relative to the beginning of the audio stream.
confidence
number
required
The confidence score that speech has started.
{
  "type": "SpeechStarted",
  "timestamp": 1840,
  "confidence": 0.95
}

Formatted Turn Result

Server message containing a formatted turn-based transcription result.
type
string
required
Allowed values: Turn.
turn_order
integer
required
Order of this turn in the conversation.
turn_is_formatted
boolean
required
Whether this turn has been formatted. For Universal-3 Pro Streaming, this always matches end_of_turn.
end_of_turn
boolean
required
Whether this marks the end of a turn. See Turn Detection for more information.
transcript
string
required
Transcript of all finalized words in the turn.
utterance
string
Finalized transcript of the turn, populated only on end_of_turn messages. Empty string on all other Turn messages. Equivalent to transcript when populated.
language_code
string
The language of the turn. Only populated when language detection is enabled and an utterance is complete or turn is final.
language_confidence
number
The confidence score for the detected language, between 0 (low confidence) and 1 (high confidence). Only populated when language detection is enabled and an utterance is complete or turn is final.
speaker_label
string
The speaker label for this turn (e.g. A, B). Only present when speaker_labels is enabled. Short turns with less than approximately 1 second of audio will have the label UNKNOWN. See Streaming Diarization for more details.
end_of_turn_confidence
number
required
The confidence score that this is the end of a turn, between 0.0 (low confidence) and 1.0 (high confidence). For Universal-3 Pro Streaming, this is 1.0 when end_of_turn is true and 0.0 otherwise.
words
array
required
Array of word-level details for this turn.
{
  "type": "Turn",
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hello world.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "text": "Hello",
      "start": 0,
      "end": 500,
      "confidence": 0.99
    },
    {
      "text": "world.",
      "start": 500,
      "end": 1000,
      "confidence": 0.98
    }
  ]
}
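One way to consume Turn messages is to track the latest partial separately from finalized turns, keyed by turn_order. The class below is an illustrative sketch, not SDK code; the name TranscriptCollector and its methods are hypothetical.

```python
import json


class TranscriptCollector:
    """Accumulates finalized turns and remembers the latest partial."""

    def __init__(self):
        self.final_turns = {}     # turn_order -> finalized transcript
        self.latest_partial = ""  # transcript of the turn still in progress

    def handle(self, raw):
        """Process one server text frame.

        Returns True for a final turn, False for a partial turn,
        and None for any other message type.
        """
        message = json.loads(raw)
        if message.get("type") != "Turn":
            return None
        if message["end_of_turn"]:
            self.final_turns[message["turn_order"]] = message["transcript"]
            self.latest_partial = ""
        else:
            self.latest_partial = message["transcript"]
        return message["end_of_turn"]

    def text(self):
        """Finalized transcript so far, in turn order."""
        return " ".join(self.final_turns[k] for k in sorted(self.final_turns))


collector = TranscriptCollector()
collector.handle('{"type": "Turn", "turn_order": 0, "end_of_turn": false, "transcript": "Hello"}')
collector.handle('{"type": "Turn", "turn_order": 0, "end_of_turn": true, "transcript": "Hello world."}')
print(collector.text())  # Hello world.
```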

Session Terminated (Server Confirmation)

Server message confirming session termination with session statistics.
type
string
required
Indicates the session has been terminated. Allowed values: Termination.
audio_duration_seconds
integer
required
Duration of the audio in seconds.
session_duration_seconds
integer
required
Duration of the session in seconds.
{
  "type": "Termination",
  "audio_duration_seconds": 0,
  "session_duration_seconds": 0
}

LLM Gateway Response

Server message containing an LLM Gateway response for a finalized turn.
type
string
required
Identifies the type of the message. Allowed values: LLMGatewayResponse.
turn_order
integer
required
The order of the finalized turn that triggered the LLM Gateway call.
transcript
string
required
The finalized turn transcript that triggered the LLM Gateway call.
data
object
required
The chat completions response from the LLM Gateway.
{
  "type": "LLMGatewayResponse",
  "turn_order": 0,
  "transcript": "Hello world.",
  "data": {
    "request_id": "c4a91f7e-2b8d-4c50-8e16-9d6b3a2f1c08",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help?"
        },
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "input_tokens": 12,
      "output_tokens": 8,
      "total_tokens": 20,
      "prompt_tokens_details": {},
      "completion_tokens_details": {}
    },
    "request": {},
    "response_time": 842
  }
}
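Reading the assistant text out of this message could look like the following sketch. It relies only on the fields shown in the example above and returns None when they are absent; the function name assistant_reply is hypothetical.

```python
def assistant_reply(message):
    """Return the first choice's content from an LLMGatewayResponse, or None."""
    if message.get("type") != "LLMGatewayResponse":
        return None
    choices = message.get("data", {}).get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")


response = {
    "type": "LLMGatewayResponse",
    "turn_order": 0,
    "transcript": "Hello world.",
    "data": {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Hello! How can I help?"},
                "finish_reason": "stop",
            }
        ]
    },
}
print(assistant_reply(response))  # Hello! How can I help?
```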