Stream audio and receive real-time transcription results using the Universal-3 Pro Streaming model. It is the most accurate streaming model for voice agents that demand the highest quality, with advanced prompting capabilities. Supported languages: English, Spanish, German, French, Portuguese, and Italian.
To use the EU server for Streaming STT, replace streaming.assemblyai.com with streaming.eu.assemblyai.com.
WSS wss://streaming.assemblyai.com/v3/ws
Authentication
Authenticate by passing your API key in the Authorization header when establishing the WebSocket connection. Alternatively, generate a temporary token and pass it via the token query parameter.
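As a rough sketch of the two authentication options, the snippet below builds the WebSocket URL and, for API-key authentication, a header mapping. The helper name `build_connection` is hypothetical, and serializing booleans as lowercase `true`/`false` is an assumption based on the allowed values listed in this reference. How the headers are handed to the connection depends on the WebSocket client you use.

```python
from urllib.parse import urlencode

US_HOST = "streaming.assemblyai.com"
EU_HOST = "streaming.eu.assemblyai.com"


def build_connection(api_key=None, token=None, eu=False, **params):
    """Return (url, headers) for the streaming WebSocket endpoint.

    Hypothetical helper: pass either api_key (sent in the Authorization
    header) or token (sent as the `token` query parameter), not both.
    """
    if (api_key is None) == (token is None):
        raise ValueError("provide exactly one of api_key or token")

    query = {}
    for name, value in params.items():
        # Assumption: booleans are written as lowercase true/false.
        query[name] = str(value).lower() if isinstance(value, bool) else value

    headers = {}
    if token is not None:
        query["token"] = token
    else:
        headers["Authorization"] = api_key

    host = EU_HOST if eu else US_HOST
    url = f"wss://{host}/v3/ws"
    if query:
        url += "?" + urlencode(query)
    return url, headers
```

For example, `build_connection(api_key="<YOUR_API_KEY>", inactivity_timeout=60, speaker_labels=True)` returns a US-region URL carrying both query parameters, plus an Authorization header.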
Query parameters
The speech model to use. Allowed values: u3-rt-pro.
Encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
Optional time in seconds of inactivity before the session is terminated (integer, minimum 5, maximum 3600). If not set, no inactivity timeout is applied.
A list of words and phrases to improve recognition accuracy for. See Keyterms Prompting for more details.
Whether to return language_code and language_confidence in turn messages. Universal-3 Pro Streaming natively code-switches between English, Spanish, German, French, Portuguese, and Italian by default, with no configuration required. Allowed values: true, false.
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. See Configuring Turn Detection for configuration details.
Silence duration in milliseconds before a speculative end-of-turn check. If terminal punctuation is found, the turn ends. Otherwise, a partial is emitted and the turn continues. See Configuring Turn Detection for configuration details.
Prompting is a beta feature. Custom transcription instructions for the model. When not provided, a default prompt optimized for native turn detection is used automatically. See the Prompting Guide for details.
Sample rate of the audio stream.
Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution. Allowed values: true, false.
The maximum number of speakers expected in the audio stream (integer, 1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when speaker_labels is enabled. See Streaming Diarization for more details.
API token for authentication (if using a temporary token).
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled (default), only one early partial is emitted near turn start. When enabled, additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected.
Whether to emit partial transcripts during the turn. When enabled (default), partial transcripts are forwarded as speech is still in progress alongside final turns. When disabled, only final turns (with end_of_turn true) are sent. Defaults to false when redact_pii is enabled, to prevent unredacted partial transcripts from reaching the client; set explicitly to true to override.
How soon the first partial is emitted, in milliseconds. Useful for tuning voice agent barge-in responsiveness or allowing earlier partials for early LLM inference. Larger values give more confidence that an interruption is real; smaller values give a faster time to first partial.
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr). If used with an unsupported language, the parameter is ignored and a warning is returned. Allowed values: medical-v1.
Whether to filter profanity from the transcribed text. See Profanity Filtering for more details. Allowed values: true, false.
Whether to redact PII from the transcribed text using the Redact PII model. Only applies to final turns. See PII Redaction for more details. Allowed values: true, false.
The list of PII Redaction policies to enable. Requires redact_pii to be true. See PII Redaction for more details.
The replacement logic for detected PII. Requires redact_pii to be true. See PII Redaction for more details. Allowed values: entity_name, hash.
JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
Messages sent by the client
Audio Data Chunk
The client sends audio data chunks for transcription as raw binary (application/octet-stream), not JSON. The payload must be of type bytes and contain between 50 ms and 1000 ms of audio. When streaming from a pre-recorded file, pace the chunks at approximately real time (for example, sleep for the chunk's duration between sends); sending chunks in a tight loop can produce inconsistent Turn messages. See the Universal-3 Pro Streaming quickstart to get started.
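The chunk-size and pacing rules can be sketched as follows. This is a minimal illustration that assumes mono pcm_s16le audio (2 bytes per sample). `send` is a hypothetical callable standing in for your WebSocket client's binary send method, and the function names are not part of any SDK.

```python
import time


def iter_pcm_chunks(audio: bytes, sample_rate: int, chunk_ms: int = 100):
    """Yield (chunk, duration_seconds) pairs of 16-bit mono PCM audio."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms must be between 50 and 1000")
    bytes_per_sample = 2  # pcm_s16le: 16-bit samples; mono is assumed
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        # The final chunk may be shorter than 50 ms; padding or dropping
        # it is left to the caller.
        yield chunk, len(chunk) / (sample_rate * bytes_per_sample)


def stream_file(audio: bytes, sample_rate: int, send, sleep=time.sleep):
    """Send chunks through `send`, sleeping for each chunk's duration."""
    for chunk, duration in iter_pcm_chunks(audio, sample_rate):
        send(chunk)
        sleep(duration)  # pace at roughly real time
```

At 16 kHz the default 100 ms chunk is 3200 bytes. The `sleep` argument is injectable so the pacing can be replaced in tests or in an async wrapper.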
Update Streaming Configuration
Client message to update streaming configuration parameters during an active session.
Allowed values: UpdateConfiguration.
Prompting is a beta feature. Custom transcription instructions for the model. See the Prompting Guide for details.
A list of words and phrases to boost recognition for. See Keyterms Prompting for more details.
Silence duration in milliseconds before a speculative end-of-turn check. See Configuring Turn Detection for configuration details.
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. See Configuring Turn Detection for configuration details.
Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When disabled (default), only one early partial is emitted near turn start. When enabled, additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. The first partial (at 750ms) is unaffected.
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
How soon the first partial is emitted, in milliseconds. Useful for tuning voice agent barge-in responsiveness or allowing earlier partials for early LLM inference. Larger values give more confidence that an interruption is real; smaller values give a faster time to first partial.
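A configuration update might be serialized along these lines. This is only a sketch: it assumes the message is a JSON object whose `type` key carries the value UpdateConfiguration. The field name `vad_threshold` is hypothetical, since this reference does not show the field names, so check them against the actual schema before use.

```python
import json


def update_configuration(**fields) -> str:
    """Serialize an UpdateConfiguration message as a JSON string.

    Assumption: the message is a JSON object with a "type" key; the other
    keys are whatever configuration fields the caller passes in.
    """
    vad = fields.get("vad_threshold")  # hypothetical field name
    if vad is not None and not 0.0 <= vad <= 1.0:
        raise ValueError("VAD threshold must be between 0.0 and 1.0")
    return json.dumps({"type": "UpdateConfiguration", **fields})
```

Validating the range on the client side surfaces mistakes before the message is sent, rather than relying on an error from the server.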
Force Endpoint
Client message to manually force an endpoint in the transcription.
Allowed values: ForceEndpoint.
Terminate Session (Client Initiated)
Client message to gracefully terminate the streaming session.
Allowed values: Terminate.
Keep Alive
Client message to reset the inactivity timer. This is not necessary by default: sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. This message is only needed if you have set inactivity_timeout and want to keep the session open during periods when no audio is being sent.
Allowed values: KeepAlive.
Messages received from the server
Session Begins Confirmation
Server message indicating the streaming session has successfully started.
Identifies the type of the message. Allowed values: Begin.
Unique identifier for the streaming session.
Unix timestamp indicating when the session will expire.
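Converting the expiry timestamp into a timezone-aware datetime could look like the sketch below. It assumes the message is a JSON object with a `type` key. The key name `expires_at` is hypothetical, as this reference does not show it, so it is exposed as a parameter.

```python
import json
from datetime import datetime, timezone


def session_expiry(raw_message: str, expires_key: str = "expires_at"):
    """Return the session expiry as an aware UTC datetime, or None.

    Assumptions: the message is a JSON object with a "type" key, and the
    Unix timestamp is stored under `expires_key` (a hypothetical name).
    """
    message = json.loads(raw_message)
    if message.get("type") != "Begin":
        return None  # not a session-begin message
    return datetime.fromtimestamp(message[expires_key], tz=timezone.utc)
```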
Speech Started
Server message indicating that speech has been detected.
Identifies the type of the message. Allowed values: SpeechStarted.
The timestamp in milliseconds when speech was detected, relative to the beginning of the audio stream.
The confidence score that speech has started.
Formatted Turn Result
Server message containing a formatted turn-based transcription result.
Allowed values: Turn.
Order of this turn in the conversation.
Whether this turn has been formatted. For Universal-3 Pro Streaming, this always matches end_of_turn.
Whether this marks the end of a turn. See Turn Detection for more information.
Transcript of all finalized words in the turn.
Finalized transcript of the turn, populated only on end_of_turn messages. Empty string on all other Turn messages. Equivalent to transcript when populated.
The language of the turn. Only populated when language detection is enabled and an utterance is complete or turn is final.
The confidence score for the detected language, between 0 (low confidence) and 1 (high confidence). Only populated when language detection is enabled and an utterance is complete or turn is final.
The speaker label for this turn (e.g. A, B). Only present when speaker_labels is enabled. Short turns with less than approximately 1 second of audio will have the label UNKNOWN. See Streaming Diarization for more details.
The confidence score that this is the end of a turn, between 0.0 (low confidence) and 1.0 (high confidence). For Universal-3 Pro Streaming, this is 1.0 when end_of_turn is true and 0.0 otherwise.
Array of word-level details for this turn.
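One way to consume these messages is to keep only completed turns, as in the sketch below. It assumes each message arrives as a JSON object whose `type` key identifies it (the key name is an assumption). end_of_turn, transcript, and speaker_label are the fields described above, and the function name is illustrative.

```python
import json


def collect_final_turns(raw_messages):
    """Return (speaker_label, transcript) pairs for completed turns.

    Assumption: each message is a JSON object whose "type" key identifies
    the message kind.
    """
    finals = []
    for raw in raw_messages:
        message = json.loads(raw)
        if message.get("type") != "Turn":
            continue  # Begin, SpeechStarted, Termination, ...
        if not message.get("end_of_turn"):
            continue  # partial transcript; the turn is still in progress
        # speaker_label is only present when speaker_labels is enabled.
        finals.append((message.get("speaker_label"), message["transcript"]))
    return finals
```

Partials are skipped here because only end-of-turn messages carry the finalized text. A voice agent that needs barge-in would handle the partials separately.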
Session Terminated (Server Confirmation)
Server message confirming session termination with session statistics.
Indicates the session has been terminated. Allowed values: Termination.
Duration of the audio in seconds.
Duration of the session in seconds.
LLM Gateway Response
Server message containing an LLM Gateway response for a finalized turn.
Identifies the type of the message. Allowed values: LLMGatewayResponse.
The order of the finalized turn that triggered the LLM Gateway call.
The finalized turn transcript that triggered the LLM Gateway call.
The chat completions response from the LLM Gateway.