{
"type": "<string>",
"id": "<string>",
"expires_at": 123
}

{
"type": "<string>",
"timestamp": 123,
"confidence": 123
}

{
"type": "Turn",
"turn_order": 0,
"turn_is_formatted": true,
"end_of_turn": true,
"transcript": "Hello world.",
"end_of_turn_confidence": 1,
"words": [
{
"text": "Hello",
"start": 0,
"end": 500,
"confidence": 0.99
},
{
"text": "world.",
"start": 500,
"end": 1000,
"confidence": 0.98
}
]
}

{
"type": "SpeakerRevision",
"revisions": [
{
"turn_order": 3,
"speaker_label": "B",
"words": [
{
"text": "Hello",
"speaker": "B",
"start": 1200,
"end": 1450
},
{
"text": "there.",
"speaker": "B",
"start": 1450,
"end": 1780
}
]
},
{
"turn_order": 7,
"speaker_label": "A",
"words": [
{
"text": "Got it.",
"speaker": "A",
"start": 4100,
"end": 4520
}
]
}
]
}

{
"type": "<string>",
"audio_duration_seconds": 123,
"session_duration_seconds": 123
}

{
"type": "LLMGatewayResponse",
"turn_order": 0,
"transcript": "Hello world.",
"data": {
"request_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help?"
},
"finish_reason": "stop"
}
],
"usage": {
"input_tokens": 12,
"output_tokens": 8,
"total_tokens": 20,
"prompt_tokens_details": {},
"completion_tokens_details": {}
},
"request": {},
"response_time": 123456789
}
}

"\\x10\\x00\\x20\\x00\\x30\\x00\\x40\\x00\\x30\\x00\\x20\\x00\\x10\\x00\\x00\\x00\\xf0\\xff\\xe0\\xff\\xd0\\xff\\xc0\\xff"

{
"type": "UpdateConfiguration",
"prompt": "Transcribe product names accurately.",
"keyterms_prompt": [
"AssemblyAI",
"Universal-3"
],
"min_turn_silence": 700,
"max_turn_silence": 1600,
"agent_context": "Sure — what date would you like to book?"
}

{
"type": "ForceEndpoint"
}

{
"type": "<string>"
}

{
"type": "KeepAlive"
}

Universal-3 Pro Streaming
Stream audio and receive real-time transcription results using the Universal-3 Pro Streaming model, the most accurate streaming model for voice agents that demand the highest quality, with advanced prompting capabilities.
Supports: English, Spanish, German, French, Portuguese, and Italian.
To use the EU server for Real-time STT, replace streaming.assemblyai.com with streaming.eu.assemblyai.com.
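The host swap above can be sketched as a small URL builder. This is a minimal sketch, not the official client: the path `/v3/ws` and the query parameter names `speech_model` and `sample_rate` are assumptions for illustration and should be checked against the endpoint reference.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

US_HOST = "streaming.assemblyai.com"
EU_HOST = "streaming.eu.assemblyai.com"


def build_streaming_url(base_url: str, params: dict, use_eu: bool = False) -> str:
    """Return base_url with the EU host swapped in (optional) and params appended."""
    parts = urlsplit(base_url)
    host = parts.netloc
    if use_eu and host == US_HOST:
        host = EU_HOST
    return urlunsplit((parts.scheme, host, parts.path, urlencode(params), ""))


# Assumed path and parameter names, shown only to illustrate the host swap.
url = build_streaming_url(
    "wss://streaming.assemblyai.com/v3/ws",
    {"speech_model": "u3-rt-pro", "sample_rate": 16000},
    use_eu=True,
)
print(url)
```

Keeping the host swap in one helper means the rest of a client does not need to know which region it talks to.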
The speech model to use.
Allowed values: u3-rt-pro
Use your API key for authentication, or generate a temporary token and pass it via the token query parameter.
Your voice agent's spoken text (TTS reply). The model uses this as context for the next user turn, which improves accuracy on short or ambiguous replies and on spelled-out entities like emails or IDs. Set at connection time to seed the model with your agent's opening greeting, and/or update mid-stream via UpdateConfiguration after each agent reply. Each UpdateConfiguration replaces the previously set value. Maximum ~1500 characters per value. Universal-3 Pro Streaming only. See Context Carryover.
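As a rough illustration of the mid-stream update described above, the helper below builds an UpdateConfiguration message carrying the agent's latest reply. The field names come from the example message in this reference; the helper name and the truncation policy (keeping the most recent 1500 characters) are assumptions, since the reference only states an approximate limit.

```python
import json

MAX_AGENT_CONTEXT_CHARS = 1500  # the reference gives an approximate limit (~1500)


def agent_context_update(agent_reply: str) -> str:
    """Build an UpdateConfiguration message that replaces the current agent_context."""
    text = agent_reply.strip()
    if len(text) > MAX_AGENT_CONTEXT_CHARS:
        # Assumption: when the reply is too long, the most recent words matter most.
        text = text[-MAX_AGENT_CONTEXT_CHARS:]
    return json.dumps(
        {"type": "UpdateConfiguration", "agent_context": text},
        ensure_ascii=False,
    )


message = agent_context_update("Sure, what date would you like to book?")
print(message)
```

Because each update replaces the previous value, the client would send one such message after every agent reply rather than accumulating text.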
Encoding of the audio stream.
Allowed values: pcm_s16le, pcm_mulaw
Optional time in seconds of inactivity before the session is terminated (integer, minimum 5, maximum 3600). If not set, no inactivity timeout is applied.
A list of words and phrases to improve recognition accuracy for. Maximum 100 terms. See Keyterms Prompting for more details.
Whether to return language_code and language_confidence in turn messages. Universal-3 Pro Streaming natively code-switches between English, Spanish, German, French, Portuguese, and Italian by default, with no configuration required.
Allowed values: true, false
Latency and accuracy preset that controls the model's turn-detection and partial-emission defaults. max_accuracy favors transcription quality, min_latency favors speed, and balanced trades off between the two. When omitted, the server applies its own preset, which determines the defaults for mode-dependent fields such as interruption_delay, min_turn_silence, vad_threshold, previous_context_n_turns, and continuous_partials. Universal-3 Pro Streaming only.
Allowed values: max_accuracy, min_latency, balanced
Steers transcription toward a specific language on a per-token basis. Pass a single-language code to bias output toward that language. When unset, no steering is applied and the model code-switches natively across its supported languages. This is distinct from language_detection, which only controls whether the detected language is reported. Universal-3 Pro Streaming only. Universal-3 Pro Streaming supports en, es, fr, de, it, and pt. The additional codes (tr, nl, sv, no, da, fi, hi, vi, ar, he, ja, ur, zh) are coming soon with Universal-3-5 Pro Streaming (preview).
Allowed values: en, es, fr, de, it, pt, tr, nl, sv, no, da, fi, hi, vi, ar, he, ja, ur, zh
Enable Voice Focus to isolate the primary voice and suppress background noise before transcription. Set to near-field for close-talking microphones (for example, headsets or phones) or far-field for distant microphones (for example, conference rooms). When unset, Voice Focus is off.
Allowed values: near-field, far-field
Controls how aggressively Voice Focus suppresses background audio, from 0.0 (least) to 1.0 (most). Higher values are more aggressive. Requires voice_focus to be set, otherwise a validation error is returned.
Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. See Configuring Turn Detection for configuration details.
Silence duration in milliseconds before a speculative end-of-turn check. If terminal punctuation is found, the turn ends. Otherwise, a partial is emitted and the turn continues. Clamped to the range 50 to 10000 ms. The default is mode-dependent and set by the server based on the mode preset. See Configuring Turn Detection for configuration details.
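The interaction between the two silence thresholds can be sketched as a small decision function. This is a client-side illustration of the rules as described here, not the server's implementation; the set of terminal punctuation marks and the function names are assumptions.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")  # assumption: the exact set is not documented here


def clamp_min_turn_silence(ms: int) -> int:
    """Clamp min_turn_silence to the documented 50 to 10000 ms range."""
    return max(50, min(10000, ms))


def turn_decision(silence_ms: int, transcript: str,
                  min_turn_silence: int, max_turn_silence: int) -> str:
    """Return 'end_of_turn', 'partial', or 'wait' for the current silence length."""
    if silence_ms >= max_turn_silence:
        # Forced end, regardless of punctuation.
        return "end_of_turn"
    if silence_ms >= clamp_min_turn_silence(min_turn_silence):
        # Speculative check: end only if the text looks finished.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        return "partial"
    return "wait"


print(turn_decision(800, "Hello world.", 700, 1600))   # end_of_turn
print(turn_decision(800, "My number is", 700, 1600))   # partial
print(turn_decision(1700, "My number is", 700, 1600))  # end_of_turn
```

The values 700 and 1600 match the UpdateConfiguration example in this reference.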
A contextual prompt describing what the audio is about, such as its domain, scenario, or full conversation details. The model uses this context to better recognize the vocabulary it makes likely, for example priming medical terminology for a cardiology call. It carries context about your audio, not transcription instructions, so formatting or behavioral commands (such as punctuation rules) are not supported. Maximum ~1500 characters. Universal-3 Pro Streaming only. See Prompting and keyterms for details.
Advanced. Maximum number of prior conversation entries (user transcripts and any agent_context values) carried forward as context for each transcription. Range 0 to 100. Set to 0 to disable automatic context carryover entirely. The default is mode-dependent and set by the server based on the mode preset. Most integrations should leave this unset. Universal-3 Pro Streaming only.
Sample rate of the audio stream in Hz. Accepts any integer from 8000 to 96000.
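When sending audio in fixed-duration chunks, the chunk size in bytes follows from the sample rate and the encoding. The sketch below assumes mono audio (not stated in this reference) and uses a hypothetical helper name; pcm_s16le uses 2 bytes per sample and pcm_mulaw uses 1.

```python
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_mulaw": 1}


def chunk_size_bytes(sample_rate: int, encoding: str, chunk_ms: int) -> int:
    """Bytes needed for chunk_ms milliseconds of mono audio."""
    if not 8000 <= sample_rate <= 96000:
        raise ValueError("sample rate must be between 8000 and 96000 Hz")
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


print(chunk_size_bytes(16000, "pcm_s16le", 50))  # 1600
print(chunk_size_bytes(8000, "pcm_mulaw", 50))   # 400
```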
Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution.
Allowed values: true, false
The maximum number of speakers expected in the audio stream (integer, 1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when speaker_labels is enabled. See Streaming Diarization for more details.
API token for authentication (if using a temporary token).
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection. The default is mode-dependent and set by the server based on the mode preset.
Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When enabled (default), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (at 750ms) is unaffected. Universal-3 Pro Streaming only.
Whether to emit partial transcripts during the turn. When enabled (default), partial transcripts are forwarded as speech is still in progress alongside final turns. When disabled, only final turns (with end_of_turn true) are sent. Defaults to false when redact_pii is enabled, to prevent unredacted partial transcripts from reaching the client; set explicitly to true to override.
How soon the first partial is emitted in milliseconds. Accepts a value from 0 to 1000. Useful for tuning voice agent barge-in responsiveness or allowing earlier partials for early LLM inference. Larger values are more confident on interruptions, smaller values result in faster time to first partial. The server adds a fixed 256 ms on top of this value, so 0 yields an effective delay of 256 ms and 500 yields 756 ms. The default is mode-dependent and set by the server based on the mode preset. Universal-3 Pro Streaming only.
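The arithmetic in the description above is simple enough to capture in a helper. The function name is hypothetical; the 256 ms constant and the 0 to 1000 range come from the text.

```python
SERVER_FIXED_DELAY_MS = 256  # fixed delay the server adds, per the description above


def effective_first_partial_delay(interruption_delay_ms: int) -> int:
    """Effective time to first partial, in milliseconds."""
    if not 0 <= interruption_delay_ms <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay_ms + SERVER_FIXED_DELAY_MS


print(effective_first_partial_delay(0))    # 256
print(effective_first_partial_delay(500))  # 756
```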
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr).
Allowed values: medical-v1
Filter profanity from the transcribed text. Can be true or false. See Profanity Filtering for more details.
Allowed values: true, false
Redact PII from the transcribed text using the Redact PII model. Can be true or false. Only applies to final turns. See PII Redaction for more details.
Allowed values: true, false
The list of PII Redaction policies to enable. Requires redact_pii to be true. See PII redaction for more details.
The replacement logic for detected PII. Can be entity_name or hash. Requires redact_pii to be true. See PII redaction for more details.
Allowed values: entity_name, hash
JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
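"JSON-stringified" here means the configuration object is serialized to a string before it is passed as a single parameter value. A minimal sketch follows; the parameter name `llm_gateway`, the model name, and the message content are placeholders chosen for illustration, not values confirmed by this reference.

```python
import json
from urllib.parse import parse_qs, urlencode

gateway_config = {
    "model": "example-model",  # hypothetical model name
    "messages": [
        {"role": "system", "content": "Summarize each turn in one sentence."}
    ],
    "max_tokens": 100,
}

# Serialize first, then URL-encode so the JSON survives as one query value.
query = urlencode({"llm_gateway": json.dumps(gateway_config)})
print(query)

# Round trip: decoding the query string recovers the original object.
decoded = json.loads(parse_qs(query)["llm_gateway"][0])
print(decoded == gateway_config)
```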
Server message indicating the streaming session has successfully started.
Server message indicating that speech has been detected.
Server message containing a formatted turn-based transcription result.
Server message containing corrected speaker labels for any turns that changed. Emitted as a single message after the client sends Terminate, when speaker_labels is enabled. A session may produce zero or one SpeakerRevision message; if sent, only changed turns are included in the revisions array. See Revised speaker labels for details.
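One way a client might apply a SpeakerRevision message is to keep finalized turns keyed by turn_order and overwrite the labels of the turns listed in revisions. The storage layout and function name below are assumptions; the message shape follows the example in this reference.

```python
def apply_speaker_revision(turns: dict, message: dict) -> dict:
    """Return a copy of turns with revised speaker labels and words applied."""
    updated = dict(turns)
    for revision in message.get("revisions", []):
        order = revision["turn_order"]
        if order in updated:
            updated[order] = {
                **updated[order],
                "speaker_label": revision["speaker_label"],
                "words": revision["words"],
            }
    return updated


stored = {
    3: {"transcript": "Hello there.", "speaker_label": "A", "words": []},
    8: {"transcript": "Thanks.", "speaker_label": "A", "words": []},
}
revision_message = {
    "type": "SpeakerRevision",
    "revisions": [
        {
            "turn_order": 3,
            "speaker_label": "B",
            "words": [
                {"text": "Hello", "speaker": "B", "start": 1200, "end": 1450},
                {"text": "there.", "speaker": "B", "start": 1450, "end": 1780},
            ],
        }
    ],
}
revised = apply_speaker_revision(stored, revision_message)
print(revised[3]["speaker_label"], revised[8]["speaker_label"])
```

Turns that are absent from revisions keep their original labels, which matches the statement that only changed turns are included.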
Server message confirming session termination with session statistics.
Server message containing an LLM Gateway response for a finalized turn.
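A client typically routes incoming server messages on their type field. The sketch below handles only the two types whose names appear in the examples of this reference (Turn and LLMGatewayResponse) and ignores everything else; the function name and the output format are illustrative.

```python
import json
from typing import Optional


def handle_server_message(raw: str) -> Optional[str]:
    """Return a printable line for final turns and gateway replies, else None."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "Turn" and message.get("end_of_turn"):
        return "user: " + message["transcript"]
    if kind == "LLMGatewayResponse":
        choice = message["data"]["choices"][0]
        return "assistant: " + choice["message"]["content"]
    return None


turn = json.dumps({"type": "Turn", "turn_order": 0, "end_of_turn": True,
                   "transcript": "Hello world."})
reply = json.dumps({"type": "LLMGatewayResponse", "turn_order": 0,
                    "data": {"choices": [{"index": 0, "message": {
                        "role": "assistant", "content": "Hello! How can I help?"}}]}})
print(handle_server_message(turn))
print(handle_server_message(reply))
```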
Client sends audio data as raw binary.
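To make the binary example in this reference concrete, the snippet below decodes the same 24 bytes as pcm_s16le, that is, signed 16-bit little-endian samples. It uses only the standard library; how the bytes are then sent depends on the WebSocket client, which is outside this sketch.

```python
import struct

# The bytes from the audio example, written as hex pairs (12 samples, 24 bytes).
chunk = bytes.fromhex(
    "1000 2000 3000 4000 3000 2000 1000 0000 f0ff e0ff d0ff c0ff"
)

# "<" selects little-endian, "h" is a signed 16-bit integer.
samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
print(samples)

# Packing the samples again reproduces the original bytes.
repacked = struct.pack(f"<{len(samples)}h", *samples)
print(repacked == chunk)
```

The decoded values rise to 64, fall back through 0, and continue to -64, which is why the second half of the example ends in `ff` bytes.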
Client message to update streaming configuration parameters during an active session.
Client message to manually force an endpoint in the transcription.
Client message to gracefully terminate the streaming session.
Client message to reset the inactivity timeout timer. This is not necessary by default — sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. This message is only needed if you have set inactivity_timeout and want to keep the session open during periods where no audio is being sent.
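A client that sets inactivity_timeout might compute how often to send KeepAlive while no audio is flowing. The helper below is a sketch: sending at half the timeout is an assumed safety margin, not a documented recommendation, and the function name is hypothetical.

```python
import json
from typing import Optional

KEEP_ALIVE_MESSAGE = json.dumps({"type": "KeepAlive"})


def keepalive_interval(inactivity_timeout: Optional[int]) -> Optional[float]:
    """Seconds between KeepAlive messages, or None when none are needed."""
    if inactivity_timeout is None:
        # No inactivity timeout configured, so KeepAlive is unnecessary.
        return None
    if not 5 <= inactivity_timeout <= 3600:
        raise ValueError("inactivity_timeout must be between 5 and 3600 seconds")
    return inactivity_timeout / 2  # assumption: half the timeout as a margin


print(keepalive_interval(None))  # None
print(keepalive_interval(30))    # 15.0
print(KEEP_ALIVE_MESSAGE)
```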