Universal-3 Pro Streaming
Set up and configure Universal-3 Pro Streaming for real-time streaming transcription.
Universal-3 Pro Streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.
This model is fantastic for voice agents, agent assist, and all streaming use cases that don’t require partial transcriptions for every single subword — an early partial is emitted after 750ms of continuous speech, followed by silence-based partials as the speaker pauses (see Partials behavior for details). Universal-3 Pro Streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names — all with sub-300ms time to complete transcript latency.
Already using AssemblyAI streaming?
If you’re an existing AssemblyAI streaming user, you can quickly test
Universal-3 Pro Streaming by switching the speech_model parameter to
"u3-rt-pro" in your connection parameters. No other code changes are
required — just update the model and start streaming.
Streaming is billed per session
Universal-3 Pro Streaming is billed on the total duration that your WebSocket connection stays open, not on the amount of audio you send. Always send a Terminate message when you’re done with a stream — sessions that aren’t closed auto-close after 3 hours and are billed for the full duration. See Billing and pricing for details.
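Because billing follows connection time, it helps to make the Terminate message impossible to skip. The sketch below is one way to do that with a context manager. The `{"type": "Terminate"}` message shape and the `send` interface are assumptions for illustration, and `FakeSocket` stands in for a real WebSocket client.

```python
import json
from contextlib import contextmanager

# Assumed message shape; confirm against the API reference.
TERMINATE_MESSAGE = json.dumps({"type": "Terminate"})


@contextmanager
def streaming_session(ws):
    """Yield `ws`, then always send Terminate, even if the body raises.

    `ws` is any object with a `send(str)` method (hypothetical interface).
    """
    try:
        yield ws
    finally:
        ws.send(TERMINATE_MESSAGE)


class FakeSocket:
    """Stand-in for a real WebSocket client; records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


sock = FakeSocket()
try:
    with streaming_session(sock) as ws:
        ws.send("audio-chunk")  # placeholder for binary audio frames
        raise RuntimeError("simulated crash mid-stream")
except RuntimeError:
    pass

print(sock.sent[-1])  # the Terminate message is still sent after the crash
```

In a real client you would also close the socket after sending Terminate.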
Quickstart
Get started with Universal-3 Pro Streaming using the code below. This example streams audio from your microphone and prints transcription results in real time — no custom prompt is needed, since Universal-3 Pro automatically applies a default prompt optimized for turn detection.
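As a minimal sketch of the connection setup, the snippet below builds a connection URL that selects the model and formats incoming Turn events for printing. The endpoint URL, the `sample_rate` parameter, and the `transcript` field name are assumptions here; only `speech_model`, `"u3-rt-pro"`, and `end_of_turn` come from this page. Microphone capture and the WebSocket client itself are left out.

```python
from urllib.parse import urlencode

# Assumption: streaming WebSocket endpoint; check the API reference.
STREAMING_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(sample_rate=16000, **extra_params):
    """Return a connection URL that selects Universal-3 Pro Streaming."""
    params = {"speech_model": "u3-rt-pro", "sample_rate": sample_rate}
    params.update(extra_params)
    return f"{STREAMING_ENDPOINT}?{urlencode(params)}"


def describe_turn(event):
    """Format a Turn event for printing; return None for other messages."""
    if event.get("type") != "Turn":
        return None
    label = "final" if event.get("end_of_turn") else "partial"
    return f"[{label}] {event.get('transcript', '')}"


url = build_streaming_url()
print(url)
```

No `prompt` parameter is passed, which matches the default-prompt behavior described above.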
Prompting
Universal-3 Pro supports custom prompts and keyterms prompting to improve transcription accuracy for your use case. For detailed guidance on crafting effective prompts, default prompt behavior, and keyterms prompting, see the Prompting Guide (Streaming).
You can also boost recognition of specific terms using the keyterms_prompt parameter. See Keyterms prompting for details.
Configuring turn detection
Universal-3 Pro uses a punctuation-based turn detection system controlled by two parameters, min_turn_silence and max_turn_silence.
When silence reaches min_turn_silence, the model transcribes the audio and checks for terminal punctuation (. ? !):
- Terminal punctuation found: the turn ends and is emitted as a final transcript (end_of_turn: true).
- No terminal punctuation: a partial transcript is emitted (end_of_turn: false) and the turn continues waiting.
  - If silence continues to max_turn_silence, the turn is forced to end as a final transcript (end_of_turn: true) regardless of punctuation.
This differs from Universal-Streaming English and Multilingual, which use a confidence-based end-of-turn system controlled by end_of_turn_confidence_threshold.
Instead, Universal-3 Pro makes turn decisions based on ending punctuation after min_turn_silence has elapsed. Because of this, end_of_turn_confidence_threshold has no impact.
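The decision rule described above can be modeled in a few lines. This is a simplified client-side illustration of the documented behavior, not the server's implementation; the function name and the threshold values in the examples are hypothetical.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(silence_ms, transcript, min_turn_silence, max_turn_silence):
    """Model the punctuation-based end-of-turn rule.

    Returns "wait", "partial" (end_of_turn: false), or "final" (end_of_turn: true).
    """
    if silence_ms < min_turn_silence:
        return "wait"  # not enough silence to check yet
    if silence_ms >= max_turn_silence:
        return "final"  # forced end, regardless of punctuation
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "final"  # terminal punctuation ends the turn
    return "partial"  # emit a partial and keep waiting


# Illustrative thresholds in milliseconds (not documented defaults).
print(turn_decision(50, "My number is", 100, 1000))
print(turn_decision(150, "My number is", 100, 1000))
print(turn_decision(150, "That's all.", 100, 1000))
print(turn_decision(1000, "My number is", 100, 1000))
```

The model leaves out the detail that each period of silence produces at most one partial.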
end_of_turn and turn_is_formatted
Because formatting is built into the end-of-turn system in Universal-3 Pro
streaming, there is only ever one end-of-turn transcript per turn and it is
always formatted. This means end_of_turn and turn_is_formatted always have
the same value for Universal-3 Pro streaming. You can reliably use
end_of_turn: true to detect a formatted, final end-of-turn transcript.
For example, to configure both parameters:
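A minimal sketch, assuming both values are passed as query parameters on the connection URL and are expressed in milliseconds. The `max_turn_silence` value is illustrative, not a documented default.

```python
from urllib.parse import urlencode

turn_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,   # value this page recommends for eager LLM inference
    "max_turn_silence": 1000,  # illustrative value, not a documented default
}

# Append this query string to the streaming WebSocket URL.
query_string = urlencode(turn_params)
print(query_string)
```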
Partials behavior
Partials are Turn events where end_of_turn is false. They are produced in three ways:
- Early partial: emitted after 750ms of continuous speech by default, providing a fast transcript signal for barge-in and speculative inference without waiting for the speaker to pause. You can tune this timing with the interruption_delay parameter (see Tuning early partial timing below). If the first attempt returns empty, it retries at 1500ms, 2250ms, and so on. Only one early partial is emitted per turn, but additional partials can be produced when the speaker pauses.
- Silence-based partials: produced whenever min_turn_silence is met but the ending punctuation doesn't signal the end of a turn. Each period of silence can produce at most one partial.
- Continuous partials: emitted approximately every 3 seconds while speech continues, regardless of silence. Each continuous partial covers the full transcript for the current turn so far. Enable with the continuous_partials connection parameter.
There can be multiple partial transcripts per turn. If silence exceeds min_turn_silence but speech resumes before max_turn_silence, the partial is emitted and the end-of-turn (EOT) check resets until the next period of silence.
If you’re running eager LLM inference on partial transcripts, we recommend setting min_turn_silence to 100.
Entity splitting (accuracy) vs Model Latency trade-off
Setting min_turn_silence too low can split entities like phone numbers and
emails. We have found LLM steps fix this for voice agents, but we recommend
testing carefully with your use case.
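One way to act on partials is to route them separately from final transcripts, so that speculative work starts early and is only committed when the turn ends. The sketch below assumes Turn events arrive as dictionaries with `type`, `transcript`, and `end_of_turn` keys; the `transcript` key and the callback names are assumptions.

```python
def route_turn_event(event, on_partial, on_final):
    """Send partials to `on_partial` and end-of-turn transcripts to `on_final`."""
    if event.get("type") != "Turn":
        return  # ignore session messages that are not Turn events
    text = event.get("transcript", "")
    if event.get("end_of_turn"):
        on_final(text)
    else:
        on_partial(text)


speculative, committed = [], []

# Made-up events, for illustration only.
events = [
    {"type": "Turn", "transcript": "I'd like to", "end_of_turn": False},
    {"type": "Turn", "transcript": "I'd like to book a flight.", "end_of_turn": True},
]
for event in events:
    route_turn_event(event, speculative.append, committed.append)

print(speculative, committed)
```

In a voice agent, `on_partial` might start an LLM request that is discarded if the final transcript differs.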
Continuous partials
For long, uninterrupted turns — such as a caller reading out a credit card number or giving a detailed explanation — silence-based partials may not fire often enough for your downstream consumers (LLMs, UI, eager inference) to keep up. Enable continuous_partials to receive a steady stream of non-final transcripts every ~3 seconds while speech continues.
The first partial is still emitted at 750ms (or your configured interruption_delay). Continuous partials are non-final (end_of_turn: false) and each one covers the full transcript for the current turn so far. The final transcript is emitted as normal when the turn ends.
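Since each continuous partial covers the whole turn so far, a consumer should replace its view of the current turn rather than append to it. The sketch below shows that pattern, assuming every Turn event carries the full text of the current turn in a `transcript` field. The class name and the event contents are hypothetical.

```python
class TurnView:
    """Keeps completed turns plus the latest text of the turn in progress."""

    def __init__(self):
        self.completed = []
        self.current = ""

    def apply(self, event):
        if event.get("type") != "Turn":
            return
        # Replace, don't append: the event already holds the full turn so far.
        self.current = event.get("transcript", "")
        if event.get("end_of_turn"):
            self.completed.append(self.current)
            self.current = ""

    def render(self):
        parts = self.completed + ([self.current] if self.current else [])
        return " ".join(parts)


view = TurnView()
# Made-up events, for illustration only.
view.apply({"type": "Turn", "transcript": "my card number is", "end_of_turn": False})
view.apply({"type": "Turn", "transcript": "my card number is four one", "end_of_turn": False})
print(view.render())
view.apply({"type": "Turn", "transcript": "My card number is 41.", "end_of_turn": True})
print(view.render())
```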
Tuning early partial timing
The interruption_delay parameter controls how soon the first partial transcript is emitted during a turn, directly affecting your time to first token (TTFT). This is the primary lever for tuning barge-in responsiveness and speculative LLM inference timing.
The server adds a minimum turn duration of 300ms on top of your configured value, so the effective timing is:
- interruption_delay: 0 → ~300ms effective (fastest possible first partial)
- interruption_delay: 500 → ~800ms effective (default)
- interruption_delay: 1000 → ~1300ms effective (most confident, slowest TTFT)
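The arithmetic above can be sketched as a small helper. The function and constant names are hypothetical; the 300ms offset is the minimum turn duration stated in this section.

```python
MIN_TURN_DURATION_MS = 300  # server-side minimum turn duration, per this section


def effective_first_partial_ms(interruption_delay):
    """Approximate time until the first partial, in milliseconds."""
    if interruption_delay < 0:
        raise ValueError("interruption_delay must be non-negative")
    return interruption_delay + MIN_TURN_DURATION_MS


for delay in (0, 500, 1000):
    print(delay, effective_first_partial_ms(delay))
```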
You can also update interruption_delay mid-session via UpdateConfiguration — for example, lower it when the agent is speaking (for faster barge-in) and raise it when waiting for a user response:
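A minimal sketch of that pattern, assuming UpdateConfiguration is sent as a JSON message with a `type` field alongside the updated values. The message shape, the function name, and the two delay values are assumptions for illustration.

```python
import json


def interruption_delay_update(agent_is_speaking, fast_ms=0, relaxed_ms=500):
    """Build an UpdateConfiguration message for the current agent state."""
    # Lower delay while the agent talks (faster barge-in), higher while listening.
    delay = fast_ms if agent_is_speaking else relaxed_ms
    return json.dumps({"type": "UpdateConfiguration", "interruption_delay": delay})


print(interruption_delay_update(agent_is_speaking=True))
print(interruption_delay_update(agent_is_speaking=False))
```

Send the returned string over the open WebSocket whenever the agent starts or stops speaking.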
When to adjust interruption_delay:
- Lower values (0–200ms) — Use when TTFT is critical and you want the earliest possible signal for speculative LLM inference or barge-in detection. The first partial may be less complete since less audio has been buffered.
- Default (500ms) — Balanced for most voice agent use cases. The first partial arrives with enough audio context to be useful without excessive delay.
- Higher values (500–1000ms) — Use when you prefer fewer, more confident partials and don’t need aggressive barge-in responsiveness. Reduces unnecessary early partials in scenarios where users tend to speak in longer turns.
See the UpdateConfiguration examples above for dynamic mid-session adjustment.
Formatting and turn detection
Because the model applies punctuation and formatting intelligently, its output works well with formatting-based turn detection. For example, the same word can be punctuated three ways based purely on vocal tone:
- "Pizza." (statement)
- "Pizza?" (questioning tone)
- "Pizza---" (trailing off)
The punctuation quality has been excellent when paired with custom turn detection models.
From testing, mid-turn emission looks like this — where each line is an additional partial leading up to the final end-of-turn transcript:
Each partial is emitted during a silence period within the turn. The final line with terminal punctuation triggers the end of turn.
Forcing a turn endpoint
You can force the current turn to end immediately by sending a ForceEndpoint message:
This is useful when your application knows the user has finished speaking based on external signals (e.g., a button press).
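As a sketch of the button-press case, the handler below sends a ForceEndpoint message through whatever `send` callable your WebSocket client provides. The `{"type": "ForceEndpoint"}` message shape and the handler name are assumptions.

```python
import json


def force_endpoint_message():
    """Build the message that ends the current turn immediately."""
    return json.dumps({"type": "ForceEndpoint"})


def on_button_release(send):
    """Hypothetical push-to-talk handler: the user let go, so end the turn."""
    send(force_endpoint_message())


outbox = []  # stands in for a real WebSocket send function
on_button_release(outbox.append)
print(outbox)
```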
Specifying the transcription language
Universal-3 Pro Streaming does not support the language_code connection parameter — it is silently ignored. The language_detection parameter only controls whether language metadata (such as language_code and language_confidence) is returned on Turn events; it does not affect which language the model transcribes.
To guide the transcription language, use the prompt parameter as described below.
Providing language information ahead of time in the prompt helps the model with transcription tasks. For example, if the model is told to transcribe Spanish, audio could be transcribed “si”, but if told English, it could be transcribed “C”.
Although prompting is a beta feature, we've found good results when you build off of the default prompt. That is the approach used here: language information is added by prepending Transcribe <language>. to the default prompt, which we have seen improve the output.
Our team is running evaluations to determine the best method for attaching this context to the prompt, and we will update this section with the best methods.
If you have multiple languages, list all of them, for example: Transcribe multilingual conversation in English, Spanish, and German.
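The prompt construction can be sketched as below. The helper names are hypothetical, and `DEFAULT_PROMPT` is a placeholder: the actual default prompt text is in the Prompting Guide and is not reproduced here.

```python
# Placeholder only; substitute the real default prompt from the Prompting Guide.
DEFAULT_PROMPT = "<default prompt>"


def language_prefix(languages):
    """Return the 'Transcribe ...' sentence for one or more languages."""
    if not languages:
        raise ValueError("at least one language is required")
    if len(languages) == 1:
        return f"Transcribe {languages[0]}."
    if len(languages) == 2:
        joined = " and ".join(languages)
    else:
        joined = ", ".join(languages[:-1]) + f", and {languages[-1]}"
    return f"Transcribe multilingual conversation in {joined}."


def build_prompt(languages, default_prompt=DEFAULT_PROMPT):
    """Prepend the language sentence to the default prompt."""
    return f"{language_prefix(languages)} {default_prompt}"


print(build_prompt(["Spanish"]))
print(build_prompt(["English", "Spanish", "German"]))
```

Pass the result as the `prompt` connection parameter.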
Supported languages and regional dialects
Universal-3 Pro Streaming supports 6 languages with out-of-the-box recognition of regional dialects and local speech variants. See the Supported languages page for the full language list and dialect reference.
Updating configuration mid-stream
You can update configuration during an active streaming session using UpdateConfiguration. This applies changes without needing to reconnect. The recommended approach is to dynamically update keyterms_prompt based on the current stage of your voice agent flow — if you expect certain answers or terminology at a specific stage, proactively add those as keyterms so the model recognizes them accurately.
For example, if your voice agent is currently asking for the caller’s name and date of birth, send the expected terms for that stage:
Then, when the conversation moves to a different stage (e.g., medical intake), update with the relevant terms:
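A minimal sketch of stage-based keyterm updates. The stage names and terms are made up for illustration, and the message shape (a JSON object with `type` and a list of strings under `keyterms_prompt`) is an assumption.

```python
import json

# Hypothetical stages and terms; replace with what your agent expects to hear.
STAGE_KEYTERMS = {
    "identity": ["first name", "last name", "date of birth"],
    "medical_intake": ["allergies", "medications", "blood pressure"],
}


def keyterms_update(stage):
    """Build an UpdateConfiguration message with the keyterms for a stage."""
    return json.dumps(
        {"type": "UpdateConfiguration", "keyterms_prompt": STAGE_KEYTERMS[stage]}
    )


print(keyterms_update("identity"))
print(keyterms_update("medical_intake"))
```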
You can also update prompt, max_turn_silence, min_turn_silence, interruption_delay, or any combination at the same time:
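A combined update might be built like this. The field names come from this page; the message shape, the helper name, and the values are assumptions. The check against a known set of fields is a client-side convenience, not part of the API.

```python
import json

# Fields this page describes as updatable mid-stream.
UPDATABLE_FIELDS = {
    "prompt",
    "keyterms_prompt",
    "max_turn_silence",
    "min_turn_silence",
    "interruption_delay",
    "continuous_partials",
}


def update_configuration(**fields):
    """Build one UpdateConfiguration message carrying several changes."""
    if not fields:
        raise ValueError("provide at least one field to update")
    unknown = set(fields) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not updatable mid-stream: {sorted(unknown)}")
    return json.dumps({"type": "UpdateConfiguration", **fields})


# Illustrative values: allow longer pauses while a caller reads out a card number.
print(update_configuration(max_turn_silence=3000, min_turn_silence=400))
```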
Common reasons to update configuration mid-stream:
- keyterms_prompt: Dynamically add terms relevant to the current stage of your voice agent flow. This is the most effective way to improve recognition accuracy mid-stream. See Keyterms prompting for details.
- prompt: Pass updated behavioral or formatting instructions into the STT stream.
- max_turn_silence: Increase for moments where you'd expect a longer pause, such as when a caller is reading out a credit card number, ID number, or address. Decrease it again afterward to resume snappier turn detection.
- min_turn_silence: Tune how quickly speculative EOT checks fire. Lower values produce faster partials for eager LLM inference, while higher values reduce entity splitting for utterances with numbers or proper nouns.
- interruption_delay: Tune how quickly the first partial is emitted. Lower values (e.g. 0) produce faster TTFT for aggressive barge-in detection; higher values (e.g. 500–1000) produce more confident first partials. See Tuning early partial timing for details.
- continuous_partials: Toggle steady-cadence partial emission on or off mid-session. Useful when switching between interaction modes where you need more frequent feedback for some turns but not others.
Keep alive
KeepAlive messages are not required. By default, sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached.
KeepAlive is only relevant if you have configured the inactivity_timeout connection parameter, which closes the session after a period of no audio or messages being sent. If you are using inactivity_timeout and want to keep the session open during periods where no audio is being sent, send a KeepAlive message to reset the inactivity timer:
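A minimal sketch, assuming the message is a JSON object of the form `{"type": "KeepAlive"}` and that inactivity_timeout is expressed in seconds. The helper names and the safety margin are hypothetical.

```python
import json


def keep_alive_message():
    """Build the message that resets the inactivity timer."""
    return json.dumps({"type": "KeepAlive"})


def should_send_keep_alive(idle_seconds, inactivity_timeout, safety_margin=5):
    """Decide whether a KeepAlive is due.

    `inactivity_timeout` is None when the parameter was not configured,
    in which case KeepAlive messages are never needed.
    """
    if inactivity_timeout is None:
        return False
    return idle_seconds >= max(inactivity_timeout - safety_margin, 0)


print(should_send_keep_alive(idle_seconds=10, inactivity_timeout=None))
print(should_send_keep_alive(idle_seconds=10, inactivity_timeout=60))
print(should_send_keep_alive(idle_seconds=56, inactivity_timeout=60))
```

Call the check from a periodic timer while no audio is being sent, and send `keep_alive_message()` when it returns True.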