Universal-3 Pro Streaming Partial Transcripts and Turn Detection
Traditional streaming models (like Universal-Streaming-English and Universal-Streaming-Multilingual) emit partials word-by-word as audio is processed. Each word can be revised until it is marked final, after which it is immutable.
Universal-3 Pro takes a different approach: an early partial is emitted after 750ms of continuous speech, followed by silence-based partials as the speaker pauses. For long, uninterrupted turns, you can also opt in to continuous_partials for a steady stream of mid-turn partials regardless of silence. Each partial is a stable, fully transcribed segment rather than an incremental word-by-word update. All words in partials are marked word_is_final: false.
While the segments are stable, the final end-of-turn transcript may differ from earlier partials as the model refines its output with full turn context. On the final end-of-turn transcript, all words are marked word_is_final: true.
U3 Pro emits partials in three ways:
When a speaker is talking continuously without pausing, an early partial is emitted after 750ms of continuous speech by default. This provides a transcript signal for barge-in and speculative inference without waiting for the speaker to pause. If the first attempt returns empty, it retries at 1500ms, 2250ms, and so on until text is produced. Only one early partial is emitted per turn, but additional partials can be produced when the speaker pauses.
You can tune the early partial timing with the interruption_delay connection parameter (range: 0–1000ms, default: 500ms). The server adds a minimum of 250ms on top, so interruption_delay: 0 produces the first partial at ~250ms and interruption_delay: 500 (default) produces it at ~750ms. Lower values give faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. See Tuning early partial timing for full configuration details.
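As a minimal sketch of setting this parameter, the helper below validates interruption_delay against the documented 0–1000ms range and encodes it as a query string. It assumes connection parameters are passed as URL query parameters on the streaming WebSocket URL; the function name connection_query is hypothetical and not part of any SDK.

```python
from urllib.parse import urlencode

# Range stated in the text above (milliseconds).
INTERRUPTION_DELAY_RANGE_MS = (0, 1000)


def connection_query(interruption_delay_ms: int, **other_params) -> str:
    """Return a query string carrying interruption_delay plus any other parameters.

    Hypothetical helper: assumes connection parameters travel as URL query
    parameters on the streaming WebSocket URL.
    """
    low, high = INTERRUPTION_DELAY_RANGE_MS
    if not low <= interruption_delay_ms <= high:
        raise ValueError(f"interruption_delay must be within {low}-{high} ms")
    params = {"interruption_delay": interruption_delay_ms, **other_params}
    return urlencode(params)


# Lowest allowed value: fastest first partial, least confident.
print(connection_query(0))
```

Append the returned string to the streaming URL you already use. Rejecting out-of-range values on the client surfaces configuration mistakes before the connection is opened.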
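U3 Pro uses a punctuation-based turn detection system. When the speaker pauses, the model transcribes the buffered audio and checks for terminal punctuation (. ? !):
If no terminal punctuation is found, a partial is emitted (end_of_turn: false) and the turn keeps waiting until speech continues or max_turn_silence is reached.
If terminal punctuation is found, the turn ends (end_of_turn: true).
This is controlled by two parameters: min_turn_silence, the silence required before the model checks for an end of turn, and max_turn_silence, the silence after which the turn is forced to end even without terminal punctuation.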
Each period of silence produces at most one partial, so a speaker who pauses, resumes, and pauses again can trigger a new partial at each pause.
See Configuring turn detection for full turn detection parameter and configuration details.
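The decision described in this section can be modeled roughly as follows. This is a simplified client-side illustration of the documented behavior, not the server's implementation; the function name turn_decision and the 1000ms max_turn_silence used in the example are assumptions chosen for illustration.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(transcript: str, silence_ms: int,
                  min_turn_silence: int, max_turn_silence: int) -> str:
    """Classify a pause as 'wait', 'partial', or 'end_of_turn'.

    Simplified model of the punctuation-based rule described above.
    """
    if silence_ms < min_turn_silence:
        return "wait"  # pause is too short to check for an end of turn
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "end_of_turn"  # terminal punctuation ends the turn
    if silence_ms >= max_turn_silence:
        return "end_of_turn"  # forced end of turn without punctuation
    return "partial"  # emitted with end_of_turn: false; the turn keeps waiting


# Illustrative values: min_turn_silence=100, max_turn_silence=1000.
print(turn_decision("My number is four one five", 150, 100, 1000))
print(turn_decision("What time do you open?", 150, 100, 1000))
```

The first call models a mid-sentence pause, which yields a partial; the second models a completed question, which ends the turn.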
For long, uninterrupted turns, such as a caller reading out a credit card number, address, or giving a detailed explanation, silence-based partials may not fire often enough for your downstream consumers (LLMs, UI, eager inference) to keep up. Enable the continuous_partials connection parameter to receive a steady stream of non-final transcripts approximately every 3 seconds while speech continues, regardless of silence.
Each continuous partial is non-final (end_of_turn: false) and covers the full transcript for the current turn so far. The first early partial at 750ms is unaffected, and the final end-of-turn transcript is emitted as normal once the turn ends.
You can also toggle continuous_partials on or off mid-session via UpdateConfiguration:
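A minimal sketch of such a message is shown here. The JSON shape (a type field set to UpdateConfiguration alongside the parameter being changed) is an assumption; confirm it against the API reference before relying on it. The helper name is hypothetical.

```python
import json


def continuous_partials_update(enabled: bool) -> str:
    """Serialize an UpdateConfiguration message toggling continuous_partials.

    The message shape is assumed, not taken from the API reference.
    """
    return json.dumps({"type": "UpdateConfiguration", "continuous_partials": enabled})


# Send the returned string as a text frame on your open streaming WebSocket.
print(continuous_partials_update(True))
```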
See Continuous partials on the Universal-3 Pro Streaming page for the full connection parameter and setup details.
This is an example of what partials might look like in a voice agent scenario where a user is reading out a credit card number:
When receiving a Universal-3 Pro Turn event, use end_of_turn to determine the transcript’s finality. If end_of_turn is false, the transcript is a partial and may still be refined. If end_of_turn is true, the transcript is the final result for the turn.
This preserves the speculative generation pattern you may already be using with word-by-word transcripts, but provides more stable and accurate segments while still giving your LLM early signals to start preparing a response.
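One way to structure this pattern is sketched below. It assumes each Turn event is a dict with transcript and end_of_turn keys; the class name, the log strings, and the sample events (including the card digits) are invented for illustration. Real code would start or cancel an LLM request where this sketch appends to a log.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpeculativeTurnHandler:
    """Starts a speculative draft on partials and commits on end_of_turn."""
    draft_for: Optional[str] = None  # transcript the current draft was based on
    committed: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def on_turn(self, event: dict) -> None:
        text = event["transcript"]
        if not event["end_of_turn"]:
            # Partial: (re)start speculation only when the text has changed.
            if text and text != self.draft_for:
                self.draft_for = text
                self.log.append(f"speculate: {text}")
            return
        # Final: reuse the draft only if it was based on the same text.
        action = "reuse draft" if text == self.draft_for else "regenerate"
        self.log.append(f"{action}: {text}")
        self.committed.append(text)
        self.draft_for = None


handler = SpeculativeTurnHandler()
for event in [
    {"transcript": "My card number is", "end_of_turn": False},
    {"transcript": "My card number is 4242 4242", "end_of_turn": False},
    {"transcript": "My card number is 4242 4242 4242 4242.", "end_of_turn": True},
]:
    handler.on_turn(event)
print("\n".join(handler.log))
```

Comparing the final transcript with the text the draft was built from keeps a stale speculative response from being sent when the final transcript differs from the last partial.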
Traditional streaming models emit a partial on every audio frame, frequently revising previous words.
U3 Pro emits an early partial after 750ms of continuous speech, then additional partials during silence periods. Each one is processed by a full speech LLM rather than a lightweight RNN-T. This means fewer partials, but ones that are significantly more accurate.
Each partial contains the full cumulative transcription of the turn so far. Earlier words may be refined as more context becomes available, but updates only happen during silence (not on every frame), so the transcript is typically far more stable than traditional streaming models.
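Because each partial is cumulative, a consumer should replace the in-progress text rather than append to it. The sketch below shows one way to keep a display in sync; the TranscriptView class and the sample sentences are hypothetical.

```python
from typing import List


class TranscriptView:
    """Keeps completed turns plus the latest partial for the current turn."""

    def __init__(self) -> None:
        self.completed: List[str] = []
        self.current = ""

    def apply(self, transcript: str, end_of_turn: bool) -> None:
        if end_of_turn:
            self.completed.append(transcript)  # final text for the turn
            self.current = ""
        else:
            self.current = transcript  # cumulative partial: replace, never append

    def text(self) -> str:
        return " ".join(part for part in [*self.completed, self.current] if part)


view = TranscriptView()
view.apply("I need to", False)
view.apply("I need to change my", False)
print(view.text())
```

Appending each partial instead would duplicate the earlier words, since every partial already contains the whole turn so far.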
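Speculative inference based on noisy partials can be counterproductive. The final word of a turn often carries critical semantic weight: for example, "I'd like to cancel my order" and "I'd like to cancel my appointment" differ only in the last word but call for different responses.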
Getting a high-accuracy segment with a slight delay is often more valuable than getting a lower-accuracy partial a few milliseconds earlier.
After silence detection:
This makes U3 Pro competitive with, or faster than, many traditional streaming-partial pipelines. The speech-end to transcript-available window remains very fast.
Setting min_turn_silence too low can split entities like phone numbers and emails for speakers with slow speech patterns. The accuracy is often still high enough for LLMs to piece together the broken entities, but we recommend testing carefully with your use case.
Setting max_turn_silence too low can have the same effect, but entity splitting is less likely: max_turn_silence is typically larger than min_turn_silence, and a forced end-of-turn only triggers when terminal punctuation is not detected. If your audio contains very long (>1s) pauses and you’d like to keep these utterances as a single turn, consider increasing max_turn_silence to avoid cutting off the turn too early.
For eager LLM inference on partials, we recommend setting min_turn_silence to 100 (default value).
You can also adjust min_turn_silence (and potentially max_turn_silence for very long pauses) for specific moments mid-stream via UpdateConfiguration. For example, increase it when a caller is about to read out a credit card, ID number, or email, and you’d prefer to wait for a longer silence before checking for an end of turn and potentially emitting a partial.
Then reset it after the user responds:
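The raise-then-reset sequence might look like the sketch below. The message shape is an assumption to verify against the API reference, the 1000ms value is an arbitrary illustration, and 100ms is the default stated above.

```python
import json

DEFAULT_MIN_TURN_SILENCE_MS = 100  # default value stated in the text


def min_turn_silence_update(value_ms: int) -> str:
    """Serialize an UpdateConfiguration message setting min_turn_silence.

    The message shape is assumed, not taken from the API reference.
    """
    return json.dumps({"type": "UpdateConfiguration", "min_turn_silence": value_ms})


# Before the caller reads out a long entity: wait for a longer silence.
raise_msg = min_turn_silence_update(1000)  # illustrative value
# After the caller has responded: return to the default.
reset_msg = min_turn_silence_update(DEFAULT_MIN_TURN_SILENCE_MS)

print(raise_msg)
print(reset_msg)
```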
A clean way to implement this is by giving your LLM a tool call:
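A possible shape for such a tool is sketched here, using a generic JSON-schema style definition that most function-calling LLM APIs accept in some form. The tool name set_entity_mode, its single argument, the silence values, and the UpdateConfiguration message shape are all assumptions for illustration.

```python
import json

# Hypothetical tool definition to register with your LLM provider.
SET_ENTITY_MODE_TOOL = {
    "name": "set_entity_mode",
    "description": (
        "Enable before asking the caller for a card number, ID number, or "
        "email; disable once they have answered."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "enabled": {"type": "boolean", "description": "Wait for longer pauses."}
        },
        "required": ["enabled"],
    },
}


def handle_tool_call(name: str, arguments: dict,
                     entity_silence_ms: int = 1000,
                     default_silence_ms: int = 100) -> str:
    """Translate the LLM's tool call into an UpdateConfiguration message."""
    if name != SET_ENTITY_MODE_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    value = entity_silence_ms if arguments["enabled"] else default_silence_ms
    return json.dumps({"type": "UpdateConfiguration", "min_turn_silence": value})


print(handle_tool_call("set_entity_mode", {"enabled": True}))
```

Letting the LLM decide when to toggle the setting keeps the timing logic next to the conversation logic, since the model already knows when it is about to ask for a long entity.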
See Updating configuration mid-stream for more details.