Universal-3 Pro Streaming Partial Transcripts and Turn Detection

Overview

Traditional streaming models (like Universal-Streaming-English and Universal-Streaming-Multilingual) emit partials word-by-word as audio is processed. Each word can be revised until it is marked final, after which it is immutable.

Universal-3 Pro takes a different approach: an early partial is emitted after 750ms of continuous speech, followed by silence-based partials as the speaker pauses. For long, uninterrupted turns, you can also opt in to continuous_partials for a steady stream of mid-turn partials regardless of silence. Each partial is a stable, fully transcribed segment rather than an incremental word-by-word update. All words in partials are marked word_is_final: false.

While the segments are stable, the final end-of-turn transcript may differ from earlier partials as the model refines its output with full turn context. On the final end-of-turn transcript, all words are marked word_is_final: true.
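As a rough illustration of the word_is_final rule above, the sketch below checks whether a list of words belongs to a final transcript. It assumes each word arrives as a dictionary carrying a word_is_final flag; the surrounding list shape and the `text` key are assumptions made for this example, not something this page specifies.

```python
def all_words_final(words: list[dict]) -> bool:
    """Return True only if there is at least one word and every word is final.

    Assumes each word is a dict with a boolean "word_is_final" key
    (the list-of-dicts shape is an assumption for this sketch).
    """
    return bool(words) and all(w.get("word_is_final", False) for w in words)


# Hypothetical word lists for a partial and for a final transcript.
partial_words = [
    {"text": "I", "word_is_final": False},
    {"text": "want", "word_is_final": False},
]
final_words = [
    {"text": "I", "word_is_final": True},
    {"text": "want", "word_is_final": True},
]

print(all_words_final(partial_words))
print(all_words_final(final_words))
```

In practice end_of_turn is the simpler signal to branch on; a check like this is mostly useful as a sanity assertion in tests.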

Universal-3 Pro partials

U3 Pro emits partials in three ways:

Early partial (during continuous speech)

When a speaker is talking continuously without pausing, an early partial is emitted after 750ms of continuous speech by default. This provides a transcript signal for barge-in and speculative inference without waiting for the speaker to pause. If the first attempt returns empty, it retries at 1500ms, 2250ms, and so on until text is produced. Only one early partial is emitted per turn, but additional partials can be produced when the speaker pauses.

You can tune the early partial timing with the interruption_delay connection parameter (range: 0–1000ms, default: 500ms). The server adds a minimum of 300ms on top, so interruption_delay: 0 produces the first partial at ~300ms and interruption_delay: 500 (default) produces it at ~800ms. Lower values give faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. See Tuning early partial timing for full configuration details.
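The arithmetic in the paragraph above can be sketched as a small helper. This is only an estimate based on the figures quoted here (a 300ms server-side minimum plus interruption_delay); the function name and constant are hypothetical, and actual timing will vary with the audio.

```python
MIN_SERVER_DELAY_MS = 300  # minimum the server adds, per the text above


def estimated_first_partial_ms(interruption_delay_ms: int = 500) -> int:
    """Approximate time to the first early partial, in milliseconds.

    Hypothetical helper: it only reproduces the arithmetic described above
    and rejects values outside the documented 0-1000 ms range.
    """
    if not 0 <= interruption_delay_ms <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return MIN_SERVER_DELAY_MS + interruption_delay_ms


print(estimated_first_partial_ms(0))    # 300
print(estimated_first_partial_ms(500))  # 800
```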

Silence-based partials

U3 Pro uses a punctuation-based turn detection system. When the speaker pauses, the model transcribes the buffered audio and checks for terminal punctuation (. ? !):

No terminal punctuation: a partial is emitted (end_of_turn: false) and the turn stays open until speech resumes or max_turn_silence is reached.
Terminal punctuation found: the turn ends and is emitted as a final transcript (end_of_turn: true).

This is controlled by two parameters:

Parameter	Default	Description
`min_turn_silence`	`100` ms	Silence duration before a speculative end-of-turn (EOT) check fires.
`max_turn_silence`	`1000` ms	Maximum silence before a turn is forced to end.

Each period of silence can produce at most one partial. If the speaker pauses, resumes, and pauses again, each pause can trigger its own partial.

See Configuring turn detection for full turn detection parameter and configuration details.
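To make the decision rule concrete, here is a minimal client-side model of the logic described in this section. It is a sketch for reasoning about parameter choices, not the server's implementation: the function name and return labels are hypothetical, and it ignores the "at most one partial per silence period" rule.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def classify_pause(transcript: str, silence_ms: int,
                   min_turn_silence: int = 100,
                   max_turn_silence: int = 1000) -> str:
    """Return 'wait', 'partial', or 'final' for the current pause.

    Simplified model of the behavior described above (an assumption,
    not the actual server code).
    """
    if silence_ms < min_turn_silence:
        return "wait"      # not enough silence for an end-of-turn check yet
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "final"     # terminal punctuation ends the turn
    if silence_ms >= max_turn_silence:
        return "final"     # forced end of turn after maximum silence
    return "partial"       # emit a partial and keep the turn open


print(classify_pause("Yeah my credit card number is", 150))
print(classify_pause("I want to continue.", 150))
print(classify_pause("One moment", 1000))
```

Raising min_turn_silence in this model widens the "wait" window, which is why higher values are less likely to split an entity across partials.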

Continuous partials

For long, uninterrupted turns, such as a caller reading out a credit card number, address, or giving a detailed explanation, silence-based partials may not fire often enough for your downstream consumers (LLMs, UI, eager inference) to keep up. Enable the continuous_partials connection parameter to receive a steady stream of non-final transcripts approximately every 3 seconds while speech continues, regardless of silence.

Each continuous partial is non-final (end_of_turn: false) and covers the full transcript for the current turn so far. The first early partial at 750ms is unaffected, and the final end-of-turn transcript is emitted as normal once the turn ends.

You can also toggle continuous_partials on or off mid-session via UpdateConfiguration:

{ "type": "UpdateConfiguration", "continuous_partials": true }

See Continuous partials on the Universal-3 Pro Streaming page for the full connection parameter and setup details.
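If you send several UpdateConfiguration messages from Python, a small builder keeps the payloads consistent. The sketch below only produces the JSON string; the helper name is hypothetical, and sending it over your WebSocket connection is left to whichever client library you use.

```python
import json


def build_update_configuration(**params) -> str:
    """Serialize an UpdateConfiguration message with the given parameters.

    Hypothetical helper: it does not validate parameter names or values
    beyond requiring at least one and reserving the "type" key.
    """
    if not params:
        raise ValueError("at least one parameter is required")
    if "type" in params:
        raise ValueError('"type" is reserved for the message type')
    return json.dumps({"type": "UpdateConfiguration", **params})


message = build_update_configuration(continuous_partials=True)
print(message)  # {"type": "UpdateConfiguration", "continuous_partials": true}
```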

Real-world example

This is an example of what partials might look like in a voice agent scenario where a user is reading out a credit card number:

"Yeah my credit card number is..."  [end_of_turn: false] (PARTIAL)
"One moment..."  [end_of_turn: false] (PARTIAL)
"Yeah my credit card number is, one moment, it's (555) 555-5555."  [end_of_turn: true] (FINAL)

Speculative inference

When receiving a Universal-3 Pro Turn event, use end_of_turn to determine the transcript’s finality:

If end_of_turn is false (partial):

Begin speculative (also known as eager or preemptive) LLM inference
Warm TTS or prepare context

If end_of_turn is true (final):

Commit to full LLM + TTS generation

This preserves the speculative generation pattern you may already be using with word-by-word transcripts, but provides more stable and accurate segments while still giving your LLM early signals to start preparing a response.
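The branching above can be sketched as a small message handler. This example assumes the event is a JSON object with a `type` of "Turn" and a `transcript` string alongside end_of_turn; those two field names are assumptions for illustration, and the callbacks stand in for your own speculative and committed LLM/TTS logic.

```python
import json
from typing import Callable


def handle_turn_event(raw_message: str,
                      on_partial: Callable[[str], None],
                      on_final: Callable[[str], None]) -> None:
    """Route a Turn event to speculative or committed handling.

    Assumes "type" and "transcript" fields (illustrative names);
    end_of_turn is the flag described in the text above.
    """
    event = json.loads(raw_message)
    if event.get("type") != "Turn":
        return  # ignore other message types
    transcript = event.get("transcript", "")
    if not transcript:
        return  # nothing to act on yet
    if event.get("end_of_turn"):
        on_final(transcript)    # commit to full LLM + TTS generation
    else:
        on_partial(transcript)  # start speculative inference, warm TTS


# Usage with lists standing in for real callbacks.
speculative, committed = [], []
handle_turn_event(
    '{"type": "Turn", "transcript": "I want to", "end_of_turn": false}',
    speculative.append, committed.append)
handle_turn_event(
    '{"type": "Turn", "transcript": "I want to continue.", "end_of_turn": true}',
    speculative.append, committed.append)
print(speculative, committed)
```

Because the final transcript may differ from earlier partials, a real on_final callback would typically discard any speculative work whose input no longer matches.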

Advantages over traditional streaming partials

Fewer, higher-quality partials

Traditional streaming models emit a partial on every audio frame, frequently revising previous words.

U3 Pro emits an early partial after 750ms of continuous speech, then additional partials during silence periods. Each one is processed by a full speech LLM rather than a lightweight RNN-T. This means fewer partials, but ones that are significantly more accurate.

Each partial contains the full cumulative transcription of the turn so far. Earlier words may be refined as more context becomes available, but updates only happen during silence (not on every frame), so the transcript is typically far more stable than traditional streaming models.

Last word accuracy

Speculative inference based on noisy partials can be counterproductive. The final word of a turn often carries critical semantic weight:

“I want to cancel.” (word-by-word, wrong)
“I want to continue.” (full partial after silence, correct)

Getting a high-accuracy segment with a slight delay is often more valuable than getting a lower-accuracy partial a few milliseconds earlier.

Latency performance

After silence detection:

Metric	Latency
P50 inference latency	~121ms
P90 inference latency	~212ms

This makes U3 Pro competitive with, or faster than, many traditional streaming-partial pipelines. The window between the end of speech and an available transcript stays short.

Latency vs. entity splitting trade-off

Setting min_turn_silence too low can split entities like phone numbers and emails for speakers with slow speech patterns. The accuracy is often still high enough for LLMs to piece together the broken entities, but we recommend testing carefully with your use case.

Setting max_turn_silence too low can have the same impact, but entity splitting is less likely since max_turn_silence is typically a greater value than min_turn_silence and a forced end-of-turn only triggers when terminal punctuation is not detected. If you have audio with very long (>1s) pauses and you’d like to keep these utterances as a single turn, you may want to increase max_turn_silence to avoid cutting off the turn too early.

Tuning for your use case

For eager LLM inference on partials, we recommend setting min_turn_silence to 100 (default value).

You can also adjust min_turn_silence (and potentially max_turn_silence for very long pauses) for specific moments mid-stream via UpdateConfiguration. For example, increase it when a caller is about to read out a credit card, ID number, or email, and you’d prefer to wait for a longer silence before checking for an end of turn and potentially emitting a partial.

// LLM detects it's asking for a long utterance (e.g., credit card number)
{ "type": "UpdateConfiguration", "min_turn_silence": 1000 }

Then reset it after the user responds:

// User has responded, restore default turn detection
{ "type": "UpdateConfiguration", "min_turn_silence": 100 }

A clean way to implement this is by giving your LLM a tool call:

import json

DEFAULT_MIN_TURN_SILENCE = 100  # your preferred default (ms)
EXTENDED_MIN_TURN_SILENCE = 1000  # your preferred extended value (ms)

def dynamically_set_turn_silence(ws, min_turn_silence_ms: int):
    """Adjust min_turn_silence on the STT stream.

    Use EXTENDED_MIN_TURN_SILENCE when expecting long utterances
    (credit cards, phone numbers).
    Use DEFAULT_MIN_TURN_SILENCE to restore normal turn detection.
    """
    # A plain string is used here because an f-string is not stored as a docstring.
    ws.send(json.dumps({
        "type": "UpdateConfiguration",
        "min_turn_silence": min_turn_silence_ms
    }))

See Updating configuration mid-stream for more details.
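One way to make sure the default is always restored is to wrap the "raise, then reset" pattern in a context manager. The sketch below is an illustration under assumptions: `extended_turn_silence` and `FakeSocket` are hypothetical names, and `ws` is any object with a `send(str)` method, as in the example above.

```python
import json
from contextlib import contextmanager


@contextmanager
def extended_turn_silence(ws, extended_ms: int = 1000, default_ms: int = 100):
    """Raise min_turn_silence for the duration of the block, then restore it."""
    ws.send(json.dumps({"type": "UpdateConfiguration",
                        "min_turn_silence": extended_ms}))
    try:
        yield
    finally:
        # Runs even if the block raises, so the stream is not left
        # with the extended value.
        ws.send(json.dumps({"type": "UpdateConfiguration",
                            "min_turn_silence": default_ms}))


class FakeSocket:
    """Stand-in for a real WebSocket client; records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, payload: str) -> None:
        self.sent.append(json.loads(payload))


ws = FakeSocket()
with extended_turn_silence(ws):
    pass  # ask for the credit card number and wait for the user's turn here
print(ws.sent)
```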