
Universal-3 Pro Streaming: Message Sequence Breakdown

For a description of each message field, refer to our Turn object explanation.

Understanding transcript vs utterance

Before walking through the message sequence, it’s important to understand the difference between the transcript and utterance fields:

  • transcript — The full transcript of the current turn up to this point in time.
  • utterance — Only populated on the end_of_turn: true message, where it always equals transcript. On all other Turn messages, utterance is an empty string "".

Key takeaway: For Universal-3 Pro Streaming, you can always use transcript, because the utterance field provides no information beyond what transcript already contains. The field exists for API consistency with Universal Streaming, where utterance boundaries can fire independently of turn boundaries, typically to support eager LLM inference.
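As a minimal sketch of this takeaway, a handler can read transcript unconditionally and ignore utterance. The example assumes the message has already been parsed from JSON into a Python dict. The function name turn_text is hypothetical and not part of any SDK.

```python
def turn_text(message: dict) -> str:
    """Return the text of a Turn message, ignoring the utterance field."""
    if message.get("type") != "Turn":
        raise ValueError("expected a Turn message")
    return message.get("transcript", "")


# Trimmed copies of the partial and final messages from this walkthrough.
partial = {"type": "Turn", "end_of_turn": False,
           "transcript": "My name is\u2014", "utterance": ""}
final = {"type": "Turn", "end_of_turn": True,
         "transcript": "My name is Sonny.", "utterance": "My name is Sonny."}

print(turn_text(partial))
print(turn_text(final))
```

On the final message, the value returned is identical to utterance, so nothing is lost by skipping that field.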

Universal-3 Pro Streaming handles message sequences differently from Universal Streaming. Instead of emitting word-by-word partial transcripts as audio is processed, Universal-3 Pro Streaming produces transcripts only during silence periods. Key differences include:

  • Partials only during silence — transcripts are emitted when the speaker pauses, not on every audio frame.
  • Formatting is built in: turn_is_formatted is true on end-of-turn transcripts. There is no separate formatting step.
  • Punctuation-based turn detection — turns end when terminal punctuation (. ? !) is detected, not based on a confidence threshold.
  • end_of_turn_confidence is always 1 when triggered by terminal punctuation.
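The punctuation rule in the list above can be checked on the client side. The sketch below assumes a Turn message already parsed into a dict. The helper name ended_on_terminal_punctuation is hypothetical, and it only inspects the message; it does not reproduce how the server decides to end a turn.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def ended_on_terminal_punctuation(message: dict) -> bool:
    """True if the message ends a turn and its transcript ends in . ? or !"""
    if not message.get("end_of_turn"):
        return False
    transcript = message.get("transcript", "").rstrip()
    return transcript.endswith(TERMINAL_PUNCTUATION)


partial = {"type": "Turn", "end_of_turn": False, "transcript": "My name is\u2014"}
final = {"type": "Turn", "end_of_turn": True, "transcript": "My name is Sonny."}

print(ended_on_terminal_punctuation(partial))  # False
print(ended_on_terminal_punctuation(final))    # True
```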

For this example, we walk through a user saying: My name is Sonny.

The speaker pauses briefly mid-sentence (after “is”), producing a partial transcript, then finishes the sentence, producing a final end-of-turn transcript.

Session initialization

When the session begins, you receive a Begin message with the session ID and expiration time.

{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132
}
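A client will usually want the expiration as a datetime rather than a raw number. The sketch below assumes expires_at is a Unix timestamp in seconds, which the example value suggests but this page does not state. The function name parse_begin is hypothetical.

```python
import json
from datetime import datetime, timezone

raw = ('{"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8", '
       '"expires_at": 1772570132}')


def parse_begin(raw_message: str) -> tuple[str, datetime]:
    """Return (session_id, expiration) from a raw Begin message."""
    message = json.loads(raw_message)
    if message.get("type") != "Begin":
        raise ValueError("expected a Begin message")
    # Assumption: expires_at is Unix time in seconds.
    expires = datetime.fromtimestamp(message["expires_at"], tz=timezone.utc)
    return message["id"], expires


session_id, expires = parse_begin(raw)
print(session_id, expires.isoformat())
```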

Speech detected

Before any Turn messages are sent, the server sends a SpeechStarted message indicating that speech has been detected. The timestamp field indicates when the speech was detected, in milliseconds relative to the beginning of the audio stream. The confidence field is the VAD model’s confidence score that speech has started.

{
  "type": "SpeechStarted",
  "timestamp": 1216,
  "confidence": 0.987654
}
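For logging, the millisecond timestamp can be converted to seconds. This is an illustrative sketch only. The function name describe_speech_started is hypothetical, and the message is assumed to be parsed into a dict already.

```python
def describe_speech_started(message: dict) -> str:
    """Format a SpeechStarted message as a short log line."""
    if message.get("type") != "SpeechStarted":
        raise ValueError("expected a SpeechStarted message")
    seconds = message["timestamp"] / 1000  # milliseconds from stream start
    return (f"speech detected at {seconds:.3f}s "
            f"(VAD confidence {message['confidence']:.2f})")


message = {"type": "SpeechStarted", "timestamp": 1216, "confidence": 0.987654}
print(describe_speech_started(message))
# speech detected at 1.216s (VAD confidence 0.99)
```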

Partial transcript

The speaker says “My name is” and pauses briefly. Because the speaker has stopped talking but no terminal punctuation has been detected, Universal-3 Pro Streaming emits a partial transcript.

Notice that:

  • end_of_turn is false — the turn has not ended yet.
  • turn_is_formatted is false — this is not a finalized transcript.
  • end_of_turn_confidence is 0 — no terminal punctuation detected.
  • All words have word_is_final: false — the transcript may be revised in the final message.
  • The transcript ends with an em dash (—), indicating the utterance is incomplete.
  • The utterance field is an empty string because the turn has not ended. Use transcript to access the current partial text.

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "My name is—",
  "end_of_turn_confidence": 0,
  "words": [
    {
      "start": 1216,
      "end": 1627,
      "text": "My",
      "confidence": 0.956314,
      "word_is_final": false
    },
    {
      "start": 1668,
      "end": 2490,
      "text": "name",
      "confidence": 0.999393,
      "word_is_final": false
    },
    {
      "start": 2531,
      "end": 3067,
      "text": "is—",
      "confidence": 0.753325,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

Each silence period produces at most one partial. If the speaker continues pausing without finishing the sentence, no additional partial is emitted until new speech is detected.
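If partial text is shown to a user, the trailing em dash may be worth replacing with something that reads more naturally. The sketch below is one possible presentation choice, not documented behavior. The function name display_partial and the "..." replacement are both assumptions made for illustration.

```python
EM_DASH = "\u2014"  # the character that ends an incomplete partial transcript


def display_partial(message: dict) -> str:
    """Return transcript text for display, swapping a trailing em dash for '...'."""
    text = message["transcript"]
    if not message["end_of_turn"] and text.endswith(EM_DASH):
        text = text[: -len(EM_DASH)] + "..."
    return text


partial = {"type": "Turn", "end_of_turn": False,
           "transcript": "My name is" + EM_DASH}
print(display_partial(partial))  # My name is...
```

Final transcripts pass through unchanged, because the replacement only applies when end_of_turn is false.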

End of turn (Final transcript)

The speaker continues and says “Sonny.” — completing the sentence with a period. Universal-3 Pro Streaming detects the terminal punctuation and ends the turn with a fully formatted final transcript.

Notice how the final transcript differs from the partial:

  • end_of_turn is now true — the turn has ended.
  • turn_is_formatted is true — this is a finalized, formatted transcript.
  • end_of_turn_confidence is 1 — terminal punctuation triggered the end of turn.
  • All words now have word_is_final: true — the transcript is final and will not be revised.
  • The word timestamps and confidences have been refined compared to the partial.
  • The utterance field now contains the complete finalized text.
  • The incomplete “is—” from the partial has been resolved to “is” and “Sonny.” in the final transcript.

{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "start": 1216,
      "end": 1635,
      "text": "My",
      "confidence": 0.956583,
      "word_is_final": true
    },
    {
      "start": 1676,
      "end": 2515,
      "text": "name",
      "confidence": 0.999199,
      "word_is_final": true
    },
    {
      "start": 2556,
      "end": 2975,
      "text": "is",
      "confidence": 0.999535,
      "word_is_final": true
    },
    {
      "start": 3016,
      "end": 4155,
      "text": "Sonny.",
      "confidence": 0.316031,
      "word_is_final": true
    }
  ],
  "utterance": "My name is Sonny.",
  "type": "Turn"
}

Unlike Universal Streaming, there is no separate formatting message. The end-of-turn transcript is always formatted.
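To see which words changed between the partial and the final message, the two words arrays can be compared. The sketch below pairs words by position, which works for this short example but is an assumption and not a general alignment method. The function name text_changes is hypothetical, and the word objects are trimmed to their text field.

```python
def text_changes(partial_words: list[dict], final_words: list[dict]) -> list[tuple]:
    """List (index, partial_text, final_text) for words whose text differs.

    Words are paired by position; a word missing from the partial is
    reported with None as its partial text.
    """
    changes = []
    for index, final_word in enumerate(final_words):
        before = partial_words[index]["text"] if index < len(partial_words) else None
        if before != final_word["text"]:
            changes.append((index, before, final_word["text"]))
    return changes


partial_words = [{"text": "My"}, {"text": "name"}, {"text": "is\u2014"}]
final_words = [{"text": "My"}, {"text": "name"}, {"text": "is"}, {"text": "Sonny."}]

for index, before, after in text_changes(partial_words, final_words):
    print(index, repr(before), "->", repr(after))
```

For this example, the function reports two changes: the word at index 2 loses its trailing em dash, and index 3 is new in the final message.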

Session termination

To end a session, the client must send a Terminate message. The server then responds with a Termination message containing the total audio and session durations, and closes the connection.

Client sends:

{ "type": "Terminate" }

Server responds:

{
  "type": "Termination",
  "audio_duration_seconds": 13,
  "session_duration_seconds": 13
}

Always terminate sessions explicitly. Sessions that are not terminated remain open and continue to accrue charges until the server auto-closes them after 3 hours (error code 3008). See Common errors for more details.
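One way to make sure the Terminate message is always sent is to send it from a finally block. The sketch below is transport-agnostic. The send and stream_audio callables are hypothetical stand-ins for whatever WebSocket client and audio loop an application uses, and run_session is not an SDK function.

```python
import json
from typing import Callable

TERMINATE = json.dumps({"type": "Terminate"})


def run_session(send: Callable[[str], None], stream_audio: Callable[[], None]) -> None:
    """Stream audio, then send Terminate even if streaming raises."""
    try:
        stream_audio()
    finally:
        # Runs on success and on error, so the session is not left open.
        send(TERMINATE)


# Usage with stand-ins: a list collects the messages that would be sent.
sent: list[str] = []
run_session(sent.append, lambda: None)
print(sent)  # ['{"type": "Terminate"}']
```

If sending Terminate can itself fail, for example because the connection has already dropped, a real client would also need to handle that error.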

Summary

The complete message flow for this example is:

  1. Begin — session initialized
  2. SpeechStarted — speech detected at 1216ms
  3. Turn (partial) — speaker pauses mid-sentence; end_of_turn: false, turn_is_formatted: false
  4. Turn (final) — speaker finishes with terminal punctuation; end_of_turn: true, turn_is_formatted: true
  5. Termination — session ended

For more details on how partials work and how to tune turn detection timing, see Turn Detection and Partials.
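The flow in this summary can be reproduced by labeling each server message by its type, with Turn messages split on end_of_turn. The sketch below uses trimmed copies of the messages from this page. The function name label_flow is hypothetical.

```python
def label_flow(messages: list[dict]) -> list[str]:
    """Label each server message, splitting Turn into partial and final."""
    labels = []
    for message in messages:
        kind = message["type"]
        if kind == "Turn":
            kind = "Turn (final)" if message["end_of_turn"] else "Turn (partial)"
        labels.append(kind)
    return labels


# Trimmed server messages in the order they arrive in this example.
messages = [
    {"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8"},
    {"type": "SpeechStarted", "timestamp": 1216},
    {"type": "Turn", "end_of_turn": False, "transcript": "My name is\u2014"},
    {"type": "Turn", "end_of_turn": True, "transcript": "My name is Sonny."},
    {"type": "Termination", "audio_duration_seconds": 13},
]

for step, label in enumerate(label_flow(messages), start=1):
    print(step, label)
```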

Comparison with Universal Streaming

| Behavior | Universal-3 Pro Streaming | Universal Streaming |
| --- | --- | --- |
| Partial frequency | At most one per silence period | Every audio frame (word-by-word) |
| Formatting | Built in to every end-of-turn transcript | Separate turn_is_formatted message when format_turns=true |
| Turn detection | Punctuation-based (min_turn_silence / max_turn_silence) | Confidence-based (end_of_turn_confidence_threshold) |
| end_of_turn_confidence | Always 1 when triggered by punctuation | Varies based on model confidence |
| Words in partials | All word_is_final: false | Mix of true and false as words are finalized incrementally |

For the Universal Streaming message sequence, see Message Sequence.