Universal-3 Pro: Message Sequence Breakdown

For a description of each message field, refer to our Turn object explanation.

Universal-3 Pro handles message sequences differently from Universal Streaming. Instead of emitting word-by-word partial transcripts as audio is processed, Universal-3 Pro produces transcripts only during silence periods. Key differences include:

  • Partials only during silence — transcripts are emitted when the speaker pauses, not on every audio frame.
  • Formatting is built in: turn_is_formatted is true on end-of-turn transcripts. There is no separate formatting step.
  • Punctuation-based turn detection — turns end when terminal punctuation (. ? !) is detected, not based on a confidence threshold.
  • end_of_turn_confidence is always 1 when triggered by terminal punctuation.
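
Taken together, these differences suggest that a client can tell a partial from a final transcript using the end_of_turn flag alone. The following is a minimal sketch of that check, assuming messages arrive as JSON text; the function name classify_turn is hypothetical and not part of any SDK.

```python
import json


def classify_turn(raw: str) -> str:
    """Classify a raw JSON message as 'partial', 'final', or 'other'."""
    message = json.loads(raw)
    if message.get("type") != "Turn":
        return "other"
    # With Universal-3 Pro, an end-of-turn transcript is already formatted,
    # so end_of_turn is the only flag this sketch needs to inspect.
    return "final" if message.get("end_of_turn") else "partial"


print(classify_turn('{"type": "Turn", "end_of_turn": false}'))
print(classify_turn('{"type": "Turn", "end_of_turn": true}'))
```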

For this example, we walk through a user saying: My name is Sonny.

The speaker pauses briefly mid-sentence (after “is”), producing a partial transcript, then finishes the sentence, producing a final end-of-turn transcript.

Session initialization

When the session begins, you receive a Begin message with the session ID and expiration time.

{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132
}
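
To make the expiration time easier to work with, a client might convert it to a datetime. This sketch assumes expires_at is a Unix timestamp in seconds (UTC), which the magnitude of the sample value suggests but this page does not state; session_expiry is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone


def session_expiry(raw: str) -> datetime:
    """Return the session expiration from a Begin message as a UTC datetime."""
    message = json.loads(raw)
    if message.get("type") != "Begin":
        raise ValueError("expected a Begin message")
    # Assumption: expires_at is expressed in seconds since the Unix epoch.
    return datetime.fromtimestamp(message["expires_at"], tz=timezone.utc)


begin = (
    '{"type": "Begin", '
    '"id": "3207b601-2054-48df-ba77-8784dfcf9fb8", '
    '"expires_at": 1772570132}'
)
expiry = session_expiry(begin)
print(expiry.isoformat())
```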

Speech detected

Before any Turn messages are sent, the server sends a SpeechStarted message indicating that speech has been detected. The timestamp field indicates when the speech was detected, in milliseconds relative to the beginning of the audio stream.

{
  "type": "SpeechStarted",
  "timestamp": 1216
}

Partial transcript

The speaker says “My name is” and pauses briefly. Because the speaker has stopped talking but no terminal punctuation has been detected, Universal-3 Pro emits a partial transcript.

Notice that:

  • end_of_turn is false — the turn has not ended yet.
  • turn_is_formatted is false — this is not a finalized transcript.
  • end_of_turn_confidence is 0 — no terminal punctuation detected.
  • All words have word_is_final: false — the transcript may be revised in the final message.
  • The transcript ends with an em dash (—), indicating the utterance is incomplete.
  • The utterance field is an empty string because the turn has not ended.

{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "My name is—",
  "end_of_turn_confidence": 0,
  "words": [
    {
      "start": 1216,
      "end": 1627,
      "text": "My",
      "confidence": 0.956314,
      "word_is_final": false
    },
    {
      "start": 1668,
      "end": 2490,
      "text": "name",
      "confidence": 0.999393,
      "word_is_final": false
    },
    {
      "start": 2531,
      "end": 3067,
      "text": "is—",
      "confidence": 0.753325,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}

Each silence period produces at most one partial. If the speaker continues pausing without finishing the sentence, no additional partial is emitted until new speech is detected.
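
If a partial is shown to the user as live text, the trailing em dash may be unwanted. The sketch below strips it from partials only and leaves final transcripts untouched. Whether to strip the marker is a design choice rather than something this page prescribes, and display_text is a hypothetical helper.

```python
EM_DASH = "\u2014"  # the marker that ends an incomplete partial transcript


def display_text(turn: dict) -> str:
    """Return transcript text suitable for display."""
    text = turn["transcript"]
    if not turn["end_of_turn"]:
        # Partials end with an em dash; drop it and any spacing before it.
        text = text.rstrip(EM_DASH).rstrip()
    return text


partial = {"transcript": "My name is" + EM_DASH, "end_of_turn": False}
print(display_text(partial))
```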

End of turn (Final transcript)

The speaker continues and says “Sonny.” — completing the sentence with a period. Universal-3 Pro detects the terminal punctuation and ends the turn with a fully formatted final transcript.

Notice how the final transcript differs from the partial:

  • end_of_turn is now true — the turn has ended.
  • turn_is_formatted is true — this is a finalized, formatted transcript.
  • end_of_turn_confidence is 1 — terminal punctuation triggered the end of turn.
  • All words now have word_is_final: true — the transcript is final and will not be revised.
  • The word timestamps and confidences have been refined compared to the partial.
  • The utterance field now contains the complete finalized text.
  • The incomplete “is—” from the partial has been resolved to “is” and “Sonny.” in the final transcript.

{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "start": 1216,
      "end": 1635,
      "text": "My",
      "confidence": 0.956583,
      "word_is_final": true
    },
    {
      "start": 1676,
      "end": 2515,
      "text": "name",
      "confidence": 0.999199,
      "word_is_final": true
    },
    {
      "start": 2556,
      "end": 2975,
      "text": "is",
      "confidence": 0.999535,
      "word_is_final": true
    },
    {
      "start": 3016,
      "end": 4155,
      "text": "Sonny.",
      "confidence": 0.316031,
      "word_is_final": true
    }
  ],
  "utterance": "My name is Sonny.",
  "type": "Turn"
}

Unlike Universal Streaming, there is no separate formatting message. The end-of-turn transcript is always formatted.
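
To see how a final transcript revises a partial, the two words arrays can be compared. This sketch pairs words by position, which is an assumption that holds for this example but is not guaranteed by this page; word_changes is a hypothetical helper.

```python
def word_changes(partial_words: list, final_words: list) -> list:
    """List (old_text, new_text) pairs for words changed or added in the final.

    Assumption: words are paired by index. A new word has old_text None.
    """
    changes = []
    for index, final_word in enumerate(final_words):
        if index >= len(partial_words):
            changes.append((None, final_word["text"]))
        elif partial_words[index]["text"] != final_word["text"]:
            changes.append((partial_words[index]["text"], final_word["text"]))
    return changes


partial_words = [{"text": "My"}, {"text": "name"}, {"text": "is\u2014"}]
final_words = [{"text": "My"}, {"text": "name"}, {"text": "is"}, {"text": "Sonny."}]
print(word_changes(partial_words, final_words))
```

For the messages above, this reports that the incomplete third word was resolved to "is" and that "Sonny." was added.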

Session termination

When the session ends, a Termination message is sent with the total audio and session durations.

{
  "type": "Termination",
  "audio_duration_seconds": 13,
  "session_duration_seconds": 13
}

Summary

The complete message flow for this example is:

  1. Begin — session initialized
  2. SpeechStarted — speech detected at 1216ms
  3. Turn (partial) — speaker pauses mid-sentence; end_of_turn: false, turn_is_formatted: false
  4. Turn (final) — speaker finishes with terminal punctuation; end_of_turn: true, turn_is_formatted: true
  5. Termination — session ended

For more details on how partials work and how to tune turn detection timing, see Turn Detection and Partials.
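
The flow summarized above can be sketched as a small dispatcher that walks the messages in order and collects final transcripts. It is an illustration only: the messages are abbreviated copies of the ones on this page, no network connection is made, and run_session is a hypothetical name.

```python
import json


def run_session(raw_messages):
    """Process JSON messages in order; return (events, final_transcripts)."""
    events, finals = [], []
    for raw in raw_messages:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "Turn":
            if message["end_of_turn"]:
                # Final transcripts arrive already formatted.
                finals.append(message["transcript"])
                events.append("Turn (final)")
            else:
                events.append("Turn (partial)")
        else:
            events.append(kind)
        if kind == "Termination":
            break  # nothing is expected after the session ends
    return events, finals


sample = [json.dumps(m) for m in (
    {"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
     "expires_at": 1772570132},
    {"type": "SpeechStarted", "timestamp": 1216},
    {"type": "Turn", "turn_order": 0, "end_of_turn": False,
     "turn_is_formatted": False, "transcript": "My name is\u2014"},
    {"type": "Turn", "turn_order": 0, "end_of_turn": True,
     "turn_is_formatted": True, "transcript": "My name is Sonny."},
    {"type": "Termination", "audio_duration_seconds": 13,
     "session_duration_seconds": 13},
)]
events, finals = run_session(sample)
print(events)
print(finals)
```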

Comparison with Universal Streaming

| Behavior | Universal-3 Pro | Universal Streaming |
| --- | --- | --- |
| Partial frequency | At most one per silence period | Every audio frame (word-by-word) |
| Formatting | Built into every end-of-turn transcript | Separate turn_is_formatted message after end of turn |
| Turn detection | Punctuation-based (min_turn_silence / max_turn_silence) | Confidence-based (end_of_turn_confidence_threshold) |
| end_of_turn_confidence | Always 1 when triggered by punctuation | Varies based on model confidence |
| Words in partials | All word_is_final: false | Mix of true and false as words are finalized incrementally |

For the Universal Streaming message sequence, see Message Sequence.