Universal-3 Pro: Message Sequence Breakdown
For a description of each message field, refer to our Turn object explanation.
Universal-3 Pro handles message sequences differently from Universal Streaming. Instead of emitting word-by-word partial transcripts as audio is processed, Universal-3 Pro produces transcripts only during silence periods. Key differences include:
- Partials only during silence: transcripts are emitted when the speaker pauses, not on every audio frame.
- Formatting is built in: turn_is_formatted is true on end-of-turn transcripts. There is no separate formatting step.
- Punctuation-based turn detection: turns end when terminal punctuation (a period, question mark, or exclamation point) is detected, not based on a confidence threshold. When terminal punctuation triggers the end of turn, end_of_turn_confidence is always 1.
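Because turn detection is punctuation-based, the rule is easy to state in code. The helper below is hypothetical and purely illustrative. It mirrors the rule as described above so that you could, for example, sanity-check final transcripts on the client. It is not the server's turn-detection implementation.

```python
# Terminal punctuation that ends a turn, per the description above.
TERMINAL_PUNCTUATION = (".", "?", "!")


def ends_with_terminal_punctuation(transcript: str) -> bool:
    """Return True if the transcript ends with '.', '?' or '!'.

    Hypothetical client-side check for illustration only; this is not
    how the server implements turn detection.
    """
    # str.endswith accepts a tuple of suffixes; trailing whitespace is ignored.
    return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)


print(ends_with_terminal_punctuation("My name is Sonny."))  # True
print(ends_with_terminal_punctuation("My name is\u2014"))  # False: em dash, turn still open
```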
For this example, we walk through a user saying:
My name is Sonny.
The speaker pauses briefly mid-sentence (after “is”), producing a partial transcript, then finishes the sentence, producing a final end-of-turn transcript.
Session initialization
When the session begins, you receive a Begin message with the session ID and expiration time.
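A minimal sketch of reading the Begin message in Python. It assumes each message arrives as a JSON text frame with a "type" field. It also assumes the session ID and expiration time are carried in fields named "id" and "expires_at" (Unix seconds). Those field names are assumptions, not confirmed by this page, so check the message reference before relying on them.

```python
import json
from datetime import datetime, timezone


def parse_begin(raw: str) -> tuple[str, datetime]:
    """Return (session_id, expires_at) from a Begin message.

    Assumption: the session ID is in "id" and the expiration time is in
    "expires_at" as Unix seconds.
    """
    msg = json.loads(raw)
    if msg.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {msg.get('type')!r}")
    # Interpret the expiration as a UTC timestamp.
    expires_at = datetime.fromtimestamp(msg["expires_at"], tz=timezone.utc)
    return msg["id"], expires_at


# Hypothetical message for illustration only.
session_id, expires_at = parse_begin(
    '{"type": "Begin", "id": "example-session-id", "expires_at": 1700000000}'
)
print(session_id, expires_at.isoformat())
```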
Speech detected
Before any Turn messages are sent, the server sends a SpeechStarted message indicating that speech has been detected. The timestamp field indicates when the speech was detected, in milliseconds relative to the beginning of the audio stream.
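Because the SpeechStarted timestamp is relative to the start of the audio stream, you can map it onto the audio you have already sent. The sketch below assumes the message is a JSON object with "type" and "timestamp" fields. It also assumes a hypothetical 16 kHz stream, so substitute the sample rate you actually use.

```python
import json


def speech_start_sample(raw: str, sample_rate_hz: int) -> int:
    """Convert a SpeechStarted timestamp into a sample index in the client's audio.

    The timestamp is in milliseconds from the beginning of the audio stream;
    sample_rate_hz is the rate the client streams at.
    """
    msg = json.loads(raw)
    if msg.get("type") != "SpeechStarted":
        raise ValueError(f"expected a SpeechStarted message, got {msg.get('type')!r}")
    # samples = ms * (samples per second) / (1000 ms per second)
    return msg["timestamp"] * sample_rate_hz // 1000


# 1216 ms is the timestamp from this walkthrough; 16 kHz is a hypothetical stream rate.
print(speech_start_sample('{"type": "SpeechStarted", "timestamp": 1216}', 16_000))  # 19456
```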
Partial transcript
The speaker says “My name is” and pauses briefly. Because the speaker has stopped talking but no terminal punctuation has been detected, Universal-3 Pro emits a partial transcript.
Notice that:
- end_of_turn is false: the turn has not ended yet.
- turn_is_formatted is false: this is not a finalized transcript.
- end_of_turn_confidence is 0: no terminal punctuation has been detected.
- All words have word_is_final: false, so the transcript may still be revised in the final message.
- The transcript ends with an em dash (—), indicating that the utterance is incomplete.
- The utterance field is an empty string because the turn has not ended.
Each silence period produces at most one partial. If the speaker stays silent without finishing the sentence, no additional partial is emitted until new speech is detected.
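On the client, a partial can be recognized from end_of_turn and turn_is_formatted alone. This sketch assumes Turn messages have already been parsed into dicts with a "type" field. The helper names are hypothetical. The sketch strips the trailing em dash for display, but because the words are not final, the result should be treated as provisional.

```python
EM_DASH = "\u2014"  # marks an incomplete partial transcript


def is_partial(turn: dict) -> bool:
    """A partial Turn has neither ended nor been formatted."""
    return (
        turn["type"] == "Turn"
        and not turn["end_of_turn"]
        and not turn["turn_is_formatted"]
    )


def provisional_text(turn: dict) -> str:
    """Text to display while the turn is still open, without the trailing em dash.

    Words in a partial have word_is_final == False and may be revised, so
    callers should overwrite this text rather than append to it.
    """
    if not is_partial(turn):
        raise ValueError("expected a partial Turn message")
    return turn["transcript"].rstrip().removesuffix(EM_DASH).rstrip()


# Field values from the partial in this walkthrough (other fields omitted).
partial = {
    "type": "Turn",
    "end_of_turn": False,
    "turn_is_formatted": False,
    "end_of_turn_confidence": 0,
    "transcript": "My name is" + EM_DASH,
    "utterance": "",
}
print(provisional_text(partial))  # My name is
```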
End of turn (final transcript)
The speaker then says “Sonny”, completing the sentence. Universal-3 Pro detects the terminal punctuation (the period after “Sonny”) and ends the turn with a fully formatted final transcript.
Notice how the final transcript differs from the partial:
- end_of_turn is now true: the turn has ended.
- turn_is_formatted is true: this is a finalized, formatted transcript.
- end_of_turn_confidence is 1: terminal punctuation triggered the end of turn.
- All words now have word_is_final: true, so the transcript is final and will not be revised.
- The word timestamps and confidences have been refined compared to the partial.
- The utterance field now contains the complete finalized text.
- The incomplete “is—” from the partial has been resolved to “is” and “Sonny.” in the final transcript.
Unlike Universal Streaming, there is no separate formatting message. The end-of-turn transcript is always formatted.
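Because the end-of-turn transcript is already formatted, a client can commit it as soon as end_of_turn is true and discard the pending partial. The TranscriptView class below is a hypothetical sketch of that pattern, not part of any SDK. It uses the utterance field for the committed text and assumes Turn messages have already been parsed into dicts.

```python
class TranscriptView:
    """Finalized turns plus at most one provisional partial (hypothetical helper)."""

    def __init__(self) -> None:
        self.final_turns: list[str] = []
        self.pending = ""

    def on_turn(self, turn: dict) -> None:
        if turn["end_of_turn"]:
            # End-of-turn transcripts are already formatted (turn_is_formatted is true),
            # so there is no later formatted message to wait for.
            self.final_turns.append(turn["utterance"])
            self.pending = ""
        else:
            # Partials may be revised, so each one replaces the previous pending text.
            self.pending = turn["transcript"]

    def render(self) -> str:
        parts = self.final_turns + ([self.pending] if self.pending else [])
        return " ".join(parts)


view = TranscriptView()
view.on_turn({"end_of_turn": False, "turn_is_formatted": False,
              "transcript": "My name is\u2014", "utterance": ""})
view.on_turn({"end_of_turn": True, "turn_is_formatted": True,
              "transcript": "My name is Sonny.", "utterance": "My name is Sonny."})
print(view.render())  # My name is Sonny.
```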
Session termination
When the session ends, a Termination message is sent with the total audio and session durations.
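A sketch of reading the Termination message. The field names audio_duration_seconds and session_duration_seconds are assumptions. This page only says that the message carries the total audio and session durations. The values in the example are made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class SessionTotals:
    audio_seconds: float
    session_seconds: float


def parse_termination(raw: str) -> SessionTotals:
    """Read the total audio and session durations from a Termination message.

    Assumption: the durations arrive as "audio_duration_seconds" and
    "session_duration_seconds".
    """
    msg = json.loads(raw)
    if msg.get("type") != "Termination":
        raise ValueError(f"expected a Termination message, got {msg.get('type')!r}")
    return SessionTotals(
        audio_seconds=float(msg["audio_duration_seconds"]),
        session_seconds=float(msg["session_duration_seconds"]),
    )


# Hypothetical values for illustration only.
totals = parse_termination(
    '{"type": "Termination", "audio_duration_seconds": 3, "session_duration_seconds": 4}'
)
print(totals)
```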
Summary
The complete message flow for this example is:
- Begin: session initialized
- SpeechStarted: speech detected at 1216 ms
- Turn (partial): speaker pauses mid-sentence; end_of_turn: false, turn_is_formatted: false
- Turn (final): speaker finishes with terminal punctuation; end_of_turn: true, turn_is_formatted: true
- Termination: session ended
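To check a recorded session against this flow, you can label each message the same way the summary does. This is a hypothetical helper. It assumes complete sessions (Begin first, Termination last) and relies only on the "type" and "end_of_turn" fields shown in the example messages.

```python
import json


def label_sequence(raw_messages: list[str]) -> list[str]:
    """Label each message as in the summary and check the overall order."""
    labels = []
    for raw in raw_messages:
        msg = json.loads(raw)
        label = msg["type"]
        if label == "Turn":
            # end_of_turn distinguishes the final transcript from a partial.
            label = "Turn (final)" if msg["end_of_turn"] else "Turn (partial)"
        labels.append(label)
    # In a complete session, Begin comes first and Termination comes last.
    if labels and (labels[0] != "Begin" or labels[-1] != "Termination"):
        raise ValueError(f"unexpected message order: {labels}")
    return labels


# Minimal messages for this walkthrough; other fields are omitted.
messages = [
    '{"type": "Begin"}',
    '{"type": "SpeechStarted", "timestamp": 1216}',
    '{"type": "Turn", "end_of_turn": false, "turn_is_formatted": false}',
    '{"type": "Turn", "end_of_turn": true, "turn_is_formatted": true}',
    '{"type": "Termination"}',
]
print(label_sequence(messages))
```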
For more details on how partials work and how to tune turn detection timing, see Turn Detection and Partials.
Comparison with Universal Streaming
For the Universal Streaming message sequence, see Message Sequence.