Sequence diagram
Messages at a glance
Client → server

| Message | Frame type | Purpose |
|---|---|---|
| Audio | Binary | Raw audio bytes (PCM, no JSON or base64 wrapper). Send ~50 ms chunks. |
| UpdateConfiguration | Text (JSON) | Change transcription settings mid-session. |
| ForceEndpoint | Text (JSON) | Immediately end the current turn. |
| KeepAlive | Text (JSON) | Reset the inactivity timer (only needed with inactivity_timeout). |
| Terminate | Text (JSON) | End the session. |

Server → client

| Message | When | Purpose |
|---|---|---|
| Begin | Once, on connect | Session ID, expiration time, and the configuration the server actually applied. |
| SpeechStarted | Per turn (Universal-3 Pro only) | Speech detected; precedes the turn’s first Turn message. |
| Turn | Repeatedly | Partial and final transcripts; see the walkthrough below. |
| SpeakerRevision | If speaker_labels=true | Revised speaker labels for earlier turns. |
| Termination | Once, last message | Session totals; nothing follows it. |
| Error | On failure, before close | Error code and detail; the connection then closes. |
Session initialization
When the session begins, you receive a Begin message with the session ID, expiration time, and the configuration the server applied to your session.
expires_at is a Unix timestamp (seconds). When it is reached, the server closes the session with error 3008.
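As a minimal sketch of working with expires_at, the helper below computes how much time is left before the server would close the session. The function names and the 30-second reconnect margin are illustrative assumptions, not part of the API.

```python
import time
from typing import Optional


def seconds_until_expiry(expires_at: int, now: Optional[float] = None) -> float:
    """Seconds left before the session expires (negative once it has passed).

    expires_at is the Unix timestamp (seconds) from the Begin message.
    """
    current = time.time() if now is None else now
    return expires_at - current


def should_reconnect(expires_at: int, margin_s: float = 30.0,
                     now: Optional[float] = None) -> bool:
    """True once the session is within margin_s seconds of expiring.

    margin_s is a hypothetical safety margin chosen by the client.
    """
    return seconds_until_expiry(expires_at, now) <= margin_s
```

Checking should_reconnect periodically lets a client open a fresh session before the old one is closed with error 3008.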
Sending audio
Audio is sent as binary WebSocket frames containing raw audio bytes in the encoding and sample rate you set at connection time (default: 16 kHz, 16-bit mono PCM). Do not wrap audio in JSON and do not base64-encode it. A text frame that isn’t valid JSON is rejected and closes the session. Send audio in chunks of roughly 50 ms (800 samples at 16 kHz). You can send audio faster than real time (for example, when streaming from a file), but throughput is throttled to ~1.25× real time. If more than five minutes of audio is buffered ahead of processing, the session closes with error 3007.
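The chunk sizing described above can be sketched as follows. The helper names are hypothetical; the arithmetic assumes the default format (16 kHz, 16-bit mono PCM), where 50 ms is 800 samples, or 1,600 bytes.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate: int = 16_000, chunk_ms: int = 50,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM that cover chunk_ms milliseconds of audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Each yielded chunk would be sent as one binary WebSocket frame, with no JSON or base64 wrapper. When streaming from a file, pausing roughly chunk_ms between sends keeps the client close to real time.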
Speech started
Immediately before the first Turn message of each turn, the server sends a SpeechStarted message. The timestamp field is the start of the turn, in milliseconds relative to the beginning of the audio stream. The confidence field is the average word confidence of the initial transcript.
Universal Streaming skips SpeechStarted and goes straight to the first Turn.
Partial transcript
As the speaker is talking, the server emits one or more Turn messages with end_of_turn: false. These are partial transcripts.
End of turn
When the turn ends, the server emits a Turn message with end_of_turn: true and turn_is_formatted: true. This is the last message for this turn_order and the transcript is fully formatted.
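One way to consume partial and final Turn messages is to keep only the latest transcript per turn_order and mark the turn complete on the final message. The sketch below assumes each Turn has been parsed into a dict with the fields named on this page (turn_order, transcript, end_of_turn, turn_is_formatted). The class name and the require_formatted switch are hypothetical, the latter intended for setups where finals are never formatted.

```python
from typing import Dict, List


class TranscriptView:
    """Tracks the latest transcript for each turn, keyed by turn_order."""

    def __init__(self, require_formatted: bool = True) -> None:
        self.require_formatted = require_formatted
        self.latest: Dict[int, str] = {}
        self.completed: List[int] = []

    def on_turn(self, msg: dict) -> bool:
        """Store the newest transcript; return True if the turn is complete."""
        order = msg["turn_order"]
        # Each Turn supersedes the previous one: replace, never append.
        self.latest[order] = msg.get("transcript", "")
        done = bool(msg.get("end_of_turn")) and (
            bool(msg.get("turn_is_formatted")) or not self.require_formatted
        )
        if done and order not in self.completed:
            self.completed.append(order)
        return done

    def text(self) -> str:
        """Join the latest non-empty transcripts in turn order."""
        ordered = (self.latest[k] for k in sorted(self.latest))
        return " ".join(t for t in ordered if t)
```

Keying by turn_order means a later message for the same turn simply overwrites the earlier one, so nothing is rendered twice.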
Universal Streaming differs. By default (format_turns=false), the final arrives with turn_is_formatted: false. With format_turns=true, you receive two end_of_turn: true messages for the same turn_order: the unformatted final first, then a formatted final right after. In that case, treat a turn as complete only when both end_of_turn and turn_is_formatted are true, or you’ll process every turn twice.
Universal Streaming end-of-turn examples
Default (format_turns=false), single unformatted final:
With format_turns=true, the unformatted final above, then a formatted final right after:
Forcing an endpoint
To end the current turn immediately, for example when your own VAD or a push-to-talk button decides the user is done, send a ForceEndpoint message:
The server then emits the end_of_turn: true message(s) for the current turn right away, without waiting for silence or punctuation.
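A minimal push-to-talk sketch follows. It assumes control messages are JSON text frames identified by a "type" field, which this page does not show explicitly. The PushToTalk class and the send_text callback are hypothetical stand-ins for your UI and WebSocket client.

```python
import json
from typing import Callable


def force_endpoint_frame() -> str:
    """JSON text frame asking the server to end the current turn.

    Assumption: the message is identified by a top-level "type" field.
    """
    return json.dumps({"type": "ForceEndpoint"})


class PushToTalk:
    """Sends ForceEndpoint once each time the talk button is released."""

    def __init__(self, send_text: Callable[[str], None]) -> None:
        self._send_text = send_text  # e.g. the text-send method of a WebSocket
        self.talking = False

    def press(self) -> None:
        self.talking = True

    def release(self) -> None:
        # Only force an endpoint if the button was actually held.
        if self.talking:
            self.talking = False
            self._send_text(force_endpoint_frame())
```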
Updating configuration mid-session
You can change transcription settings at any point after Begin without reconnecting. Only the fields you include are changed:
Not every field applies to every model (mode, prompt, and agent_context are Universal-3 Pro only). See Updating configuration mid-stream for the full list.
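Because only the fields you include are changed, a partial update can be built by dropping anything left unset. This is a sketch under two assumptions not confirmed by this page: that the message carries a top-level "type" field, and that the settings sit next to it at the top level. The helper name and the example prompt value are illustrative.

```python
import json
from typing import Any


def update_configuration_frame(**changes: Any) -> str:
    """Build an UpdateConfiguration text frame from the fields that are set.

    Fields passed as None are omitted, so the server leaves them untouched.
    """
    fields = {name: value for name, value in changes.items() if value is not None}
    if not fields:
        raise ValueError("no configuration fields to update")
    # Assumption: settings are top-level keys alongside "type".
    return json.dumps({"type": "UpdateConfiguration", **fields})
```

For example, update_configuration_frame(prompt="Medical dictation", mode=None) produces a frame containing only the prompt field.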
Keep alive
KeepAlive messages are not required. By default, sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached.
KeepAlive is only relevant if you have configured the inactivity_timeout connection parameter, which closes the session after a period of no audio or messages being sent. If you are using inactivity_timeout and want to keep the session open during periods where no audio is being sent, send a KeepAlive message to reset the inactivity timer:
If the inactivity timeout elapses, the session closes with error 3006 and the message Session terminated due to inactivity: No messages received for N seconds.
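One way to decide when a KeepAlive is due is to track the last time anything was sent and compare it against a fraction of the configured inactivity_timeout. The class below is a sketch: the name, the 0.5 safety factor, and the "type" field in the frame are assumptions rather than documented behavior.

```python
import json
from typing import Optional


def keep_alive_frame() -> str:
    """JSON text frame that resets the inactivity timer (assumed shape)."""
    return json.dumps({"type": "KeepAlive"})


class KeepAliveTimer:
    """Reports when a KeepAlive should be sent to stay under the timeout."""

    def __init__(self, inactivity_timeout_s: float, safety_factor: float = 0.5) -> None:
        # Send well before the timeout; the factor is a hypothetical margin.
        self.interval_s = inactivity_timeout_s * safety_factor
        self.last_sent_at: Optional[float] = None

    def mark_sent(self, now: float) -> None:
        """Call after sending audio or any message, since both count as activity."""
        self.last_sent_at = now

    def due(self, now: float) -> bool:
        """True if nothing has been sent for at least interval_s seconds."""
        if self.last_sent_at is None:
            return True
        return now - self.last_sent_at >= self.interval_s
```

A send loop would call mark_sent after every frame and, during silence, send keep_alive_frame() whenever due returns True.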
Ordering guarantees.
turn_order is monotonically increasing, and all messages for a turn are delivered before the first message of the next turn. Within a turn, each Turn message supersedes the previous one. Render the latest transcript; do not append. A turn is complete on the message where both end_of_turn and turn_is_formatted are true.
Session termination
To end a session, send a Terminate message:
After sending Terminate, keep reading from the WebSocket until you receive the Termination message. The server first flushes any in-flight messages, which can include:
- the final (and formatted) Turn for audio you already sent;
- on Universal Streaming, a closing Turn with an empty transcript for an unfinished turn;
- a SpeakerRevision if speaker_labels=true. The end-of-session refinement adds approximately 400 ms of latency at session close. See Revised speaker labels for the message schema and consumption guidance.
Closing the connection right after sending Terminate silently discards your last transcript.
The Termination message is always the last message:
- audio_duration_seconds is the total audio processed in the session.
- session_duration_seconds is the total wall-clock time the connection was open. This is the duration the session is billed on.
After Termination, no further messages are sent and the server closes the connection with code 1000.
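The shutdown sequence above can be sketched as a drain loop that keeps reading until Termination arrives. The function name is hypothetical, `frames` stands in for whatever iterable of incoming text frames your WebSocket client exposes, and the "type" field used to recognize messages is an assumption about the JSON shape.

```python
import json
from typing import Iterable, List, Tuple


def drain_after_terminate(frames: Iterable[str]) -> Tuple[List[dict], dict]:
    """Read frames after sending Terminate, until Termination is received.

    Returns the flushed messages (final Turn, SpeakerRevision, ...) and the
    Termination message itself.
    """
    flushed: List[dict] = []
    for raw in frames:
        msg = json.loads(raw)
        if msg.get("type") == "Termination":
            return flushed, msg
        # Anything before Termination is in-flight output worth keeping.
        flushed.append(msg)
    raise ConnectionError("connection closed before Termination arrived")
```

Processing the flushed messages before closing the socket is what keeps the last transcript from being lost.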
Errors
If the session fails at any point, the server sends an Error message as a text frame and then closes the connection: