This page is the canonical reference for the streaming WebSocket protocol: every message the client sends, every message the server emits, and the order they appear in. For the field-level schema of each message, see the Streaming API reference.

Sequence diagram

Messages at a glance

Client → server
Message | Frame type | Purpose
Audio | Binary | Raw audio bytes (PCM, no JSON or base64 wrapper). Send ~50 ms chunks.
UpdateConfiguration | Text (JSON) | Change transcription settings mid-session.
ForceEndpoint | Text (JSON) | Immediately end the current turn.
KeepAlive | Text (JSON) | Reset the inactivity timer (only needed with inactivity_timeout).
Terminate | Text (JSON) | End the session.
Server → client
Message | When | Purpose
Begin | Once, on connect | Session ID, expiration time, and the configuration the server actually applied.
SpeechStarted | Per turn (Universal-3 Pro only) | Speech detected; precedes the turn’s first Turn message.
Turn | Repeatedly | Partial and final transcripts; see the walkthrough below.
SpeakerRevision | If speaker_labels=true | Revised speaker labels for earlier turns.
Termination | Once, last message | Session totals; nothing follows it.
Error | On failure, before close | Error code and detail; the connection then closes.
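A receive loop usually branches on the type field of each server message. The following is a minimal sketch, assuming every server message arrives as a JSON text frame carrying a type field, as in the examples on this page. The dispatch function and the handlers mapping are hypothetical names, not part of any SDK.

```python
import json
from typing import Callable, Dict


def dispatch(frame: str, handlers: Dict[str, Callable[[dict], None]]) -> bool:
    """Parse one text frame and call the handler registered for its type.

    Returns True if a handler ran, False if the type had no handler
    (unknown types are skipped rather than treated as fatal).
    """
    msg = json.loads(frame)
    handler = handlers.get(msg.get("type"))
    if handler is None:
        return False
    handler(msg)
    return True
```

Ignoring unregistered types keeps the loop tolerant of message types it does not care about, such as SpeakerRevision when speaker labels are not rendered.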

Session initialization

When the session begins, you receive a Begin message with the session ID, expiration time, and the configuration the server applied to your session.
{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132,
  "configuration": {
    "model": "u3-rt-pro",
    "mode": "balanced",
    "api_version": "1.0.0"
  }
}
expires_at is a Unix timestamp (seconds). When it is reached, the server closes the session with error 3008.
Always check that configuration.model matches the speech_model you requested. Unrecognized or misspelled query parameters are ignored rather than rejected, so this echo is the fastest way to catch a typo (for example, speechModel instead of speech_model).
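The check described above can be sketched as a small helper that validates the Begin message and reports how long the session has left. This is an illustration only: check_begin and its parameters are hypothetical names, and the field names are taken from the Begin example above.

```python
import time
from typing import Optional


def check_begin(begin: dict, requested_model: str, now: Optional[float] = None) -> float:
    """Verify the applied model and return the seconds left until expires_at."""
    if begin.get("type") != "Begin":
        raise ValueError(f"expected Begin, got {begin.get('type')!r}")
    applied = begin["configuration"]["model"]
    if applied != requested_model:
        # Most likely a misspelled query parameter that the server ignored.
        raise RuntimeError(f"requested {requested_model!r} but server applied {applied!r}")
    current = time.time() if now is None else now
    return begin["expires_at"] - current
```

Failing fast here surfaces a configuration typo at connect time instead of as unexpected transcription behavior later in the session.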

Sending audio

Audio is sent as binary WebSocket frames containing raw audio bytes in the encoding and sample rate you set at connection time (default: 16 kHz, 16-bit mono PCM). Do not wrap audio in JSON and do not base64-encode it. A text frame that isn’t valid JSON is rejected and closes the session.
Send audio in chunks of roughly 50 ms (800 samples at 16 kHz).
You can send audio faster than real time (for example when streaming from a file), but throughput is throttled at ~1.25× real time. If more than five minutes of audio is buffered ahead of processing, the session closes with error 3007.
import websocket

# ws is an already-connected websocket.WebSocket (websocket-client package)
# audio_chunk is raw PCM bytes (e.g. 1600 bytes = 800 samples ≈ 50 ms at 16 kHz mono)
ws.send(audio_chunk, websocket.ABNF.OPCODE_BINARY)
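To produce chunks of the recommended size from a larger buffer, the byte count follows from the sample rate and sample width. A minimal sketch, assuming mono audio; chunk_pcm is a hypothetical helper, and its defaults mirror the default format described above.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int = 16000,
              bytes_per_sample: int = 2, chunk_ms: int = 50) -> Iterator[bytes]:
    """Yield consecutive chunk_ms-sized slices of raw mono PCM bytes."""
    # 16000 samples/s * 50 ms / 1000 = 800 samples, * 2 bytes = 1600 bytes
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]
```

Each yielded chunk can be sent as one binary frame. When streaming from a file, pausing for about chunk_ms between sends keeps the stream close to real time and well below the buffering limit.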

Speech started

Immediately before the first Turn message of each turn, the server sends a SpeechStarted message. The timestamp field is the start of the turn, in milliseconds relative to the beginning of the audio stream. The confidence field is the average word confidence of the initial transcript.
{
  "type": "SpeechStarted",
  "timestamp": 1216,
  "confidence": 0.987654
}
Universal Streaming skips SpeechStarted and goes straight to the first Turn.

Partial transcript

As the speaker is talking, the server emits one or more Turn messages with end_of_turn: false. These are partial transcripts.
{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "My name is",
  "end_of_turn_confidence": 0,
  "words": [
    {
      "start": 1216,
      "end": 1627,
      "text": "My",
      "confidence": 0.956314,
      "word_is_final": false
    },
    {
      "start": 1668,
      "end": 2490,
      "text": "name",
      "confidence": 0.999393,
      "word_is_final": false
    },
    {
      "start": 2531,
      "end": 3067,
      "text": "is",
      "confidence": 0.753325,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}
The cadence and shape of partials depends on the model. See Universal-3 Pro Streaming and Universal Streaming for the details of how each model produces partials.

End of turn

When the turn ends, the server emits a Turn message with end_of_turn: true and turn_is_formatted: true. This is the last message for this turn_order and the transcript is fully formatted.
{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "start": 1216,
      "end": 1635,
      "text": "My",
      "confidence": 0.956583,
      "word_is_final": true
    },
    {
      "start": 1676,
      "end": 2515,
      "text": "name",
      "confidence": 0.999199,
      "word_is_final": true
    },
    {
      "start": 2556,
      "end": 2975,
      "text": "is",
      "confidence": 0.999535,
      "word_is_final": true
    },
    {
      "start": 3016,
      "end": 4155,
      "text": "Sonny.",
      "confidence": 0.316031,
      "word_is_final": true
    }
  ],
  "utterance": "My name is Sonny.",
  "type": "Turn"
}
Universal Streaming differs. By default (format_turns=false), the final arrives with turn_is_formatted: false, and that single message completes the turn. With format_turns=true, you receive two end_of_turn: true messages for the same turn_order: the unformatted final first, then a formatted final right after. In that case, treat a turn as complete only when both end_of_turn and turn_is_formatted are true, or you’ll process every turn twice.
Default (format_turns=false) — single unformatted final:
{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "my name is sonny",
  "end_of_turn_confidence": 0.812345,
  "words": [
    { "start": 1216, "end": 1635, "text": "my",    "confidence": 0.956583, "word_is_final": true },
    { "start": 1676, "end": 2515, "text": "name",  "confidence": 0.999199, "word_is_final": true },
    { "start": 2556, "end": 2975, "text": "is",    "confidence": 0.999535, "word_is_final": true },
    { "start": 3016, "end": 4155, "text": "sonny", "confidence": 0.316031, "word_is_final": true }
  ],
  "type": "Turn"
}
With format_turns=true — the unformatted final above, then a formatted final right after:
{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 0.812345,
  "words": [
    { "start": 1216, "end": 1635, "text": "My",     "confidence": 0.956583, "word_is_final": true },
    { "start": 1676, "end": 2515, "text": "name",   "confidence": 0.999199, "word_is_final": true },
    { "start": 2556, "end": 2975, "text": "is",     "confidence": 0.999535, "word_is_final": true },
    { "start": 3016, "end": 4155, "text": "Sonny.", "confidence": 0.316031, "word_is_final": true }
  ],
  "type": "Turn"
}
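The completion rules above can be captured in one predicate. A minimal sketch, in which is_turn_complete and its expect_formatted flag are hypothetical names: pass True for Universal-3 Pro and for Universal Streaming with format_turns=true, and False for Universal Streaming with the default format_turns=false.

```python
def is_turn_complete(msg: dict, expect_formatted: bool = True) -> bool:
    """Return True only for the message that completes a turn."""
    if msg.get("type") != "Turn" or not msg.get("end_of_turn"):
        return False
    if expect_formatted:
        # Skip the unformatted final; the formatted one follows right after.
        return bool(msg.get("turn_is_formatted"))
    # Default Universal Streaming: the single unformatted final is the last message.
    return True
```

Keeping the decision in one place avoids handling the same turn_order twice when format_turns=true.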

Forcing an endpoint

To end the current turn immediately, for example when your own VAD or a push-to-talk button decides the user is done, send a ForceEndpoint message:
ws.send(json.dumps({"type": "ForceEndpoint"}))
The server responds with the end_of_turn: true message(s) for the current turn right away, without waiting for silence or punctuation.

Updating configuration mid-session

You can change transcription settings at any point after Begin without reconnecting. Only the fields you include are changed:
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "end_of_turn_confidence_threshold": 0.8,
    "min_end_of_turn_silence_when_confident": 400,
    "max_turn_silence": 1500,
    "keyterms_prompt": ["Keanu Reeves", "AssemblyAI"],
}))
There is no acknowledgement message; the new settings apply to audio processed after the update. They do not retroactively affect the turn in progress. Some fields are model-specific (for example mode, prompt, and agent_context are Universal-3 Pro only). See Updating configuration mid-stream for the full list.
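Because only the included fields are changed, a small builder that drops unset values helps avoid overwriting settings by accident. A minimal sketch; build_update_configuration is a hypothetical helper, and the field names in the usage line come from the example above.

```python
import json


def build_update_configuration(**settings) -> str:
    """Serialize an UpdateConfiguration message, omitting settings left as None."""
    payload = {"type": "UpdateConfiguration"}
    payload.update({key: value for key, value in settings.items() if value is not None})
    return json.dumps(payload)
```

For example, build_update_configuration(max_turn_silence=1500, keyterms_prompt=None) produces a message that changes max_turn_silence and leaves every other setting untouched.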

Keep alive

KeepAlive messages are not required. By default, sessions remain open until they are explicitly terminated or the 3-hour maximum session duration is reached. KeepAlive matters only if you set the inactivity_timeout connection parameter, which closes the session after a period in which no audio or messages are sent. To keep such a session open while no audio is being sent, send a KeepAlive message to reset the inactivity timer:
ws.send(json.dumps({"type": "KeepAlive"}))
If the inactivity timeout elapses, the session closes with error 3006 and the message Session terminated due to inactivity: No messages received for N seconds.
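One way to decide when a KeepAlive is needed is to track when you last sent anything and compare the gap against the configured timeout. A minimal sketch: keepalive_due is a hypothetical helper, and the safety_factor of 0.5 (send at half the timeout) is an assumption made here, not a documented recommendation.

```python
def keepalive_due(last_sent_at: float, now: float,
                  inactivity_timeout: float, safety_factor: float = 0.5) -> bool:
    """Return True once the idle gap reaches a fraction of inactivity_timeout.

    last_sent_at and now are seconds on the same clock, e.g. time.monotonic().
    """
    return (now - last_sent_at) >= inactivity_timeout * safety_factor
```

Update last_sent_at after every audio frame or message you send, so KeepAlive messages are only emitted during real gaps in the audio.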
Ordering guarantees. turn_order is monotonically increasing, and all messages for a turn are delivered before the first message of the next turn. Within a turn, each Turn message supersedes the previous one. Render the latest transcript; do not append. A turn is complete on the message where both end_of_turn and turn_is_formatted are true (on Universal Streaming with the default format_turns=false, on the single message where end_of_turn is true).
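The replace-don't-append rule can be sketched as a buffer keyed by turn_order. This is an illustration under the ordering guarantees above; TranscriptBuffer is a hypothetical class, not part of any SDK.

```python
class TranscriptBuffer:
    """Keeps the latest transcript per turn; newer Turn messages overwrite older ones."""

    def __init__(self) -> None:
        self.turns: dict = {}

    def apply(self, msg: dict) -> None:
        if msg.get("type") != "Turn":
            return
        # Overwrite, never append: each Turn message supersedes the previous one.
        self.turns[msg["turn_order"]] = msg["transcript"]

    def text(self) -> str:
        """Join the non-empty transcripts in turn order."""
        return " ".join(self.turns[k] for k in sorted(self.turns) if self.turns[k])
```

Calling text() after each message gives the string to render, with the in-progress turn always reflecting its most recent partial.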

Session termination

To end a session, send a Terminate message:
ws.send(json.dumps({"type": "Terminate"}))
After you send Terminate, keep reading from the WebSocket until you receive the Termination message. The server first flushes any in-flight messages, which can include:
  • the final (and formatted) Turn for audio you already sent;
  • on Universal Streaming, a closing Turn with an empty transcript for an unfinished turn;
  • a SpeakerRevision if speaker_labels=true. The end-of-session refinement adds approximately 400 ms of latency at session close. See Revised speaker labels for the message schema and consumption guidance.
Closing the socket as soon as you send Terminate silently discards your last transcript. The Termination message is always the last message:
{
  "type": "Termination",
  "audio_duration_seconds": 13,
  "session_duration_seconds": 14
}
  • audio_duration_seconds is the total audio processed in the session.
  • session_duration_seconds is the total wall-clock time the connection was open. This is the duration the session is billed on.
After Termination, no further messages are sent and the server closes the connection with code 1000.
Always terminate sessions explicitly. Streaming is billed per session. Sessions that are not terminated remain open and continue to accrue charges until the server auto-closes them after 3 hours (error code 3008). See Common errors for more details.
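The shutdown sequence above (send Terminate, then keep reading until Termination) can be sketched as a drain loop. This is a minimal sketch with hypothetical names: send and recv stand in for your client's text-frame send and receive calls (for example ws.send and ws.recv on a websocket-client connection), and every received frame is assumed to be JSON text.

```python
import json
from typing import Callable, List


def terminate_and_drain(send: Callable[[str], None], recv: Callable[[], str]) -> List[dict]:
    """Send Terminate, then collect every message up to and including Termination."""
    send(json.dumps({"type": "Terminate"}))
    drained: List[dict] = []
    while True:
        msg = json.loads(recv())
        drained.append(msg)
        # Termination is always last; an Error also ends the session.
        if msg.get("type") in ("Termination", "Error"):
            return drained
```

The returned list holds the flushed final Turn messages (and any SpeakerRevision) followed by the Termination totals, so nothing sent after Terminate is lost.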

Errors

If the session fails at any point, the server sends an Error message as a text frame and then closes the connection:
{
  "type": "Error",
  "error_code": 3007,
  "error": "Audio transmission rate exceeded: too much audio buffered"
}
See Common errors for the full list of error codes and how to handle reconnection.
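In a receive loop, it can help to turn Error messages into exceptions so that reconnection logic lives in one place. A minimal sketch: StreamingError and raise_on_error are hypothetical names, and the descriptions cover only the three codes mentioned on this page, not the full list.

```python
# Codes described on this page; see Common errors for the complete list.
KNOWN_ERRORS = {
    3006: "inactivity timeout elapsed",
    3007: "too much audio buffered ahead of processing",
    3008: "session expired (expires_at or maximum duration reached)",
}


class StreamingError(Exception):
    def __init__(self, code: int, detail: str) -> None:
        self.code = code
        self.detail = detail
        hint = KNOWN_ERRORS.get(code, "unrecognized code")
        super().__init__(f"{code} ({hint}): {detail}")


def raise_on_error(msg: dict) -> None:
    """Raise StreamingError if msg is an Error message; otherwise do nothing."""
    if msg.get("type") == "Error":
        raise StreamingError(msg["error_code"], msg["error"])
```

Callers can then branch on the code attribute, for example slowing down the sender after 3007 or opening a fresh session after 3008.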