Messages

{
  "type": "UpdateConfiguration",
  "prompt": "Transcribe product names accurately.",
  "keyterms_prompt": [
    "AssemblyAI",
    "Krabby Patty"
  ],
  "min_turn_silence": 700,
  "max_turn_silence": 1600,
  "agent_context": "Sure — what date would you like to book?"
}

{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132,
  "configuration": {
    "model": "universal-3-5-pro",
    "mode": "balanced",
    "api_version": "2025-05-12",
    "speaker_labels": false,
    "redact_pii": false,
    "filter_profanity": false,
    "domain": null,
    "voice_focus": null
  }
}

{
  "type": "Turn",
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hello world.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "text": "Hello",
      "start": 0,
      "end": 500,
      "confidence": 0.99
    },
    {
      "text": "world.",
      "start": 500,
      "end": 1000,
      "confidence": 0.98
    }
  ]
}

{
  "type": "SpeakerRevision",
  "revisions": [
    {
      "turn_order": 3,
      "speaker_label": "B",
      "words": [
        {
          "text": "Hello",
          "speaker": "B",
          "start": 1200,
          "end": 1450
        },
        {
          "text": "there.",
          "speaker": "B",
          "start": 1450,
          "end": 1780
        }
      ]
    },
    {
      "turn_order": 7,
      "speaker_label": "A",
      "words": [
        {
          "text": "Got it.",
          "speaker": "A",
          "start": 4100,
          "end": 4520
        }
      ]
    }
  ]
}

{
  "type": "LLMGatewayResponse",
  "turn_order": 0,
  "transcript": "Hello world.",
  "data": {
    "request_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help?"
        },
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "input_tokens": 12,
      "output_tokens": 8,
      "total_tokens": 20,
      "prompt_tokens_details": {},
      "completion_tokens_details": {}
    },
    "request": {},
    "response_time": 123456789
  }
}

Streaming WebSocket API

Stream audio and receive real-time transcription results.

Endpoints

WebSocket URL	Description
`wss://streaming.assemblyai.com/v3/ws`	Global (default). Latency-optimized — automatically routes each connection to the nearest region.
`wss://streaming.us.assemblyai.com/v3/ws`	US data residency. Audio and transcription data never leaves the US.
`wss://streaming.eu.assemblyai.com/v3/ws`	EU data residency. Audio and transcription data never leaves the EU.

Use a data residency endpoint for compliance requirements; see Cloud endpoints & data residency.

Headers

Header	Description
`Authorization`	Your AssemblyAI API key (no `Bearer` prefix). Required on every connection unless you authenticate with the `token` query parameter instead.
`AssemblyAI-Version`	Optional. API version pin. Defaults to the latest version.

Browsers cannot set headers on a WebSocket connection. In that case, generate a temporary token server-side and pass it via the token query parameter instead. Never expose your permanent API key in a URL or in client-side code.

Query parameters

All query parameters below are optional — a connection with only authentication set uses the default speech model (universal-3-5-pro) at 16 kHz pcm_s16le. Parameters apply to every model unless their description says otherwise.
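
As a rough illustration of how these parameters attach to the endpoint, the sketch below builds a connection URL with Python's standard library. The helper name `build_streaming_url` is hypothetical, and sending booleans as lowercase `true`/`false` strings is an assumption based on the "Available options" lists in this reference.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"  # global endpoint from the Endpoints table


def build_streaming_url(base_url: str = BASE_URL, **params) -> str:
    """Append query parameters to the WebSocket URL, skipping unset (None) values."""
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # leave the parameter out so the server default applies
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumption: lowercase booleans
        query[name] = str(value)
    return f"{base_url}?{urlencode(query)}" if query else base_url


url = build_streaming_url(
    sample_rate=16000,
    encoding="pcm_s16le",
    speaker_labels=True,
    mode=None,  # unset, so it is omitted from the URL
)
print(url)
```

Omitting unset values, rather than sending empty strings, keeps the server-side defaults described for each parameter in effect.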

token

type:string

optional

Authenticate with a temporary token via query parameter. Use this where request headers cannot be set — for example, browser WebSocket connections — and prefer the Authorization header everywhere else. Generate the temporary token server-side; never expose your permanent API key in a URL.

speech_model

type:enum

optional

The speech model to use for the session. See Select the speech model for the differences between models.

Available options: universal-3-5-pro, universal-streaming-english, universal-streaming-multilingual

language_codes

type:string

optional

Steers transcription toward a set of languages by biasing output toward them on a per-token basis while still allowing native code-switching among them. Pass a list of language codes for the languages you expect (for example, ["en", "es"]), or a single-element list (for example, ["es"]) for a monolingual session. When unset, no steering is applied and the model code-switches natively across all of its supported languages. This is distinct from language_detection, which only controls whether the detected language is reported. Accepted codes: en, es, fr, de, it, pt, tr, nl, sv, no, da, fi, hi, vi, ar, he, ja, zh. Universal-3.5 Pro Streaming only.

language_detection

type:enum

optional

Whether to return language_code and language_confidence in turn messages. Available on Universal-3.5 Pro Streaming (which natively code-switches across all of its supported languages by default) and on Universal Streaming Multilingual only.

Available options: true, false

domain

type:enum

optional

Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr). If used with an unsupported language, the parameter is ignored and a warning is returned.

Available options: medical-v1

mode

type:enum

optional

Latency and accuracy preset that controls the model's turn-detection and partial-emission defaults. max_accuracy favors transcription quality, min_latency favors speed, and balanced trades off between the two. When omitted, the server applies its own preset, which determines the defaults for the mode-dependent fields min_turn_silence and interruption_delay. Universal-3.5 Pro Streaming only.

Available options: max_accuracy, min_latency, balanced

encoding

type:enum

optional

Encoding of the audio stream. pcm_s16le and pcm_mulaw are raw PCM formats. opus is raw Opus packets, where each binary WebSocket message must contain exactly one Opus packet. ogg_opus is an Ogg-encapsulated Opus byte stream (the format produced by ffmpeg, gstreamer, opusenc, and browser MediaRecorder); binary WebSocket messages can be arbitrary chunks of the stream. For both Opus encodings, sample_rate is ignored — the stream is self-describing.

Available options: pcm_s16le, pcm_mulaw, opus, ogg_opus

sample_rate

type:string

optional

Sample rate of the audio stream in Hz. Accepts any integer from 8000 to 96000. Ignored for Opus encodings (opus, ogg_opus).
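
To make the relationship between `sample_rate`, `encoding`, and message size concrete, here is a minimal sketch that computes how many bytes one chunk of raw PCM holds and slices a buffer accordingly. It assumes single-channel audio, and the 50 ms chunk duration is an illustrative choice, not a value taken from this reference. `pcm_s16le` uses 2 bytes per sample and `pcm_mulaw` uses 1.

```python
def pcm_chunk_bytes(sample_rate: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes in one chunk of single-channel raw PCM audio."""
    if not 8000 <= sample_rate <= 96000:
        raise ValueError("sample_rate must be between 8000 and 96000 Hz")
    samples = sample_rate * chunk_ms // 1000  # whole samples per chunk
    return samples * bytes_per_sample


def split_chunks(audio: bytes, chunk_size: int):
    """Yield consecutive slices of the buffer; the last one may be shorter."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


size = pcm_chunk_bytes(16000, 50)               # pcm_s16le at 16 kHz
mulaw_size = pcm_chunk_bytes(8000, 50, 1)       # pcm_mulaw at 8 kHz
chunks = list(split_chunks(bytes(4000), size))  # 4000 bytes of silence
print(size, mulaw_size, [len(c) for c in chunks])
```

Each yielded slice would be sent as one binary WebSocket message. This sizing does not apply to the Opus encodings, where `sample_rate` is ignored.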

min_turn_silence

type:string

optional

Silence duration in milliseconds before a speculative end-of-turn check. On Universal-3.5 Pro Streaming the check is punctuation-based (if terminal punctuation is found the turn ends, otherwise a partial is emitted and the turn continues) and the default is mode-dependent, set by the mode preset. On Universal Streaming the check is confidence-based and the default is 400 ms. Clamped to the range 50 to 10000 ms. See Configuring Turn Detection for configuration details.

max_turn_silence

type:string

optional

Maximum silence in milliseconds before the turn is forced to end, regardless of punctuation. Defaults are 1536 ms on Universal-3.5 Pro Streaming (768 ms when speaker_labels is enabled) and 1280 ms on Universal Streaming. See Configuring Turn Detection for configuration details.
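
The interplay of `min_turn_silence` and `max_turn_silence` on Universal-3.5 Pro Streaming can be pictured with the simplified model below. It is a client-side mental model only, not the server's implementation: the function names are hypothetical, and the set of characters treated as terminal punctuation is an assumption. Because the `min_turn_silence` default is mode-dependent, the sketch takes it as an explicit argument.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")  # assumption: which marks count as terminal


def clamp_min_turn_silence(value_ms: int) -> int:
    """Mirror the documented clamp to the range 50 to 10000 ms."""
    return max(50, min(10000, value_ms))


def turn_decision(silence_ms: int, transcript: str,
                  min_turn_silence: int, max_turn_silence: int = 1536) -> str:
    """Classify the current silence as 'listening', 'partial', or 'end_of_turn'."""
    min_turn_silence = clamp_min_turn_silence(min_turn_silence)
    if silence_ms >= max_turn_silence:
        return "end_of_turn"  # forced end, regardless of punctuation
    if silence_ms >= min_turn_silence:
        # speculative check: terminal punctuation ends the turn
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        return "partial"  # the turn continues
    return "listening"


print(turn_decision(800, "Hello world.", min_turn_silence=700))
print(turn_decision(800, "I would like to", min_turn_silence=700))
print(turn_decision(1600, "I would like to", min_turn_silence=700))
```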

end_of_turn_confidence_threshold

type:string

optional

The confidence threshold (0.0 to 1.0) to use when determining if the end of a turn has been reached. See Turn Detection for configuration details. Universal Streaming (English and Multilingual) only.

vad_threshold

type:string

optional

The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection. Defaults are 0.2 on Universal-3.5 Pro Streaming (0.5 when speaker_labels is enabled) and 0.4 on Universal Streaming.

interruption_delay

type:string

optional

Delay in milliseconds before the first partial is emitted. Accepts a value from 0 to 1000. Useful for tuning voice agent barge-in responsiveness or for getting earlier partials for early LLM inference. Larger values make interruption detection more confident; smaller values shorten the time to first partial. The server adds a fixed 256 ms on top of this value, so 0 yields an effective delay of 256 ms and 500 yields 756 ms. The default is mode-dependent and set by the server based on the mode preset. Universal-3.5 Pro Streaming only.
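
The arithmetic for the effective delay is small enough to sketch directly. The helper name `effective_interruption_delay` is hypothetical; the 256 ms constant and the 0 to 1000 range come from the description of this parameter.

```python
SERVER_FIXED_DELAY_MS = 256  # fixed amount the server adds, per this reference


def effective_interruption_delay(interruption_delay: int) -> int:
    """Return the delay actually observed before the first partial, in ms."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000")
    return interruption_delay + SERVER_FIXED_DELAY_MS


print(effective_interruption_delay(0))    # 256
print(effective_interruption_delay(500))  # 756
```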

continuous_partials

type:string

optional

Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When enabled (default), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (timed by interruption_delay) is unaffected. When speaker_labels is enabled the server disables continuous partials by default. Universal-3.5 Pro Streaming only.

include_partial_turns

type:string

optional

Whether to emit partial transcripts during the turn. When enabled (default), partial transcripts are sent while speech is still in progress, in addition to final turns. When disabled, only final turns (with end_of_turn true) are sent. Defaults to false when redact_pii is enabled, to prevent unredacted partial transcripts from reaching the client; set explicitly to true to override.
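
Because the defaults for `include_partial_turns` and `continuous_partials` depend on other parameters, it can help to resolve them client-side before deciding what to expect from a session. The sketch below encodes only the default rules stated in this reference; the function name `resolve_partial_settings` is hypothetical, and `continuous_partials` applies to Universal-3.5 Pro Streaming only.

```python
from typing import Optional


def resolve_partial_settings(
    redact_pii: bool = False,
    speaker_labels: bool = False,
    include_partial_turns: Optional[bool] = None,
    continuous_partials: Optional[bool] = None,
) -> dict:
    """Return the effective partial-transcript settings; None means 'not set'."""
    if include_partial_turns is None:
        # defaults to false when redact_pii is enabled, true otherwise
        include_partial_turns = not redact_pii
    if continuous_partials is None:
        # disabled by default when speaker_labels is enabled
        continuous_partials = not speaker_labels
    return {
        "include_partial_turns": include_partial_turns,
        "continuous_partials": continuous_partials,
    }


print(resolve_partial_settings())
print(resolve_partial_settings(redact_pii=True))
print(resolve_partial_settings(redact_pii=True, include_partial_turns=True))
```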

format_turns

type:enum

optional

Whether to return formatted final transcripts. Universal Streaming (English and Multilingual) only.

Available options: true, false

prompt

type:string

optional

A contextual prompt describing what the audio is about, such as its domain, scenario, or full conversation details. The model uses this context to better recognize the vocabulary it makes likely, for example priming medical terminology for a cardiology call. It carries context about your audio, not transcription instructions, so formatting or behavioral commands (such as punctuation rules) are not supported. Maximum 1750 characters. Universal-3.5 Pro Streaming only. See Prompting and keyterms for details.

keyterms_prompt

type:string

optional

A list of words and phrases to improve recognition accuracy for. Maximum 100 terms. See Keyterms Prompting for more details.

agent_context

type:string

optional

Your voice agent's spoken text (TTS reply). The model uses this as context for the next user turn, which improves accuracy on short or ambiguous replies and on spelled-out entities like emails or IDs. Set at connection time to seed the model with your agent's opening greeting, and/or update mid-stream via UpdateConfiguration after each agent reply. Each UpdateConfiguration replaces the previously set value. Maximum 1750 characters per value. Universal-3.5 Pro Streaming only. See Context Carryover.
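
A minimal sketch of the mid-stream update follows. It builds the `UpdateConfiguration` payload for the latest agent reply and checks the 1750-character limit client-side. The helper name `agent_context_update` is hypothetical, and raising an error on over-long text is a design choice of this sketch; how the server handles over-long values is not stated here.

```python
import json

MAX_AGENT_CONTEXT_CHARS = 1750  # per-value limit from this reference


def agent_context_update(agent_reply: str) -> str:
    """Serialize an UpdateConfiguration message carrying the agent's last reply."""
    if len(agent_reply) > MAX_AGENT_CONTEXT_CHARS:
        raise ValueError(
            f"agent_context is {len(agent_reply)} characters; "
            f"the limit is {MAX_AGENT_CONTEXT_CHARS}"
        )
    return json.dumps({"type": "UpdateConfiguration", "agent_context": agent_reply})


payload = agent_context_update("Sure, what date would you like to book?")
print(payload)
```

The resulting string would be sent as a text message on the open WebSocket after each agent reply. Since each update replaces the previous value, only the most recent reply needs to be sent.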

previous_context_n_turns

type:string

optional

Advanced. Maximum number of prior conversation entries (user transcripts and any agent_context values) carried forward as context for each transcription. Range 0 to 100. Set to 0 to disable automatic context carryover entirely. When unset, the server default applies (currently 5 turns). Most integrations should leave this unset. Universal-3.5 Pro Streaming only.

speaker_labels

type:enum

optional

Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution.

Available options: true, false

max_speakers

type:string

optional

A hard cap on the number of speaker labels in the audio stream (integer, 1-10). This is a strict limit, not a hint — once it is reached, any additional speakers are merged into the closest existing label rather than given a new one. Give the model a little headroom above the number of speakers you expect; setting it too high can cause over-splitting and return more speakers than are actually present. Only used when speaker_labels is enabled. See Streaming Diarization for more details.

voice_focus

type:enum

optional

Enable Voice Focus to isolate the primary voice and suppress background noise before transcription. Set to near-field for close-talking microphones (for example, headsets or phones) or far-field for distant microphones (for example, conference rooms). When unset, Voice Focus is off.

Available options: near-field, far-field

voice_focus_threshold

type:string

optional

Controls how aggressively Voice Focus suppresses background audio, from 0.0 (least) to 1.0 (most). Higher values are more aggressive. Requires voice_focus to be set, otherwise a validation error is returned.
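
Since `voice_focus_threshold` without `voice_focus` produces a validation error, a client can catch the mistake before connecting. The sketch below is one possible pre-flight check; the function name `voice_focus_params` is hypothetical and the error messages are this sketch's own, not the server's.

```python
from typing import Optional

VOICE_FOCUS_OPTIONS = ("near-field", "far-field")


def voice_focus_params(voice_focus: Optional[str] = None,
                       voice_focus_threshold: Optional[float] = None) -> dict:
    """Validate the Voice Focus pair and return only the parameters that are set."""
    if voice_focus is not None and voice_focus not in VOICE_FOCUS_OPTIONS:
        raise ValueError(f"voice_focus must be one of {VOICE_FOCUS_OPTIONS}")
    if voice_focus_threshold is not None:
        if voice_focus is None:
            raise ValueError("voice_focus_threshold requires voice_focus to be set")
        if not 0.0 <= voice_focus_threshold <= 1.0:
            raise ValueError("voice_focus_threshold must be between 0.0 and 1.0")
    params = {"voice_focus": voice_focus, "voice_focus_threshold": voice_focus_threshold}
    return {name: value for name, value in params.items() if value is not None}


print(voice_focus_params("near-field", 0.7))
print(voice_focus_params())  # Voice Focus off: nothing to send
```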

redact_pii

type:enum

optional

Whether to redact PII from the transcribed text using the Redact PII model. Only applies to final turns. See PII Redaction for more details.

Available options: true, false

redact_pii_policies

type:string

optional

The list of PII Redaction policies to enable. Requires redact_pii to be true. See PII redaction for more details.

redact_pii_sub

type:enum

optional

The replacement logic for detected PII: entity_name or hash. Requires redact_pii to be true. See PII redaction for more details.

Available options: entity_name, hash

filter_profanity

type:enum

optional

Whether to filter profanity from the transcribed text. See Profanity Filtering for more details.

Available options: true, false

llm_gateway

type:string

optional

JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
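
"JSON-stringified" means the configuration object is serialized to a JSON string first and then URL-encoded like any other query value. The sketch below shows that two-step encoding and verifies that it round-trips. The model name `example-model` and the system prompt are placeholders, and the `messages` shape is assumed from the Chat Completions convention this parameter is said to follow; the linked guide remains the reference for the full schema.

```python
import json
from urllib.parse import parse_qs, urlencode

gateway_config = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [
        {"role": "system", "content": "Summarize the user's request in one sentence."}
    ],
    "max_tokens": 100,
}

# Step 1: JSON-stringify the object. Step 2: URL-encode it as a query value.
query = urlencode({"llm_gateway": json.dumps(gateway_config)})

# Decoding the query string recovers the original configuration.
decoded = json.loads(parse_qs(query)["llm_gateway"][0])
print(decoded == gateway_config)
```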

inactivity_timeout

type:string

optional

Time in seconds of inactivity before the session is terminated (integer, minimum 5, maximum 3600). If not set, no inactivity timeout is applied.

Audio Data Chunk

type:string

Client sends audio data as raw binary.

Update Streaming Configuration

type:object

Client message to update streaming configuration parameters during an active session.

Force Endpoint

type:object

Client message to manually force an endpoint in the transcription.

Terminate Session (Client Initiated)

type:object

Client message to gracefully terminate the streaming session.

Keep Alive

type:object

Client message to reset the inactivity timeout timer. This is not necessary by default — sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. This message is only needed if you have set inactivity_timeout and want to keep the session open during periods where no audio is being sent.

Session Begins Confirmation

type:object

Server message indicating the streaming session has successfully started.
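
A client might use the `Begin` message to record the session `id` and work out how long the session has left. The sketch below assumes `expires_at` is a Unix timestamp in seconds, which matches the magnitude of the sample value but is not stated in this reference. The function name `seconds_until_expiry` is hypothetical.

```python
import json
import time
from typing import Optional


def seconds_until_expiry(begin_message: str, now: Optional[float] = None) -> float:
    """Return the seconds remaining before the session's expires_at."""
    message = json.loads(begin_message)
    if message.get("type") != "Begin":
        raise ValueError(f"expected a Begin message, got {message.get('type')!r}")
    current = time.time() if now is None else now
    # assumption: expires_at is expressed in seconds since the Unix epoch
    return message["expires_at"] - current


begin = '{"type": "Begin", "id": "3207b601-2054-48df-ba77-8784dfcf9fb8", "expires_at": 1772570132}'
remaining = seconds_until_expiry(begin, now=1772570072)  # fixed clock for the example
print(remaining)
```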

Speech Started

type:object

Server message indicating that speech has been detected.

Turn Transcript

type:object

Server message containing a turn-based transcription result.
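
One way to consume `Turn` messages is to parse each one and keep only the final turns, identified by `end_of_turn`. The sketch below uses field names from the sample `Turn` message on this page; the `TurnUpdate` class and the helper names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class TurnUpdate:
    turn_order: int
    transcript: str
    is_final: bool


def parse_turn(raw: str) -> TurnUpdate:
    """Parse a Turn message into the fields most clients need."""
    message = json.loads(raw)
    if message.get("type") != "Turn":
        raise ValueError(f"expected a Turn message, got {message.get('type')!r}")
    return TurnUpdate(
        turn_order=message["turn_order"],
        transcript=message["transcript"],
        is_final=bool(message.get("end_of_turn", False)),
    )


def record_turn(finals: dict, update: TurnUpdate) -> None:
    """Store final turns keyed by turn_order; partial turns are ignored."""
    if update.is_final:
        finals[update.turn_order] = update.transcript


raw = '{"type": "Turn", "turn_order": 0, "end_of_turn": true, "transcript": "Hello world."}'
finals = {}
record_turn(finals, parse_turn(raw))
print(finals)
```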

Revised Speaker Labels

type:object

Server message containing corrected speaker labels for any turns that changed. Emitted as a single message after the client sends Terminate, when speaker_labels is enabled. A session may produce zero or one SpeakerRevision message; if sent, only changed turns are included in the revisions array. See Revised speaker labels for details.
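
Applying a `SpeakerRevision` amounts to overwriting the stored speaker data for each turn listed in `revisions`. The sketch below assumes the client has kept its final turns in a dictionary keyed by `turn_order`; that storage layout and the function name `apply_speaker_revisions` are hypothetical, while the field names come from the sample message on this page.

```python
def apply_speaker_revisions(turns: dict, revision_message: dict) -> list:
    """Overwrite speaker data on stored turns; return the turn_orders changed."""
    changed = []
    for revision in revision_message.get("revisions", []):
        turn = turns.get(revision["turn_order"])
        if turn is None:
            continue  # the client never stored this turn; nothing to revise
        turn["speaker_label"] = revision["speaker_label"]
        turn["words"] = revision["words"]
        changed.append(revision["turn_order"])
    return changed


stored = {
    3: {"transcript": "Hello there.", "speaker_label": "A", "words": []},
    5: {"transcript": "Okay.", "speaker_label": "A", "words": []},
}
message = {
    "type": "SpeakerRevision",
    "revisions": [
        {"turn_order": 3, "speaker_label": "B",
         "words": [{"text": "Hello", "speaker": "B", "start": 1200, "end": 1450}]},
        {"turn_order": 7, "speaker_label": "A", "words": []},
    ],
}
changed = apply_speaker_revisions(stored, message)
print(changed, stored[3]["speaker_label"], stored[5]["speaker_label"])
```

Turns absent from `revisions` are left untouched, which matches the note that only changed turns are included.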

Session Terminated (Server Confirmation)

type:object

Server message confirming session termination with session statistics.

LLM Gateway Response

type:object

Server message containing an LLM Gateway response for a finalized turn.
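
Pulling the assistant's text out of an `LLMGatewayResponse` can be sketched as follows, using the nesting from the sample message on this page (`data.choices[0].message.content`). The function name `assistant_reply` is hypothetical, and returning `None` for other message types or for an empty `choices` list is a choice made for this sketch.

```python
from typing import Optional


def assistant_reply(message: dict) -> Optional[str]:
    """Return the first choice's content, or None if there is nothing to read."""
    if message.get("type") != "LLMGatewayResponse":
        return None
    choices = message.get("data", {}).get("choices", [])
    if not choices:
        return None
    return choices[0]["message"]["content"]


sample = {
    "type": "LLMGatewayResponse",
    "turn_order": 0,
    "transcript": "Hello world.",
    "data": {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Hello! How can I help?"},
                "finish_reason": "stop",
            }
        ]
    },
}
print(assistant_reply(sample))
```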
