Stream audio and receive real-time transcription results. Fast, cost-effective streaming transcription is available in three variants:
  • Universal-Streaming English — the fastest real-time English transcription
  • Universal-Streaming Multilingual — multilingual support (English, Spanish, German, French, Portuguese, and Italian) at the same speed and price
  • Whisper-Streaming — open-source Whisper powered by AssemblyAI’s infrastructure with 99+ languages
To use the EU server for Streaming STT, replace streaming.assemblyai.com with streaming.eu.assemblyai.com.
WSS wss://streaming.assemblyai.com/v3/ws

Authentication

Authenticate by passing your API key in the Authorization header. Alternatively, generate a temporary token and pass it via the token query parameter.
Authorization
string
Use your API key for authentication, or alternatively generate a temporary token and pass it via the token query parameter.
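As a rough illustration of the two authentication options, the sketch below returns the headers and query parameters for whichever credential is supplied. The helper name build_auth is hypothetical, and it assumes the API key is sent as the raw Authorization header value with no scheme prefix.

```python
from typing import Dict, Optional, Tuple


def build_auth(api_key: Optional[str] = None,
               temp_token: Optional[str] = None) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) for exactly one credential type.

    Hypothetical helper: assumes the API key is the raw header value.
    """
    if (api_key is None) == (temp_token is None):
        raise ValueError("provide exactly one of api_key or temp_token")
    if api_key is not None:
        # The API key travels in the Authorization header.
        return {"Authorization": api_key}, {}
    # A temporary token travels in the `token` query parameter instead.
    return {}, {"token": temp_token}
```

Keeping the two paths separate makes it harder to send a long-lived API key in a URL by accident.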

Query parameters

speech_model
string
The speech model used for your Streaming session. Allowed values: universal-streaming-english, universal-streaming-multilingual, whisper-rt.
encoding
string
default:"pcm_s16le"
Encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
format_turns
boolean
default:"false"
Whether to return formatted final transcripts. Allowed values: true, false.
inactivity_timeout
integer
Optional time in seconds of inactivity before session is terminated (integer, minimum 5, maximum 3600). If not set, no inactivity timeout is applied.
keyterms_prompt
string
A list of words and phrases for which to improve recognition accuracy. See Keyterms Prompting for more details.
language_detection
boolean
default:"false"
Whether to detect the language and return language metadata on utterances and final turns. Only available for the multilingual model. Allowed values: true, false.
max_turn_silence
integer
default:"1280"
The maximum amount of silence in milliseconds allowed in a turn before end of turn is triggered. See Turn Detection for configuration details.
min_turn_silence
integer
default:"400"
The minimum amount of silence in milliseconds required to detect end of turn when confident. See Turn Detection for configuration details.
sample_rate
integer
default:"16000"
Sample rate of the audio stream.
speaker_labels
boolean
default:"false"
Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution. Allowed values: true, false.
max_speakers
integer
The maximum number of speakers expected in the audio stream (integer, 1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when speaker_labels is enabled. See Streaming Diarization for more details.
token
string
API token for authentication (if using a temporary token).
vad_threshold
number
default:"0.4"
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
end_of_turn_confidence_threshold
number
default:"0.4"
The confidence threshold (0.0 to 1.0) to use when determining if the end of a turn has been reached. See Turn Detection for configuration details. Note: This parameter is only supported for the Universal-Streaming model.
domain
string
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr). If used with an unsupported language, the parameter is ignored and a warning is returned. Allowed values: medical-v1.
language
string
default:"en"
The language of your audio stream. Deprecated. Allowed values: en, multi.
llm_gateway
string
JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
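The query parameters above are appended to the WebSocket URL. A minimal sketch of assembling that URL follows; build_streaming_url is a hypothetical helper. Encoding booleans as lowercase true/false follows the allowed values listed above, and JSON-stringifying a dict matches the llm_gateway description. Treating list values the same way (for example for keyterms_prompt) is an assumption.

```python
import json
from urllib.parse import urlencode

BASE_HOST = "streaming.assemblyai.com"
EU_HOST = "streaming.eu.assemblyai.com"


def build_streaming_url(params: dict, eu: bool = False) -> str:
    """Build the v3 streaming URL with URL-encoded query parameters."""
    host = EU_HOST if eu else BASE_HOST
    encoded = {}
    for name, value in params.items():
        if isinstance(value, bool):
            # Check bool before anything else: str(True) would yield "True".
            encoded[name] = "true" if value else "false"
        elif isinstance(value, (dict, list)):
            # Assumption: structured values are sent JSON-stringified.
            encoded[name] = json.dumps(value)
        else:
            encoded[name] = str(value)
    query = urlencode(encoded)
    base = f"wss://{host}/v3/ws"
    return f"{base}?{query}" if query else base
```

For example, build_streaming_url({"sample_rate": 16000, "format_turns": True}, eu=True) targets the EU server with two parameters set.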

Messages sent by the client

Audio Data Chunk

The client sends audio as raw binary (application/octet-stream), not JSON. Each payload must be of type bytes and contain between 50 ms and 1000 ms of audio. When streaming from a pre-recorded file, pace the chunks at approximately real time (for example, sleep for the chunk’s duration between sends); sending chunks in a tight loop can produce inconsistent Turn messages.
"\x10\x00\x20\x00\x30\x00\x40\x00\x30\x00\x20\x00\x10\x00\x00\x00\xf0\xff\xe0\xff\xd0\xff\xc0\xff"
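To make the 50 ms to 1000 ms constraint concrete, the sketch below converts a chunk duration into a byte count and slices a buffer accordingly. It assumes mono audio, 2 bytes per sample for pcm_s16le and 1 byte per sample for pcm_mulaw. The helper names are hypothetical.

```python
from typing import Iterator

# Bytes per sample for the two supported encodings (mono audio assumed).
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_mulaw": 1}


def chunk_size_bytes(sample_rate: int, chunk_ms: int,
                     encoding: str = "pcm_s16le") -> int:
    """Number of bytes that hold chunk_ms milliseconds of audio."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms must be between 50 and 1000")
    # Compute whole samples first so a chunk never splits a sample.
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def iter_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

When replaying a file, a caller would typically sleep for chunk_ms / 1000 seconds after sending each chunk. Note that the final slice can be shorter than the requested duration, so a caller may need to pad or drop it if it falls under 50 ms.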

Update Streaming Configuration

Client message to update streaming configuration parameters during an active session.
type
string
required
Allowed values: UpdateConfiguration.
end_of_turn_confidence_threshold
number
Confidence threshold (0-1) for detecting end of turn. See Turn Detection for configuration details.
min_turn_silence
integer
Minimum silence duration in ms when confident about end of turn. See Turn Detection for configuration details.
max_turn_silence
integer
The maximum amount of silence in milliseconds allowed in a turn before end of turn is triggered. See Turn Detection for configuration details.
{
  "type": "UpdateConfiguration",
  "sample_rate": 16000,
  "encoding": "pcm_s16le",
  "format_turns": true,
  "keyterms_prompt": [
    "AssemblyAI",
    "Universal Streaming"
  ]
}
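A possible way to build this message in code is sketched below. It is limited to the three turn-detection fields documented above, omits anything left unset, and range-checks the threshold. The function name update_configuration is hypothetical.

```python
import json
from typing import Optional


def update_configuration(end_of_turn_confidence_threshold: Optional[float] = None,
                         min_turn_silence: Optional[int] = None,
                         max_turn_silence: Optional[int] = None) -> str:
    """Serialize an UpdateConfiguration message with only the fields provided."""
    message = {"type": "UpdateConfiguration"}
    if end_of_turn_confidence_threshold is not None:
        if not 0.0 <= end_of_turn_confidence_threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        message["end_of_turn_confidence_threshold"] = end_of_turn_confidence_threshold
    if min_turn_silence is not None:
        message["min_turn_silence"] = min_turn_silence  # milliseconds
    if max_turn_silence is not None:
        message["max_turn_silence"] = max_turn_silence  # milliseconds
    return json.dumps(message)
```

The returned string would be sent as a text frame on the open WebSocket.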

Force Endpoint

Client message to manually force an endpoint in the transcription.
type
string
required
Allowed values: ForceEndpoint.
{
  "type": "ForceEndpoint"
}

Terminate Session (Client Initiated)

Client message to gracefully terminate the streaming session.
type
string
required
Allowed values: Terminate.
{
  "type": "Terminate"
}

Keep Alive

Client message to reset the inactivity timeout timer. This is not necessary by default — sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. This message is only needed if you have set inactivity_timeout and want to keep the session open during periods where no audio is being sent.
type
string
required
Allowed values: KeepAlive.
{
  "type": "KeepAlive"
}
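ForceEndpoint, Terminate, and KeepAlive all share the same single-field shape, so one small helper can serialize any of them. The sketch below also suggests a KeepAlive interval derived from inactivity_timeout. Both function names and the safety_factor parameter are hypothetical; sending at half the timeout is only one reasonable choice.

```python
import json

CONTROL_TYPES = ("ForceEndpoint", "Terminate", "KeepAlive")


def control_message(kind: str) -> str:
    """Serialize one of the field-less client control messages."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message type: {kind}")
    return json.dumps({"type": kind})


def keepalive_interval(inactivity_timeout: int, safety_factor: float = 0.5) -> float:
    """Seconds to wait between KeepAlive messages while no audio is sent."""
    if not 5 <= inactivity_timeout <= 3600:
        raise ValueError("inactivity_timeout must be between 5 and 3600 seconds")
    if not 0.0 < safety_factor < 1.0:
        raise ValueError("safety_factor must be strictly between 0 and 1")
    # Send well before the timeout would fire to absorb network jitter.
    return inactivity_timeout * safety_factor
```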

Messages received from the server

Session Begins Confirmation

Server message indicating the streaming session has successfully started.
type
string
required
Identifies the type of the message. Allowed values: Begin.
id
string
required
Unique identifier for the streaming session.
expires_at
integer
required
Unix timestamp indicating when the session will expire.
{
  "type": "Begin",
  "id": "b8e7c1a2-4f3d-4e90-9a6b-1c2d3e4f5a6b",
  "expires_at": 1748390400
}
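A client might parse the Begin message as follows to obtain the session id and a timezone-aware expiry time. The sketch assumes expires_at is expressed in seconds, as the example value suggests. parse_begin is a hypothetical name.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def parse_begin(raw: str) -> Tuple[str, datetime]:
    """Return (session_id, expiry as UTC datetime) from a Begin message."""
    msg = json.loads(raw)
    if msg.get("type") != "Begin":
        raise ValueError("not a Begin message")
    # Assumption: expires_at is a Unix timestamp in seconds.
    expires = datetime.fromtimestamp(msg["expires_at"], tz=timezone.utc)
    return msg["id"], expires
```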

Formatted Turn Result

Server message containing a formatted turn-based transcription result.
type
string
required
Allowed values: Turn.
turn_order
integer
required
Order of this turn in the conversation.
turn_is_formatted
boolean
required
Whether this turn has been formatted.
end_of_turn
boolean
required
Whether this marks the end of a turn. See Turn Detection for more information.
transcript
string
required
Transcript of all finalized words in the turn.
utterance
string
Finalized text at the moment a pause in speech is detected. Empty string on all other Turn messages. A turn can contain multiple utterances.
language_code
string
The language of the turn. Only populated when language detection is enabled and an utterance is complete or turn is final.
language_confidence
number
The confidence score for the detected language, between 0 (low confidence) and 1 (high confidence). Only populated when language detection is enabled and an utterance is complete or turn is final.
speaker_label
string
The speaker label for this turn (e.g. A, B). Only present when speaker_labels is enabled. Short turns with less than approximately 1 second of audio will have the label UNKNOWN. See Streaming Diarization for more details.
end_of_turn_confidence
number
required
The confidence score that this is the end of a turn, between 0.0 (low confidence) and 1.0 (high confidence). See Turn Detection for more information.
words
array
required
Array of word-level details for this turn.
{
  "type": "Turn",
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hello world.",
  "end_of_turn_confidence": 0.98,
  "words": [
    {
      "text": "Hello",
      "start": 0,
      "end": 500,
      "confidence": 0.99
    },
    {
      "text": "world.",
      "start": 500,
      "end": 1000,
      "confidence": 0.98
    }
  ]
}
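The sketch below shows one way a client could decide when a Turn message carries a finished transcript. It assumes that, when format_turns is enabled, the client prefers to wait for the message with turn_is_formatted set to true; it also assumes word start and end values are in milliseconds, which the field list above does not state. Both helper names are hypothetical.

```python
from typing import Optional, Tuple


def completed_transcript(turn: dict, format_turns: bool = False) -> Optional[str]:
    """Return the transcript if this Turn message completes a turn, else None."""
    if turn.get("type") != "Turn" or not turn["end_of_turn"]:
        return None
    if format_turns and not turn["turn_is_formatted"]:
        # Assumption: a formatted version of this turn is still expected.
        return None
    return turn["transcript"]


def speech_span(turn: dict) -> Optional[Tuple[int, int]]:
    """Return (start of first word, end of last word), or None if no words."""
    words = turn["words"]
    if not words:
        return None
    return words[0]["start"], words[-1]["end"]
```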

Session Terminated (Server Confirmation)

Server message confirming session termination with session statistics.
type
string
required
Indicates the session has been terminated. Allowed values: Termination.
audio_duration_seconds
integer
required
Duration of the audio in seconds.
session_duration_seconds
integer
required
Duration of the session in seconds.
{
  "type": "Termination",
  "audio_duration_seconds": 0,
  "session_duration_seconds": 0
}
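Every server message carries a type field, so a receive loop can route on it. A minimal routing sketch follows; the dispatch function and the handlers mapping are hypothetical, and silently ignoring unrecognized types is a design choice rather than documented behavior.

```python
import json
from typing import Any, Callable, Dict


def dispatch(raw: str, handlers: Dict[str, Callable[[dict], Any]]) -> Any:
    """Decode a server message and pass it to the handler for its type."""
    msg = json.loads(raw)
    handler = handlers.get(msg.get("type"))
    if handler is None:
        # Unrecognized message types are skipped rather than treated as errors.
        return None
    return handler(msg)
```

A Termination handler, for instance, could read audio_duration_seconds and session_duration_seconds to log session statistics before closing the socket.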

LLM Gateway Response

Server message containing an LLM Gateway response for a finalized turn.
type
string
required
Identifies the type of the message. Allowed values: LLMGatewayResponse.
turn_order
integer
required
The order of the finalized turn that triggered the LLM Gateway call.
transcript
string
required
The finalized turn transcript that triggered the LLM Gateway call.
data
object
required
The chat completions response from the LLM Gateway.
{
  "type": "LLMGatewayResponse",
  "turn_order": 0,
  "transcript": "Hello world.",
  "data": {
    "request_id": "c4a91f7e-2b8d-4c50-8e16-9d6b3a2f1c08",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help?"
        },
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "input_tokens": 12,
      "output_tokens": 8,
      "total_tokens": 20,
      "prompt_tokens_details": {},
      "completion_tokens_details": {}
    },
    "request": {},
    "response_time": 842
  }
}
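Based on the example payload above, the assistant text sits under data.choices[].message.content. The sketch below collects it defensively; assistant_replies is a hypothetical name, and treating choices, message, and content as optional is an assumption about how the response might vary.

```python
from typing import List


def assistant_replies(msg: dict) -> List[str]:
    """Collect assistant message contents from an LLMGatewayResponse."""
    if msg.get("type") != "LLMGatewayResponse":
        raise ValueError("not an LLMGatewayResponse message")
    replies = []
    for choice in msg["data"].get("choices", []):
        content = choice.get("message", {}).get("content")
        if content is not None:
            replies.append(content)
    return replies
```

Pairing the result with turn_order lets a client match each reply to the turn that triggered it.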