Universal-Streaming

Stream audio and receive real-time transcription results. Fast, cost-effective streaming transcription is available in three variants:

- **Universal-Streaming English**: the fastest real-time English transcription
- **Universal-Streaming Multilingual**: multilingual support (English, Spanish, German, French, Portuguese, and Italian) at the same speed and price
- **Whisper-Streaming**: open-source Whisper powered by AssemblyAI's infrastructure with 99+ languages

<Note>
To use our EU server for Streaming STT, replace `streaming.assemblyai.com` with `streaming.eu.assemblyai.com`.
</Note>

Handshake

WSS
wss://streaming.assemblyai.com/v3/ws
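
As a minimal sketch, the endpoint can be selected by region using the host substitution described in the note above. The helper name `streaming_url` and its `use_eu` flag are hypothetical, introduced only for illustration.

```python
DEFAULT_URL = "wss://streaming.assemblyai.com/v3/ws"
DEFAULT_HOST = "streaming.assemblyai.com"
EU_HOST = "streaming.eu.assemblyai.com"


def streaming_url(use_eu: bool = False) -> str:
    """Return the handshake URL, swapping in the EU host when requested."""
    if use_eu:
        # Replace only the host; scheme and path stay the same.
        return DEFAULT_URL.replace(DEFAULT_HOST, EU_HOST, 1)
    return DEFAULT_URL
```

Calling `streaming_url(use_eu=True)` yields the same path on the EU host.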

Headers

Authorization string Optional

Use your API key for authentication, or generate a temporary token and pass it via the token query parameter.
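
The two authentication modes can be sketched as a helper that returns either a header or a query parameter, never both. The function name `handshake_auth` is hypothetical, and sending the API key as the raw `Authorization` header value (with no scheme prefix) is an assumption that this section does not spell out.

```python
from typing import Dict, Optional, Tuple


def handshake_auth(
    api_key: Optional[str] = None,
    token: Optional[str] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, query_params) for exactly one auth mode."""
    if (api_key is None) == (token is None):
        raise ValueError("provide exactly one of api_key or token")
    if api_key is not None:
        # Assumption: the key is sent as the raw header value.
        return {"Authorization": api_key}, {}
    # Temporary tokens travel in the `token` query parameter.
    return {}, {"token": token}
```

The temporary-token mode avoids putting a long-lived API key into code that runs on an end user's device.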

Query parameters

speech_model enum Required
The speech model used for your Streaming session.
Allowed values:
encoding enum Optional Defaults to pcm_s16le
Encoding of the audio stream.
Allowed values:
format_turns enum Optional Defaults to false
Whether to return formatted final transcripts.
Allowed values:
inactivity_timeout integer Optional 5-3600
Time in seconds of inactivity before the session is terminated. If not set, no inactivity timeout is applied.
keyterms_prompt list of strings Optional

A list of words and phrases to improve recognition accuracy for. See Keyterms Prompting for more details.

language_detection enum Optional Defaults to false
Whether to detect the language and return language metadata on utterances and final turns. Only available for the multilingual model.
Allowed values:
max_turn_silence integer Optional Defaults to 1280

The maximum amount of silence in milliseconds allowed in a turn before end of turn is triggered. See Turn Detection for configuration details.

min_turn_silence integer Optional Defaults to 400

The minimum amount of silence in milliseconds required to detect end of turn when confident. See Turn Detection for configuration details.
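
One way to read how `min_turn_silence` and `max_turn_silence` interact is the sketch below. It is an illustration of the two descriptions above, not the server's implementation: the function `end_of_turn` and its `confident` flag (standing in for the model's end-of-turn judgment) are hypothetical.

```python
def end_of_turn(
    silence_ms: int,
    confident: bool,
    min_turn_silence: int = 400,
    max_turn_silence: int = 1280,
) -> bool:
    """Decide whether a turn has ended after `silence_ms` of silence."""
    # Past the maximum, the turn ends regardless of confidence.
    if silence_ms >= max_turn_silence:
        return True
    # Otherwise the shorter threshold applies only when confident.
    return confident and silence_ms >= min_turn_silence
```

Under this reading, lowering `min_turn_silence` makes confident turn endings faster, while `max_turn_silence` bounds how long an uncertain turn can stay open.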

sample_rate integer Required Defaults to 16000
Sample rate of the audio stream.
token string Optional

API token for authentication (if using a temporary token).

vad_threshold double Optional Defaults to 0.4

The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.

end_of_turn_confidence_threshold double Optional Defaults to 0.4 Deprecated

The confidence threshold (0.0 to 1.0) to use when determining if the end of a turn has been reached. See Turn Detection for configuration details.

language enum Optional Defaults to en Deprecated
The language of your audio stream.
Allowed values:
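
The query parameters above can be assembled into a query string with a small helper that also checks the documented ranges before connecting. This is a sketch under stated assumptions: the function `build_query` is hypothetical, and serializing booleans as lowercase `true`/`false` and lists as JSON arrays are assumptions about the wire format that this section does not confirm.

```python
import json
from urllib.parse import urlencode


def build_query(speech_model: str, sample_rate: int = 16000, **options) -> str:
    """Build the handshake query string from the documented parameters."""
    timeout = options.get("inactivity_timeout")
    if timeout is not None and not 5 <= timeout <= 3600:
        raise ValueError("inactivity_timeout must be within 5-3600 seconds")
    vad = options.get("vad_threshold")
    if vad is not None and not 0.0 <= vad <= 1.0:
        raise ValueError("vad_threshold must be within 0.0-1.0")

    params = {"speech_model": speech_model, "sample_rate": sample_rate}
    for key, value in options.items():
        if value is None:
            continue  # unset options are simply omitted
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed encoding
        elif isinstance(value, (list, tuple)):
            value = json.dumps(list(value))  # assumed encoding
        params[key] = value
    return urlencode(params)
```

The result is appended to the handshake URL after a `?`. Validating locally gives a clearer error than a rejected handshake.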

Send

sendAudio string Required format: "binary"
Send audio data chunks for transcription. The payload must be of type bytes and contain audio data between 50ms and 1000ms in length.
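
The 50ms to 1000ms limit translates into a byte range that depends on the sample rate and encoding. The sketch below assumes `pcm_s16le` (2 bytes per sample) and mono audio; the channel count is an assumption, and both function names are hypothetical.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # pcm_s16le stores each sample as 16 bits


def chunk_size_bytes(chunk_ms: int, sample_rate: int = 16000, channels: int = 1) -> int:
    """Return the byte length of `chunk_ms` of audio."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk duration must be between 50ms and 1000ms")
    return sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels


def iter_chunks(pcm: bytes, chunk_ms: int = 100, sample_rate: int = 16000) -> Iterator[bytes]:
    """Split a PCM buffer into chunks suitable for sendAudio."""
    size = chunk_size_bytes(chunk_ms, sample_rate)
    for start in range(0, len(pcm), size):
        # The final chunk may be shorter than `size`; callers may
        # need to pad or drop it if it falls under 50ms.
        yield pcm[start:start + size]
```

At 16000 Hz, a 50ms chunk is 1600 bytes and a 1000ms chunk is 32000 bytes, so each binary message should fall within that range.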
OR
sendUpdateConfiguration object Required
Update streaming configuration parameters during an active session.
OR
sendForceEndpoint object Required
Manually force an endpoint in the transcription.
OR
sendSessionTermination object Required
Gracefully terminate the streaming session.
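
Unlike audio, the three control messages are objects, which would typically be sent as JSON text frames. The sketch below shows one plausible shape. The `type` values (`UpdateConfiguration`, `ForceEndpoint`, `Terminate`) and the placement of updated parameters at the top level of the object are assumptions, since this section does not list the message fields; the helper names are hypothetical.

```python
import json


def update_configuration(**params) -> str:
    """Serialize a mid-session configuration update."""
    # Assumption: updated parameters sit beside the "type" field.
    return json.dumps({"type": "UpdateConfiguration", **params})


def force_endpoint() -> str:
    """Serialize a request to force an endpoint immediately."""
    return json.dumps({"type": "ForceEndpoint"})


def terminate() -> str:
    """Serialize a graceful session termination request."""
    return json.dumps({"type": "Terminate"})
```

For example, `update_configuration(max_turn_silence=2000)` would produce an update object, assuming that parameter can be changed during a session.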

Receive

receiveSessionBegins object Required
Receive confirmation that the streaming session has successfully started.
OR
receiveSpeechStarted object Required
Receive a notification that speech has been detected in the audio stream.
OR
receiveTurn object Required

Receive a formatted turn-based transcription result.

OR
receiveTermination object Required
Receive confirmation that the session has been terminated by the server.
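
Since the server can send any of the four message kinds above, a client usually routes each incoming message by kind. The sketch below assumes each message is a JSON object carrying a `type` field; that field name and any type strings used as handler keys (for example `Turn`) are assumptions not stated in this section, and `dispatch` is a hypothetical helper.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]


def dispatch(raw: str, handlers: Dict[str, Handler]) -> str:
    """Parse one server message and call the handler for its type."""
    message = json.loads(raw)
    kind = message.get("type", "")
    handler = handlers.get(kind)
    if handler is not None:
        handler(message)
    # Unknown types are ignored so new message kinds do not break clients.
    return kind
```

A caller might register one handler per message kind and invoke `dispatch` for every text frame received on the socket.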