Universal-Streaming
Universal-Streaming
Universal-Streaming
Stream audio and receive real-time transcription results. Fast, cost-effective streaming transcription available in three variants:
streaming.assemblyai.com with
streaming.eu.assemblyai.com. Use your API key for authentication, or alternatively generate a temporary token and pass it via the token query parameter.
Whether to return formatted final transcripts with punctuation, casing, and inverse text normalization (e.g. dates, times, phone numbers). Does not control digit rendering.
API token for authentication (if using a temporary token).
The confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection.
Send audio data chunks for transcription. The payload must be of type bytes and contain audio data between 50ms and 1000ms in length. When streaming from a pre-recorded file, pace the chunks at approximately real-time (for example, sleep for the chunk’s duration between sends) — sending chunks in a tight loop can produce inconsistent Turn messages.
Receive a formatted turn-based transcription result.
Receive an LLM Gateway response for a finalized turn. Emitted once per turn when llm_gateway is configured on the connection.
A list of words and phrases to improve recognition accuracy for. See Keyterms Prompting for more details.
The maximum amount of silence in milliseconds allowed in a turn before end of turn is triggered. See Turn Detection for configuration details.
The minimum amount of silence in milliseconds required to detect end of turn when confident. See Turn Detection for configuration details.
Whether to enable Streaming Speaker Diarization. When enabled, each Turn event will include a speaker_label field and each final word in the words array will include a speaker field for word-level speaker attribution.
The maximum number of speakers expected in the audio stream (1-10). Setting this can improve speaker label accuracy when you know the number of speakers in advance. Only used when speaker_labels is enabled. See Streaming Diarization for more details.
The confidence threshold (0.0 to 1.0) to use when determining if the end of a turn has been reached. See Turn Detection for configuration details.
Note: This parameter is only supported for the Universal-streaming model.
Enable domain-specific transcription models to improve accuracy for specialized terminology. Set to "medical-v1" to enable Medical Mode for improved accuracy of medical terms such as medications, procedures, conditions, and dosages. Supported languages: English (en), Spanish (es), German (de), French (fr). If used with an unsupported language, the parameter is ignored and a warning is returned.
JSON-stringified LLM Gateway configuration that processes each finalized turn. Follows the same interface as the Chat Completions endpoint and accepts model, messages, tools, tool_choice, post_processing_steps, and max_tokens. See Apply LLM Gateway to Streaming for the full schema and examples.
Send a keep-alive message to reset the inactivity_timeout timer. This is not necessary by default — sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached. This message is only needed if you have set inactivity_timeout and want to keep the session open during periods where no audio is being sent.