Universal-3 Pro Streaming API
This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS, with no LiveKit, Pipecat, or other orchestrator in the loop.
Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages. This guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.
If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead. Semantic interruption handling is built in there.
A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:
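The sketch below shows one way such a consumer could be structured. It leaves out the socket and authentication and covers only the dispatch step: each JSON text frame is decoded and routed by event name. The class `StreamingConsumer` is hypothetical, and the payload keys `type`, `id`, `transcript`, and `end_of_turn` are assumptions made for illustration; confirm the real field names against the message sequence reference.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class StreamingConsumer:
    """Routes decoded streaming messages to one handler per event type."""

    def __init__(self) -> None:
        self.session_id: Optional[str] = None
        self.user_speaking = False
        self.completed_turns: List[str] = []
        self.terminated = False
        self._handlers: Dict[str, Callable[[Dict[str, Any]], None]] = {
            "Begin": self._on_begin,
            "SpeechStarted": self._on_speech_started,
            "Turn": self._on_turn,
            "Termination": self._on_termination,
        }

    def handle_raw(self, raw: str) -> Optional[str]:
        """Decode one JSON text frame; return the handled event type, if any."""
        message = json.loads(raw)
        event_type = message.get("type")  # assumed key carrying the event name
        handler = self._handlers.get(event_type)
        if handler is None:
            return None  # unknown event types are ignored
        handler(message)
        return event_type

    def _on_begin(self, message: Dict[str, Any]) -> None:
        # Assumed: the session identifier arrives under "id".
        self.session_id = message.get("id")

    def _on_speech_started(self, message: Dict[str, Any]) -> None:
        self.user_speaking = True

    def _on_turn(self, message: Dict[str, Any]) -> None:
        # Assumed: "end_of_turn" marks a finished turn, "transcript" holds its text.
        if message.get("end_of_turn"):
            self.user_speaking = False
            self.completed_turns.append(message.get("transcript", ""))

    def _on_termination(self, message: Dict[str, Any]) -> None:
        self.terminated = True


consumer = StreamingConsumer()
for frame in (
    '{"type": "Begin", "id": "session-1"}',
    '{"type": "SpeechStarted"}',
    '{"type": "Turn", "transcript": "Book a table for two.", "end_of_turn": true}',
    '{"type": "Termination"}',
):
    consumer.handle_raw(frame)
```

In a real agent, every text frame read from the WebSocket would be passed to `handle_raw`, and the `Turn` handler would forward completed transcripts to the LLM instead of collecting them in a list.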
For the full message protocol, including all event fields, audio framing, and termination, see the Universal-3 Pro message sequence reference.
Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:
Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
While the agent is speaking, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.
The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.
How it works:
Each Turn event is checked. _should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly. This prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.
The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case.
MIN_WORDS = 2 is a reasonable default. Raise it if you see frequent two-word filler (“uh okay”, “yeah right”) slipping through.
If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling. See LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.
Three presets covering most voice-agent use cases:
For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.