AssemblyAI’s Universal-3 Pro Streaming is the most accurate real-time speech-to-text model designed for voice agents. It delivers formatted, immutable transcripts with sub-300ms latency, exceptional entity accuracy, native multilingual code switching, and a fully promptable interface, all optimized for conversational AI workflows.
The STT component is the “ears” of your voice agent. Transcription errors propagate into the LLM and response logic, so even small accuracy gaps compound in impact. Choosing and configuring the right STT model is one of the highest-leverage decisions you can make when building a voice agent. For guidance on how to evaluate and compare STT models for your use case, see the streaming evaluation guide.
Voice agents need speed, accuracy, and natural turn-taking. Universal-3 Pro Streaming is purpose-built for this:
Sub-300ms latency with formatted output
Exceptional entity accuracy
Punctuation-based turn detection
- Terminal punctuation (. ? !) combined with silence thresholds
- min_turn_silence and max_turn_silence parameters let you tune responsiveness vs. accuracy
Fully promptable
- prompt parameter for transcription instructions
- Mid-session updates with UpdateConfiguration. Adapt the model to each stage of the conversation
- keyterms_prompt for boosting recognition of specific names, brands, and domain terms
Native multilingual support
Universal-3 Pro Streaming supports six languages with automatic code-switching:
The model handles code-switching natively. Speakers can switch between supported languages mid-conversation without any configuration changes. Accuracy improves when you specify the expected language in the prompt. See Supported languages for the full language list and regional dialect reference.
To guide the model toward a specific language, prepend language information to the default prompt:
For multilingual conversations:
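As a minimal sketch of the prepending step, the helper below builds a prompt string from a list of languages. The `Transcribe <language>.` form comes from the prompting tips in this guide; the phrasing for several languages, the helper name `with_language_hint`, and the `DEFAULT_PROMPT` placeholder are assumptions for illustration. Substitute the actual default prompt from the prompting documentation.

```python
def with_language_hint(default_prompt, languages):
    """Prepend a 'Transcribe <language>.' instruction to an existing prompt."""
    if not languages:
        return default_prompt
    if len(languages) == 1:
        hint = f"Transcribe {languages[0]}."
    else:
        # Assumed phrasing for the multilingual case; adjust after testing.
        hint = f"Transcribe {', '.join(languages[:-1])} and {languages[-1]}."
    return f"{hint} {default_prompt}".strip()


# Placeholder only: replace with the real default prompt text.
DEFAULT_PROMPT = "<default prompt text>"

print(with_language_hint(DEFAULT_PROMPT, ["Spanish"]))
print(with_language_hint(DEFAULT_PROMPT, ["English", "Spanish"]))
```

Keeping the default prompt intact after the language hint preserves its turn-detection instructions.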
AssemblyAI provides a speech-to-speech Voice Agent API that abstracts away the complexity of a full voice agent stack: managed STT, LLM, turn detection, and TTS in a single endpoint.
For a cascading architecture, AssemblyAI has the best speech-to-text model. For a complete stack, you need:
LiveKit Agents (recommended)
LiveKit provides the fastest path to a working voice agent with AssemblyAI. See Universal-3 Pro Streaming on LiveKit for a full guide.
Pipecat by Daily
Pipecat is an open-source framework for conversational AI with maximum customizability. See Universal-3 Pro Streaming on Pipecat for a full guide.
For custom builds, connect directly to the WebSocket API:
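A custom build typically starts by assembling the connection URL. The sketch below only builds the query string, using parameter names that appear in this guide (`min_turn_silence`, `max_turn_silence`, `vad_threshold`, `speaker_labels`) plus `sample_rate`. The endpoint in `STREAMING_URL` and the lowercase encoding of booleans are assumptions, and model selection and authentication are deliberately left out; check all of these against the current API reference before use.

```python
from urllib.parse import urlencode

# Assumption: verify this endpoint against the current API reference.
STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(base_url=STREAMING_URL, **params):
    """Return base_url with the given connection parameters as a query string."""
    query = {}
    for key, value in params.items():
        if value is None:
            continue  # skip parameters that were left unset
        if isinstance(value, bool):
            value = str(value).lower()  # assumed encoding: true / false
        query[key] = value
    return f"{base_url}?{urlencode(query)}"


url = build_streaming_url(
    sample_rate=16000,
    min_turn_silence=100,
    max_turn_silence=1000,
    vad_threshold=0.3,
    speaker_labels=None,  # unset, so it is omitted from the URL
)
print(url)
```

The resulting URL can be passed to whichever WebSocket client your stack already uses.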
Universal-3 Pro Streaming uses a punctuation-based turn detection system controlled by two parameters:
How it works:
- Silence reaches min_turn_silence → model checks for terminal punctuation (. ? !)
- Terminal punctuation found → end_of_turn: true
- No terminal punctuation → end_of_turn: false, turn continues
- Silence reaches max_turn_silence → turn forced to end regardless of punctuation

This is different from the legacy Universal-Streaming models, which used a confidence-based end_of_turn_confidence_threshold. Universal-3 Pro Streaming does not use that parameter. Turn decisions are based on punctuation after silence thresholds.
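The decision rule can be modeled in a few lines. This is a client-side illustration of the logic described above, not the service's implementation; the function name `decide_end_of_turn` is hypothetical, and the defaults are the starting values recommended later in this guide.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def decide_end_of_turn(transcript, silence_ms,
                       min_turn_silence=100, max_turn_silence=1000):
    """Return True/False for end_of_turn, or None if no check happens yet."""
    if silence_ms >= max_turn_silence:
        return True  # forced end, regardless of punctuation
    if silence_ms >= min_turn_silence:
        # The punctuation check runs once the minimum silence has elapsed.
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return None  # still inside the minimum silence window


print(decide_end_of_turn("My number is", 150))   # no terminal punctuation yet
print(decide_end_of_turn("Hello there.", 150))   # punctuation after min silence
print(decide_end_of_turn("My number is", 1000))  # max silence reached
```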
Lower silence values produce faster transcripts but can split entities across turns:
For voice agents, the downstream LLM can usually piece together split entities. But if your use case involves entity extraction or alphanumeric dictation, increase min_turn_silence and max_turn_silence during those portions of the conversation using dynamic configuration updates.
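One way to organize this is a small table of silence profiles keyed by conversation stage. The stage names and the `dictation` values below are illustrative assumptions, not recommendations from AssemblyAI; only the `conversation` values match the starting point suggested in this guide.

```python
SILENCE_PROFILES = {
    # Starting values suggested in this guide.
    "conversation": {"min_turn_silence": 100, "max_turn_silence": 1000},
    # Illustrative values for entity or alphanumeric dictation; tune on your audio.
    "dictation": {"min_turn_silence": 400, "max_turn_silence": 2000},
}


def silence_params(stage):
    """Return a copy of the silence settings for a stage, defaulting to conversation."""
    profile = SILENCE_PROFILES.get(stage, SILENCE_PROFILES["conversation"])
    return dict(profile)


print(silence_params("dictation"))
```

The returned dictionary can be sent as the payload of a configuration update when the agent enters or leaves a dictation stage.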
Universal-3 Pro Streaming emits SpeechStarted events when speech has been detected. SpeechStarted is only emitted when the model produces a transcript. This makes it a reliable signal for barge-in handling:
When you receive a SpeechStarted event:
Users often produce short backchannel utterances (“mhm”, “yeah”, “um”, “okay”) while the agent is speaking. Treating every SpeechStarted event as a barge-in causes the agent to stop mid-sentence on these fillers, even though the user didn’t intend to interrupt.
The fix is to gate barge-in on each Turn event during agent speech: suppress the interrupt when the transcript is short or every token is a known backchannel. Implementation depends on your stack:
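Independent of the framework, the gate itself is a small predicate over the Turn transcript. The sketch below assumes a three-word minimum (`min_words`) and uses only the backchannels named above; both are starting points to tune, and the function name is hypothetical.

```python
import re

# Backchannels taken from the examples above; extend for your users.
BACKCHANNELS = {"mhm", "yeah", "um", "okay"}


def should_interrupt(transcript, min_words=3):
    """Return True when a transcript heard during agent speech looks like a real barge-in."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return False
    if all(token in BACKCHANNELS for token in tokens):
        return False  # pure backchannel, keep speaking
    return len(tokens) >= min_words  # short utterances are suppressed


print(should_interrupt("Mhm, yeah."))
print(should_interrupt("Wait, I need to change that"))
```

Note that a strict word-count gate also suppresses short commands such as "stop", so you may want an allowlist of short phrases that always interrupt.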
Universal-3 Pro Streaming includes an internal Silero VAD controlled by the vad_threshold parameter (default 0.3). If you’re also running a local VAD (common in LiveKit and Pipecat), align the thresholds to avoid a dead zone where one detects speech but the other doesn’t:
If you’re in a noisy environment and getting false speech triggers, raise both thresholds together.
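To make the dead zone concrete, the helper below returns the band of speech probabilities in which only one of the two VADs fires. It is a sanity check you could run at startup; the function name is hypothetical, and how you read the local threshold depends on your framework.

```python
def vad_dead_zone(local_threshold, vad_threshold=0.3):
    """Return the (low, high) band where only one VAD detects speech, or None if aligned."""
    low, high = sorted((local_threshold, vad_threshold))
    return None if low == high else (low, high)


print(vad_dead_zone(0.5))  # local VAD stricter than the 0.3 default
print(vad_dead_zone(0.3))  # aligned thresholds
```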
Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for turn detection is applied automatically.
Prompting is a beta feature. We recommend starting without a custom prompt to establish baseline performance, then experimenting to optimize for your use case.
Tips for effective prompts:
- Prepend Transcribe <language>. for non-English or multilingual conversations

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms, up to 100 terms per session:
Best practices for keyterms:
For detailed guidance, see Keyterms prompting.
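Because the list is capped at 100 terms per session, it can help to clean it before sending. The sketch below trims whitespace, drops empty and duplicate entries (case-insensitively), and raises instead of silently truncating. The helper name and the choice to raise are assumptions about what suits your application.

```python
MAX_KEYTERMS = 100  # per-session limit stated above


def prepare_keyterms(terms):
    """Return a cleaned, de-duplicated keyterm list, or raise if it exceeds the limit."""
    seen = set()
    cleaned = []
    for term in terms:
        term = " ".join(term.split())  # collapse stray whitespace
        key = term.casefold()
        if not term or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the limit of {MAX_KEYTERMS}")
    return cleaned


print(prepare_keyterms(["AssemblyAI", " assemblyai ", "", "LiveKit"]))
```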
You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using UpdateConfiguration. This is one of Universal-3 Pro Streaming’s most powerful features for voice agents.
As your voice agent moves through different stages, update keyterms to match what the user is likely to say:
You can also update the transcription prompt mid-session. This is especially powerful when paired with tool calls in your LLM:
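A sketch of building such an update message follows. The four updatable field names come from this section; the JSON shape with a `type` field set to `UpdateConfiguration`, the stage names, and the example terms are assumptions to check against the API reference.

```python
import json

# Fields this guide lists as updatable during an active session.
UPDATABLE_FIELDS = {"prompt", "keyterms_prompt", "min_turn_silence", "max_turn_silence"}


def update_configuration(**fields):
    """Serialize a mid-session update, rejecting fields that are not listed as updatable."""
    if not fields:
        raise ValueError("nothing to update")
    unknown = set(fields) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not updatable mid-session: {sorted(unknown)}")
    # Assumed message shape; confirm against the API reference.
    return json.dumps({"type": "UpdateConfiguration", **fields})


# Hypothetical stages and terms for illustration only.
STAGE_KEYTERMS = {
    "identity": ["date of birth", "member ID"],
    "payment": ["Visa", "Mastercard", "routing number"],
}

print(update_configuration(keyterms_prompt=STAGE_KEYTERMS["payment"]))
```

Send the resulting string over the open WebSocket when your agent transitions to a new stage or a tool call changes what the user is likely to say next.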
Streaming Diarization identifies and labels individual speakers in real time. Each Turn event includes a speaker_label field (e.g., "A", "B") indicating which speaker produced that transcript.
Enable it by adding speaker_labels: true to your connection parameters:
Speaker accuracy improves over the course of a session as the model accumulates embedding context.
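On the receiving side, a handler can prefix each final transcript with its speaker. This sketch assumes Turn messages arrive as dictionaries with `type`, `transcript`, `end_of_turn`, and `speaker_label` keys; `speaker_label` and `end_of_turn` are named in this guide, while the other key names and the `unknown` fallback are assumptions.

```python
from typing import Optional


def format_turn(turn: dict) -> Optional[str]:
    """Return 'Speaker X: text' for a final Turn message, or None for anything else."""
    if turn.get("type") != "Turn" or not turn.get("end_of_turn"):
        return None
    # Fall back to a generic label if no speaker has been assigned.
    speaker = turn.get("speaker_label") or "unknown"
    return f"Speaker {speaker}: {turn.get('transcript', '')}"


print(format_turn({
    "type": "Turn",
    "end_of_turn": True,
    "speaker_label": "A",
    "transcript": "I'd like to book a table.",
}))
```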
With LiveKit:
With Pipecat (including custom formatting):
For more details, see Streaming Diarization and Multichannel.
1. Use the right silence thresholds
Start with min_turn_silence=100 and max_turn_silence=1000. Only increase if you’re seeing entity splitting issues.
2. Tune interruption_delay for faster TTFT
The interruption_delay parameter controls how soon the first partial is emitted. Set interruption_delay=0 for the fastest possible time to first token (~300ms effective). The default of 500ms produces a first partial at ~800ms. See Tuning early partial timing for details.
3. Eliminate additive delays in your orchestrator
In LiveKit with turn_detection="stt", set min_endpointing_delay=0. LiveKit’s default 0.5s delay is additive on top of AssemblyAI’s own endpointing.
4. Use 16kHz sample rate
This balances audio quality and bandwidth. Higher sample rates don’t improve accuracy.
5. Align VAD thresholds
Mismatched VAD thresholds between your local VAD and AssemblyAI create a dead zone that delays interruption. Set both to 0.3.
6. Skip unnecessary features
Only enable speaker_labels if you need diarization. Only use keyterms_prompt if you have domain-specific terms. Each feature adds marginal processing overhead.
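The timing figures in tips 2 and 3 can be combined into a rough latency budget. The arithmetic below assumes the first partial arrives roughly 300ms after `interruption_delay` elapses, which is inferred from the two data points in tip 2 (0 → ~300ms, 500 → ~800ms), and that orchestrator delays simply add to the turn silence. Treat both as approximations to confirm with your own measurements.

```python
BASE_FIRST_PARTIAL_MS = 300  # approximate floor implied by tip 2


def first_partial_ms(interruption_delay_ms=500):
    """Estimate time to first partial for a given interruption_delay (assumed linear)."""
    return BASE_FIRST_PARTIAL_MS + interruption_delay_ms


def end_of_turn_delay_ms(turn_silence_ms, orchestrator_delay_ms=0):
    """Estimate the silence before the agent responds, including any additive orchestrator delay."""
    return turn_silence_ms + orchestrator_delay_ms


print(first_partial_ms(0), first_partial_ms())
print(end_of_turn_delay_ms(100, orchestrator_delay_ms=500))
```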
Universal-3 Pro Streaming sends messages in a specific sequence. Here’s what a typical conversation looks like:
1. Session begins
2. Speech detected
3. Early partial (emitted after 750ms of continuous speech)
4. Silence-based partial (speaker pauses, no terminal punctuation)
5. Final transcript (terminal punctuation found, or max_turn_silence reached)
For Universal-3 Pro Streaming, end_of_turn and turn_is_formatted always have the same value. You can reliably use end_of_turn: true to detect a formatted, final transcript.
6. Session termination
For the complete message reference, see Message sequence.
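A client usually handles this sequence with a dispatch on message type. The sketch below processes a list of raw JSON strings and assumes the type names `Begin`, `SpeechStarted`, `Turn`, and `Termination` and a `transcript` field on Turn messages; confirm the exact names in the message reference.

```python
import json


def summarize_session(raw_messages):
    """Turn a sequence of raw JSON messages into readable event descriptions."""
    events = []
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "Begin":
            events.append("session started")
        elif kind == "SpeechStarted":
            events.append("speech detected")
        elif kind == "Turn":
            # end_of_turn: true marks a formatted, final transcript.
            label = "final" if msg.get("end_of_turn") else "partial"
            events.append(f"{label}: {msg.get('transcript', '')}")
        elif kind == "Termination":
            events.append("session ended")
        # Unrecognized message types are ignored in this sketch.
    return events


sample = [
    '{"type": "Begin"}',
    '{"type": "SpeechStarted"}',
    '{"type": "Turn", "end_of_turn": false, "transcript": "my name is"}',
    '{"type": "Turn", "end_of_turn": true, "transcript": "My name is Ada."}',
    '{"type": "Termination"}',
]
for event in summarize_session(sample):
    print(event)
```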
Keyterms prompting is the single most effective way to improve accuracy on domain-specific terms. Keyterms are especially useful for improving recognition of proper nouns, product names, and technical jargon spoken with accents or in noisy environments. See How Can I Use Prompting to Improve Accuracy? above.
Update keyterms and prompts mid-session based on conversation context. See How Do I Update Configuration Mid-Session? above.
If entities are splitting across turns, increase min_turn_silence (for punctuation-triggered splits) or max_turn_silence (for forced timeout splits). You can do this dynamically mid-session for specific conversation stages like entity dictation.
Universal-3 Pro Streaming handles background noise well out of the box. Avoid adding noise cancellation as a preprocessing step. The artifacts it introduces typically cause more harm than the background noise itself.
For telephony environments with low-quality audio (such as 8 kHz mulaw), you can prompt the model to tag genuinely unclear segments as [unclear] rather than forcing a guess. This helps you identify audio segments that no model (or human) can reliably transcribe, and prevents inaccurate guesses from entering your downstream pipeline.
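Downstream, you can use the tag to decide when the agent should ask the caller to repeat themselves. The sketch below measures the share of words tagged `[unclear]` and compares it to a threshold; the 0.25 default and the function names are illustrative assumptions.

```python
import re

UNCLEAR = re.compile(r"\[unclear\]", re.IGNORECASE)


def unclear_ratio(transcript):
    """Return the fraction of whitespace-separated words that carry an [unclear] tag."""
    words = transcript.split()
    if not words:
        return 0.0
    return len(UNCLEAR.findall(transcript)) / len(words)


def needs_clarification(transcript, threshold=0.25):
    """Return True when enough of the turn is unclear that the agent should ask again."""
    return unclear_ratio(transcript) >= threshold


print(needs_clarification("My account number is [unclear] [unclear]"))
print(needs_clarification("I'd like to check my balance."))
```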
Universal-3 Pro Streaming provides unlimited concurrent streams:
Rate limits:
These limits are designed to never interfere with legitimate applications. Your baseline limit is guaranteed and never decreases, so you can scale smoothly without artificial barriers.
Benchmark scores are a useful starting point, but they don’t tell the full story. To determine which STT model works best for your voice agent in production:
For a complete evaluation framework including accuracy metrics, latency metrics, and ground truth best practices, see the streaming evaluation guide.