Best Practices for Building Voice Agents
Introduction
AssemblyAI’s Universal-3 Pro Streaming is the most accurate real-time speech-to-text model designed for voice agents. It delivers formatted, immutable transcripts with sub-300ms latency, exceptional entity accuracy, native multilingual code switching, and a fully promptable interface — all optimized for conversational AI workflows.
Why Universal-3 Pro Streaming for Voice Agents?
Voice agents need speed, accuracy, and natural turn-taking. Universal-3 Pro Streaming is purpose-built for this:
Sub-300ms latency with formatted output
- Immutable transcripts arrive fully formatted (punctuation, capitalization) — no waiting for a separate formatting step
- Every final transcript is ready for immediate LLM processing
Exceptional entity accuracy
- Credit card numbers, phone numbers, email addresses, physical addresses, and names are transcribed with high accuracy
- Short utterances like “yes”, “no”, “mmhmm” are handled reliably
Punctuation-based turn detection
- Turn boundaries are determined by terminal punctuation (.?!) combined with silence thresholds
- Configurable min_turn_silence and max_turn_silence parameters let you tune responsiveness vs. accuracy
- No confidence-score guessing — the model understands when a sentence is complete
Fully promptable
- Custom prompt parameter for transcription instructions
- Dynamic prompting mid-session via UpdateConfiguration — adapt the model to each stage of the conversation
- keyterms_prompt for boosting recognition of specific names, brands, and domain terms
Native multilingual support
- Supports English, Spanish, French, German, Italian, and Portuguese
- Automatic code-switching between languages within a single session
- Language-specific prompting for improved accuracy
What Languages Does Universal-3 Pro Streaming Support?
Universal-3 Pro Streaming supports six languages with automatic code-switching:
- English
- Spanish
- French
- German
- Italian
- Portuguese
The model handles code-switching natively — speakers can switch between supported languages mid-conversation without any configuration changes. Accuracy improves when you specify the expected language in the prompt. See Supported languages for the full language list and regional dialect reference.
To guide the model toward a specific language, prepend language information to the default prompt:
For multilingual conversations:
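As a sketch of this pattern (the default prompt text below is a placeholder, not the model's actual default), language guidance can be prepended like this:

```python
# Placeholder default — not the actual Universal-3 Pro Streaming default prompt.
DEFAULT_PROMPT = "Transcribe the audio with punctuation and capitalization."

def language_prompt(language: str) -> str:
    """Guide the model toward a single expected language."""
    return f"Transcribe {language}. {DEFAULT_PROMPT}"

def multilingual_prompt(languages: list[str]) -> str:
    """Allow code-switching between several expected languages."""
    return f"Transcribe {' and '.join(languages)}. {DEFAULT_PROMPT}"

print(language_prompt("Spanish"))
print(multilingual_prompt(["English", "German"]))
```

The same "Transcribe &lt;language&gt;." prefix works for both cases; for multilingual sessions, list every language you expect the speaker to use.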
How Do I Get Started?
Complete voice agent stack
AssemblyAI provides speech-to-text. For a complete voice agent, you need:
- Speech-to-Text (STT): AssemblyAI Universal-3 Pro Streaming
- Large Language Model (LLM): OpenAI, Anthropic, Google, etc.
- Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
- Orchestration: LiveKit, Pipecat, or custom build
Pre-built integrations
LiveKit Agents (recommended)
LiveKit provides the fastest path to a working voice agent with AssemblyAI. See Universal-3 Pro Streaming on LiveKit for a full guide.
Pipecat by Daily
Pipecat is an open-source framework for conversational AI with maximum customizability. See Universal-3 Pro Streaming on Pipecat for a full guide.
Direct WebSocket connection
For custom builds, connect directly to the WebSocket API:
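A minimal sketch of assembling the connection URL (the endpoint path and exact query-parameter names are assumptions — check the API reference for the authoritative values):

```python
from urllib.parse import urlencode

# Hypothetical endpoint — verify against the API reference.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"

params = {
    "sample_rate": 16000,       # 16 kHz PCM audio
    "min_turn_silence": 100,    # ms of silence before checking punctuation
    "max_turn_silence": 1000,   # ms of silence that forces end of turn
}

ws_url = f"{BASE_URL}?{urlencode(params)}"
print(ws_url)
```

Open this URL with any WebSocket client, authenticate with your API key, then stream raw audio frames and read JSON messages back.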
How Does Turn Detection Work?
Universal-3 Pro Streaming uses a punctuation-based turn detection system controlled by two parameters:
How it works:
- User speaks → audio streams to AssemblyAI
- User pauses for min_turn_silence → model checks for terminal punctuation (.?!)
- If terminal punctuation found → turn ends immediately with end_of_turn: true
- If no terminal punctuation → partial emitted with end_of_turn: false, turn continues
- If silence reaches max_turn_silence → turn forced to end regardless of punctuation
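The decision logic above can be sketched as a small function (illustrative only — this is not the actual implementation, just the rules as described):

```python
# Terminal punctuation that can close a turn.
TERMINAL = (".", "?", "!")

def end_of_turn(transcript: str, silence_ms: int,
                min_turn_silence: int = 100,
                max_turn_silence: int = 1000) -> bool:
    """Punctuation-plus-silence turn decision, per the rules above."""
    if silence_ms >= max_turn_silence:
        return True                 # forced end regardless of punctuation
    if silence_ms >= min_turn_silence:
        # end only if the sentence looks complete
        return transcript.rstrip().endswith(TERMINAL)
    return False                    # user is still speaking

print(end_of_turn("My number is 555-0134.", 150))   # True  (punctuation found)
print(end_of_turn("My number is", 150))             # False (sentence incomplete)
print(end_of_turn("My number is", 1200))            # True  (max silence reached)
```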
This is different from the legacy Universal-Streaming models, which used a confidence-based end_of_turn_confidence_threshold. Universal-3 Pro Streaming does not use that parameter — turn decisions are based on punctuation after silence thresholds.
Configuration presets
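As an illustration of how presets might be organized (only the "responsive" values come from this guide's recommended starting point; the other numbers are placeholders, not official presets):

```python
# Values in milliseconds. "responsive" matches the starting point
# recommended later in this guide; the rest are illustrative only.
PRESETS = {
    "responsive": {"min_turn_silence": 100, "max_turn_silence": 1000},
    "balanced":   {"min_turn_silence": 300, "max_turn_silence": 2000},
    "careful":    {"min_turn_silence": 600, "max_turn_silence": 3000},
}

connection_params = {"sample_rate": 16000, **PRESETS["responsive"]}
```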
Entity splitting tradeoff
Lower silence values produce faster transcripts but can split entities across turns:
For voice agents, the downstream LLM can usually piece together split entities. But if your use case involves entity extraction or alphanumeric dictation, increase min_turn_silence and max_turn_silence during those portions of the conversation using dynamic configuration updates.
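For example, before an entity-dictation stage you might raise both thresholds with an UpdateConfiguration message (a sketch — the exact message schema is an assumption):

```python
import json

# Raise silence thresholds before a dictation stage so pauses between
# digits don't split the entity across turns.
update = {
    "type": "UpdateConfiguration",
    "min_turn_silence": 600,    # wait longer before checking punctuation
    "max_turn_silence": 3000,   # give dictated digits room to pause
}
payload = json.dumps(update)
# websocket.send(payload)  # sent over the active streaming connection
print(payload)
```

Send a second update restoring the faster values once the dictation stage is done.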
How Do I Handle Barge-In and Interruptions?
SpeechStarted events
Universal-3 Pro Streaming emits SpeechStarted events when voice activity is detected. These events are key for barge-in handling — when a user starts speaking while the agent is still talking:
When you receive a SpeechStarted event:
- Stop TTS playback immediately
- Switch the agent back to listening mode
- Wait for the user’s full turn to complete before responding
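The steps above can be sketched as a message handler (the event schema and field names beyond SpeechStarted, Turn, and end_of_turn are assumptions):

```python
import json

class FakeTTS:
    """Stand-in for a real TTS client, for illustration."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

def handle_message(message: str, tts, agent_state: dict) -> None:
    """Minimal barge-in sketch following the three steps above."""
    event = json.loads(message)
    if event.get("type") == "SpeechStarted" and agent_state["speaking"]:
        tts.stop()                        # 1. stop TTS playback immediately
        agent_state["speaking"] = False   # 2. switch back to listening mode
    elif event.get("type") == "Turn" and event.get("end_of_turn"):
        # 3. respond only once the user's full turn is complete
        agent_state["pending_turn"] = event.get("transcript", "")

tts = FakeTTS()
state = {"speaking": True, "pending_turn": None}
handle_message('{"type": "SpeechStarted"}', tts, state)
print(tts.stopped, state["speaking"])  # True False
```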
VAD threshold alignment
Universal-3 Pro Streaming includes an internal Silero VAD controlled by the vad_threshold parameter (default 0.3). If you’re also running a local VAD (common in LiveKit and Pipecat), align the thresholds to avoid a dead zone where one detects speech but the other doesn’t:
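A minimal sketch of keeping the two thresholds in sync (the local-VAD config key is hypothetical; 0.3 is the documented vad_threshold default):

```python
# Single source of truth so the local VAD and AssemblyAI's internal
# Silero VAD never disagree about what counts as speech.
VAD_THRESHOLD = 0.3

assemblyai_params = {"vad_threshold": VAD_THRESHOLD}
local_vad_config = {"activation_threshold": VAD_THRESHOLD}  # hypothetical key
```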
If you’re in a noisy environment and getting false speech triggers, raise both thresholds together.
How Can I Use Prompting to Improve Accuracy?
The prompt parameter
Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for turn detection is applied automatically.
Beta feature
Prompting is a beta feature. We recommend starting without a custom prompt to establish baseline performance, then experimenting to optimize for your use case.
Tips for effective prompts:
- Specify the audio context: accent, domain, expected utterance types
- Define punctuation rules: improves downstream LLM processing
- Preserve speech patterns: instruct the model to keep filler words for more natural interactions
- Specify language: prepend
Transcribe <language>.for non-English or multilingual conversations
Keyterms prompting
Use keyterms_prompt to boost recognition of specific names, brands, or domain terms — up to 100 terms per session:
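For example (whether keyterms_prompt accepts a list or a delimited string is an assumption here; the 100-term and 50-character limits come from this guide):

```python
# Domain terms to boost — proper names, products, jargon.
keyterms = [
    "AssemblyAI",
    "Universal-3 Pro Streaming",
    "Quattro Formaggi",
]
assert len(keyterms) <= 100                     # at most 100 terms per session
assert all(len(t) <= 50 for t in keyterms)      # each term up to 50 characters

connection_params = {"keyterms_prompt": keyterms}
```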
Best practices for keyterms:
- Include proper names, product names, technical terms, and domain-specific jargon
- Include terms up to 50 characters each
- Don’t include common English words, single letters, or generic phrases
- Don’t exceed 100 terms total
For detailed guidance, see Keyterms prompting.
How Do I Update Configuration Mid-Session?
You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using UpdateConfiguration. This is one of Universal-3 Pro Streaming’s most powerful features for voice agents.
Dynamic keyterms by conversation stage
As your voice agent moves through different stages, update keyterms to match what the user is likely to say:
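A sketch of stage-based keyterms updates (the stage names and message schema are illustrative assumptions):

```python
import json

# Hypothetical conversation stages mapped to likely vocabulary.
STAGE_KEYTERMS = {
    "greeting": ["account number", "billing", "support"],
    "menu":     ["Margherita", "Pepperoni", "Quattro Formaggi"],
    "address":  ["Street", "Avenue", "Boulevard", "apartment"],
}

def keyterms_update(stage: str) -> str:
    """Build an UpdateConfiguration payload for the current stage."""
    return json.dumps({
        "type": "UpdateConfiguration",
        "keyterms_prompt": STAGE_KEYTERMS[stage],
    })

# websocket.send(keyterms_update("menu"))  # when the agent reads out the menu
```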
Dynamic prompting
You can also update the transcription prompt mid-session. This is especially powerful when paired with tool calls in your LLM:
- If your agent asks a yes/no question, prompt the model to anticipate short responses
- If your agent asks for a phone number or email, prompt it to expect those formats
- If you present a list of options, boost those options in the prompt
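The three patterns above can be sketched as a lookup keyed by the question the agent just asked (question types and prompt wording are illustrative):

```python
# Illustrative prompts for anticipated response formats.
RESPONSE_PROMPTS = {
    "yes_no": "Transcribe short confirmations like yes, no, and mmhmm accurately.",
    "phone":  "Expect a phone number; transcribe the digits carefully.",
    "email":  "Expect an email address; transcribe it character by character.",
}

def prompt_update(question_type: str) -> dict:
    """UpdateConfiguration payload sent right after the agent asks its question."""
    return {"type": "UpdateConfiguration", "prompt": RESPONSE_PROMPTS[question_type]}
```

Wire this into your LLM's tool-calling loop so the transcription prompt always matches the question the user is about to answer.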
How Do I Use Speaker Diarization?
Streaming Diarization identifies and labels individual speakers in real time. Each Turn event includes a speaker_label field (e.g., "A", "B") indicating which speaker produced that transcript.
Enable it by adding speaker_labels: true to your connection parameters:
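For example (the connection-parameter and Turn-event schemas are assumptions beyond the documented speaker_labels and speaker_label fields):

```python
import json

# Enable streaming diarization when opening the connection.
connection_params = {"sample_rate": 16000, "speaker_labels": True}

# Each Turn event then carries a speaker_label field:
turn = json.loads('{"type": "Turn", "speaker_label": "A", "end_of_turn": true}')
print(f'Speaker {turn["speaker_label"]} finished a turn')
```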
Speaker accuracy improves over the course of a session as the model accumulates embedding context.
Diarization also works through the LiveKit and Pipecat integrations; Pipecat additionally supports custom formatting of the speaker labels.
For more details, see Streaming Diarization and Multichannel.
How Do I Optimize for Latency?
Key optimizations
1. Use the right silence thresholds
Start with min_turn_silence=100 and max_turn_silence=1000. Only increase if you’re seeing entity splitting issues.
2. Eliminate additive delays in your orchestrator
In LiveKit with turn_detection="stt", set min_endpointing_delay=0 — LiveKit’s default 0.5s delay is additive on top of AssemblyAI’s own endpointing.
3. Use 16kHz sample rate
This balances audio quality and bandwidth. Higher sample rates don’t improve accuracy.
4. Align VAD thresholds
Mismatched VAD thresholds between your local VAD and AssemblyAI create a dead zone that delays interruption. Set both to 0.3.
5. Skip unnecessary features
Only enable speaker_labels if you need diarization. Only use keyterms_prompt if you have domain-specific terms. Each feature adds marginal processing overhead.
Latency breakdown
How Does the Message Sequence Work?
Universal-3 Pro Streaming sends messages in a specific sequence. Here’s what a typical conversation looks like:
1. Session begins
2. Speech detected
3. Partial transcript (during silence, no terminal punctuation)
4. Final transcript (terminal punctuation found, or max_turn_silence reached)
For Universal-3 Pro Streaming, end_of_turn and turn_is_formatted always have the same value. You can reliably use end_of_turn: true to detect a formatted, final transcript.
5. Session termination
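The full sequence can be sketched as follows (the message type strings and the transcript field are assumptions; end_of_turn and turn_is_formatted are as documented above):

```python
# Illustrative message flow for one user turn.
sequence = [
    {"type": "SessionBegins"},
    {"type": "SpeechStarted"},
    {"type": "Turn", "transcript": "My name is",
     "end_of_turn": False, "turn_is_formatted": False},   # partial
    {"type": "Turn", "transcript": "My name is Ada Lovelace.",
     "end_of_turn": True, "turn_is_formatted": True},     # final, formatted
    {"type": "SessionTerminated"},
]

# Because end_of_turn and turn_is_formatted always match,
# end_of_turn alone is a reliable final-transcript check:
finals = [m for m in sequence if m.get("end_of_turn")]
assert all(m["turn_is_formatted"] for m in finals)
print(finals[0]["transcript"])
```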
For the complete message reference, see Message sequence.
How Can I Improve Accuracy?
Keyterms prompting
The single most effective way to improve accuracy on domain-specific terms. See How Can I Use Prompting to Improve Accuracy? above.
Dynamic configuration updates
Update keyterms and prompts mid-session based on conversation context. See How Do I Update Configuration Mid-Session? above.
Tune silence thresholds
If entities are splitting across turns, increase min_turn_silence (for punctuation-triggered splits) or max_turn_silence (for forced timeout splits). You can do this dynamically mid-session for specific conversation stages like entity dictation.
Noise handling
Universal-3 Pro Streaming handles background noise well out of the box. Avoid adding noise cancellation as a preprocessing step — the artifacts it introduces typically cause more harm than the background noise itself.
Scaling and Concurrency
Universal-3 Pro Streaming provides unlimited concurrent streams:
- No hard caps on simultaneous connections
- No overage fees for spike traffic
- Automatic scaling from 5 to 50,000+ streams
Rate limits:
- Free users: 5 new streams per minute
- Pay-as-you-go: 100 new streams per minute
- When using 70%+ of your limit, capacity automatically increases 10% every 60 seconds
These limits are designed to never interfere with legitimate applications. Your baseline limit is guaranteed and never decreases, so you can scale smoothly without artificial barriers.