
Best Practices for Building Voice Agents

Introduction

AssemblyAI’s Universal-3 Pro Streaming is the most accurate real-time speech-to-text model designed for voice agents. It delivers formatted, immutable transcripts with sub-300ms latency, exceptional entity accuracy, native multilingual code switching, and a fully promptable interface — all optimized for conversational AI workflows.

Why Universal-3 Pro Streaming for Voice Agents?

Voice agents need speed, accuracy, and natural turn-taking. Universal-3 Pro Streaming is purpose-built for this:

Sub-300ms latency with formatted output

  • Immutable transcripts arrive fully formatted (punctuation, capitalization) — no waiting for a separate formatting step
  • Every final transcript is ready for immediate LLM processing

Exceptional entity accuracy

  • Credit card numbers, phone numbers, email addresses, physical addresses, and names are transcribed with high accuracy
  • Short utterances like “yes”, “no”, “mmhmm” are handled reliably

Punctuation-based turn detection

  • Turn boundaries are determined by terminal punctuation (. ? !) combined with silence thresholds
  • Configurable min_turn_silence and max_turn_silence parameters let you tune responsiveness vs. accuracy
  • No confidence-score guessing — the model understands when a sentence is complete

Fully promptable

  • Custom prompt parameter for transcription instructions
  • Dynamic prompting mid-session via UpdateConfiguration — adapt the model to each stage of the conversation
  • keyterms_prompt for boosting recognition of specific names, brands, and domain terms

Native multilingual support

  • Supports English, Spanish, French, German, Italian, and Portuguese
  • Automatic code-switching between languages within a single session
  • Language-specific prompting for improved accuracy

What Languages Does Universal-3 Pro Streaming Support?

Universal-3 Pro Streaming supports six languages with automatic code-switching:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

The model handles code-switching natively — speakers can switch between supported languages mid-conversation without any configuration changes. Accuracy improves when you specify the expected language in the prompt. See Supported languages for the full language list and regional dialect reference.

To guide the model toward a specific language, prepend language information to the default prompt:

Transcribe Spanish. Transcribe verbatim. Rules:
Always include punctuation in output.
Use period/question mark ONLY for complete sentences.
Use comma for mid-sentence pauses.
Use no punctuation for incomplete trailing speech.
Filler words (um, uh, so, like) indicate speaker will continue.

For multilingual conversations:

Transcribe multilingual conversation in Spanish and English.
Transcribe verbatim. Rules:
Always include punctuation in output.
Use period/question mark ONLY for complete sentences.
Use comma for mid-sentence pauses.
Use no punctuation for incomplete trailing speech.
Filler words (um, uh, so, like) indicate speaker will continue.

How Do I Get Started?

Complete voice agent stack

AssemblyAI provides speech-to-text. For a complete voice agent, you need:

  1. Speech-to-Text (STT): AssemblyAI Universal-3 Pro Streaming
  2. Large Language Model (LLM): OpenAI, Anthropic, Google, etc.
  3. Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
  4. Orchestration: LiveKit, Pipecat, or custom build

Pre-built integrations

LiveKit Agents (recommended)

LiveKit provides the fastest path to a working voice agent with AssemblyAI. See Universal-3 Pro Streaming on LiveKit for a full guide.

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
    turn_detection="stt",
    min_endpointing_delay=0,
)

Pipecat by Daily

Pipecat is an open-source framework for conversational AI with maximum customizability. See Universal-3 Pro Streaming on Pipecat for a full guide.

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
    ),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
)

Direct WebSocket connection

For custom builds, connect directly to the WebSocket API:

import json
import pyaudio
import websocket
import threading
import time
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT_BASE = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

def on_message(ws, message):
    data = json.loads(message)

    if data.get("type") == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if end_of_turn:
            # Final transcript — send to LLM
            print(f"Final: {transcript}")
        else:
            # Partial — can start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif data.get("type") == "SpeechStarted":
        # User started speaking — handle barge-in
        print("Speech detected — interrupt agent if speaking")

ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)

How Does Turn Detection Work?

Universal-3 Pro Streaming uses a punctuation-based turn detection system controlled by two parameters:

Parameter        | Default | Description
min_turn_silence | 100 ms  | Silence before a speculative end-of-turn check fires.
max_turn_silence | 1000 ms | Maximum silence before forcing the turn to end.

How it works:

  1. User speaks → audio streams to AssemblyAI
  2. User pauses for min_turn_silence → model checks for terminal punctuation (. ? !)
  3. If terminal punctuation found → turn ends immediately with end_of_turn: true
  4. If no terminal punctuation → partial emitted with end_of_turn: false, turn continues
  5. If silence reaches max_turn_silence → turn forced to end regardless of punctuation

This is different from the legacy Universal-Streaming models, which used a confidence-based end_of_turn_confidence_threshold. Universal-3 Pro Streaming does not use that parameter — turn decisions are based on punctuation after silence thresholds.
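The decision steps above can be sketched as a small function. This is an illustration, not AssemblyAI's implementation: the parameter names mirror the API, but the punctuation check here is a deliberate simplification of what the model does server-side.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")

def should_end_turn(transcript: str, silence_ms: int,
                    min_turn_silence: int = 100,
                    max_turn_silence: int = 1000) -> bool:
    """Simplified sketch of the punctuation-based turn decision."""
    if silence_ms >= max_turn_silence:
        return True  # forced end regardless of punctuation
    if silence_ms >= min_turn_silence:
        # speculative check: end only on terminal punctuation
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False  # not enough silence yet; keep listening
```

Note how an unfinished utterance like "my card number is" survives the `min_turn_silence` check (no terminal punctuation) and only ends when silence reaches `max_turn_silence`.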

Configuration presets

# Fast — quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced — most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient — entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}

Entity splitting tradeoff

Lower silence values produce faster transcripts but can split entities across turns:

# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."     → turn ends (period found after 100ms pause)
"Smith."         → new turn
"At gmail.com."  → new turn

# With (min_turn_silence=400, max_turn_silence=2000)
"It's john.smith@gmail.com."  → single turn (properly formatted)

For voice agents, the downstream LLM can usually piece together split entities. But if your use case involves entity extraction or alphanumeric dictation, increase min_turn_silence and max_turn_silence during those portions of the conversation using dynamic configuration updates.
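As a sketch, assuming `ws` is your open streaming WebSocket, a dictation stage could temporarily raise the thresholds with an UpdateConfiguration message and restore them afterwards. The helper name and the specific values here are illustrative, not API defaults:

```python
import json

def dictation_config(min_turn_silence=400, max_turn_silence=3000):
    """Build an UpdateConfiguration payload for dictation-heavy stages.

    Illustrative helper; tune the values for your own flows.
    """
    return json.dumps({
        "type": "UpdateConfiguration",
        "min_turn_silence": min_turn_silence,
        "max_turn_silence": max_turn_silence,
    })

# During the dictation stage:
# ws.send(dictation_config())
# Afterwards, restore the recommended defaults:
# ws.send(json.dumps({"type": "UpdateConfiguration",
#                     "min_turn_silence": 100, "max_turn_silence": 1000}))
```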

How Do I Handle Barge-In and Interruptions?

SpeechStarted events

Universal-3 Pro Streaming emits SpeechStarted events when voice activity is detected. These events are the key signal for barge-in handling, i.e. when a user starts speaking while the agent is still talking:

{
  "type": "SpeechStarted",
  "timestamp": 14400,
  "confidence": 0.79
}

When you receive a SpeechStarted event:

  1. Stop TTS playback immediately
  2. Switch the agent back to listening mode
  3. Wait for the user’s full turn to complete before responding
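The three steps above can be sketched as a minimal state machine. Everything here is illustrative: `AgentState` is a hypothetical helper, and `tts` stands in for whatever playback interface your stack exposes.

```python
class AgentState:
    """Hypothetical barge-in handler; `tts` is any object with a .stop()."""

    def __init__(self, tts):
        self.tts = tts
        self.agent_speaking = False

    def on_event(self, event: dict):
        if event.get("type") == "SpeechStarted" and self.agent_speaking:
            self.tts.stop()              # 1. stop TTS playback immediately
            self.agent_speaking = False  # 2. switch back to listening mode
        elif event.get("type") == "Turn" and event.get("end_of_turn"):
            # 3. the user's full turn is complete; hand it to the LLM
            return event["transcript"]
        return None
```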

VAD threshold alignment

Universal-3 Pro Streaming includes an internal Silero VAD controlled by the vad_threshold parameter (default 0.3). If you’re also running a local VAD (common in LiveKit and Pipecat), align the thresholds to avoid a dead zone where one detects speech but the other doesn’t:

# Both thresholds aligned at 0.3
stt = assemblyai.STT(
    model="u3-rt-pro",
    vad_threshold=0.3,
)
vad = silero.VAD.load(
    activation_threshold=0.3,
)

If you’re in a noisy environment and getting false speech triggers, raise both thresholds together.

How Can I Use Prompting to Improve Accuracy?

The prompt parameter

Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for turn detection is applied automatically.

Beta feature

Prompting is a beta feature. We recommend starting without a custom prompt to establish baseline performance, then experimenting to optimize for your use case.

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "prompt": "Transcribe this audio: AI voice agent talking to a human for customer service. Mandatory: Transcribe verbatim with all spoken filler words, hesitations, repetitions, and false starts exactly as spoken."
}

Tips for effective prompts:

  • Specify the audio context: accent, domain, expected utterance types
  • Define punctuation rules: improves downstream LLM processing
  • Preserve speech patterns: instruct the model to keep filler words for more natural interactions
  • Specify language: prepend Transcribe <language>. for non-English or multilingual conversations

Keyterms prompting

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms — up to 100 terms per session:

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "keyterms_prompt": json.dumps([
        "AssemblyAI",
        "LiveKit",
        "Dr. Rodriguez",
        "Lisinopril",
        "iPhone 15 Pro",
    ])
}

Best practices for keyterms:

  • Include proper names, product names, technical terms, and domain-specific jargon
  • Include terms up to 50 characters each
  • Don’t include common English words, single letters, or generic phrases
  • Don’t exceed 100 terms total

For detailed guidance, see Keyterms prompting.
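The limits above are easy to enforce client-side before opening a session. The helper below is a local convenience for illustration, not part of the API:

```python
def validate_keyterms(terms):
    """Drop terms over 50 characters and reject lists over 100 terms.

    Hypothetical client-side guard mirroring the documented limits.
    """
    cleaned = [t for t in terms if 1 <= len(t) <= 50]
    if len(cleaned) > 100:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the 100-term limit")
    return cleaned
```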

How Do I Update Configuration Mid-Session?

You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using UpdateConfiguration. This is one of Universal-3 Pro Streaming’s most powerful features for voice agents.

Dynamic keyterms by conversation stage

As your voice agent moves through different stages, update keyterms to match what the user is likely to say:

# Caller identification stage
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"]
}))

# Medical intake stage
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]
}))

# Payment stage — also increase max_turn_silence for credit card dictation
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["Visa", "Mastercard", "American Express"],
    "max_turn_silence": 3000
}))

Dynamic prompting

You can also update the transcription prompt mid-session. This is especially powerful when paired with tool calls in your LLM:

  • If your agent asks a yes/no question, prompt the model to anticipate short responses
  • If your agent asks for a phone number or email, prompt it to expect those formats
  • If you present a list of options, boost those options in the prompt

# After asking "Would you like to confirm your appointment?"
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "prompt": "User is responding yes or no to a confirmation question. Expect short responses."
}))

# After asking "What's your phone number?"
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "prompt": "User is dictating a phone number. Expect digits and formatting.",
    "max_turn_silence": 3000
}))

How Do I Use Speaker Diarization?

Streaming Diarization identifies and labels individual speakers in real time. Each Turn event includes a speaker_label field (e.g., "A", "B") indicating which speaker produced that transcript.

Enable it by adding speaker_labels: true to your connection parameters:

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "speaker_labels": True,
}

Speaker accuracy improves over the course of a session as the model accumulates embedding context.
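If you are consuming Turn events directly, assembling a labeled transcript from the speaker_label field is straightforward. A minimal sketch (the helper is hypothetical, not an SDK function):

```python
def label_transcript(turn_events):
    """Join final Turn events into a speaker-labeled transcript.

    speaker_label values ("A", "B", ...) come from the API; this
    formatting helper is illustrative.
    """
    lines = []
    for ev in turn_events:
        if ev.get("end_of_turn"):  # only final, formatted turns
            lines.append(f'[{ev.get("speaker_label", "?")}] {ev["transcript"]}')
    return "\n".join(lines)
```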

With LiveKit:

stt = assemblyai.STT(
    model="u3-rt-pro",
    speaker_labels=True,
)

With Pipecat (including custom formatting):

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="[{speaker}] {text}",
)

For more details, see Streaming Diarization and Multichannel.

How Do I Optimize for Latency?

Key optimizations

1. Use the right silence thresholds

Start with min_turn_silence=100 and max_turn_silence=1000. Only increase if you’re seeing entity splitting issues.

2. Eliminate additive delays in your orchestrator

In LiveKit with turn_detection="stt", set min_endpointing_delay=0 — LiveKit’s default 0.5s delay is additive on top of AssemblyAI’s own endpointing.

3. Use 16kHz sample rate

This balances audio quality and bandwidth. Higher sample rates don’t improve accuracy.

4. Align VAD thresholds

Mismatched VAD thresholds between your local VAD and AssemblyAI create a dead zone that delays interruption. Set both to 0.3.

5. Skip unnecessary features

Only enable speaker_labels if you need diarization. Only use keyterms_prompt if you have domain-specific terms. Each feature adds marginal processing overhead.

Latency breakdown

Stage                     | Typical latency | Notes
Audio to AssemblyAI       | ~50 ms          | Network dependent
Speech-to-text            | ~200-300 ms     | Sub-300ms P50
min_turn_silence check    | 100 ms+         | Configurable
max_turn_silence fallback | 1000 ms+        | Only if no terminal punctuation
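Putting these numbers together gives a back-of-the-envelope time-to-final-transcript. The figures below are assumptions taken from the table (network latency varies):

```python
NETWORK_MS = 50          # audio to AssemblyAI (table estimate)
STT_MS = 300             # speech-to-text, using the upper bound
MIN_TURN_SILENCE = 100   # recommended default
MAX_TURN_SILENCE = 1000  # recommended default

# Best case: terminal punctuation found at the first silence check
best_case = NETWORK_MS + STT_MS + MIN_TURN_SILENCE
# Worst case: no terminal punctuation, turn forced at max_turn_silence
worst_case = NETWORK_MS + STT_MS + MAX_TURN_SILENCE

print(best_case, worst_case)  # 450 1350
```

So under these assumptions a well-punctuated utterance reaches your LLM in roughly 450 ms, while a trailing-off utterance waits out the full fallback window.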

How Does the Message Sequence Work?

Universal-3 Pro Streaming sends messages in a specific sequence. Here’s what a typical conversation looks like:

1. Session begins

{
  "type": "Begin",
  "id": "session-id",
  "expires_at": 1759796682
}

2. Speech detected

{
  "type": "SpeechStarted",
  "timestamp": 1200,
  "confidence": 0.85
}

3. Partial transcript (during silence, no terminal punctuation)

{
  "type": "Turn",
  "turn_order": 0,
  "end_of_turn": false,
  "turn_is_formatted": false,
  "transcript": "Yeah my credit card number is--"
}

4. Final transcript (terminal punctuation found, or max_turn_silence reached)

{
  "type": "Turn",
  "turn_order": 0,
  "end_of_turn": true,
  "turn_is_formatted": true,
  "transcript": "Yeah, my credit card number is 8888-8888-8888-8888.",
  "speaker_label": "A"
}

For Universal-3 Pro Streaming, end_of_turn and turn_is_formatted always have the same value. You can reliably use end_of_turn: true to detect a formatted, final transcript.

5. Session termination

{
  "type": "Termination",
  "audio_duration_seconds": 45.2
}

For the complete message reference, see Message sequence.

How Can I Improve Accuracy?

Keyterms prompting

The single most effective way to improve accuracy on domain-specific terms. See How Can I Use Prompting to Improve Accuracy? above.

Dynamic configuration updates

Update keyterms and prompts mid-session based on conversation context. See How Do I Update Configuration Mid-Session? above.

Tune silence thresholds

If entities are splitting across turns, increase min_turn_silence (for punctuation-triggered splits) or max_turn_silence (for forced timeout splits). You can do this dynamically mid-session for specific conversation stages like entity dictation.

Noise handling

Universal-3 Pro Streaming handles background noise well out of the box. Avoid adding noise cancellation as a preprocessing step — the artifacts it introduces typically cause more harm than the background noise itself.

Scaling and Concurrency

Universal-3 Pro Streaming provides unlimited concurrent streams:

  • No hard caps on simultaneous connections
  • No overage fees for spike traffic
  • Automatic scaling from 5 to 50,000+ streams

Rate limits:

  • Free users: 5 new streams per minute
  • Pay-as-you-go: 100 new streams per minute
  • When using 70%+ of your limit, capacity automatically increases 10% every 60 seconds

These limits are designed to never interfere with legitimate applications. Your baseline limit is guaranteed and never decreases, so you can scale smoothly without artificial barriers.
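If a traffic spike does outrun the new-streams-per-minute limit, a standard client-side response is exponential backoff on stream creation. This is a generic retry pattern, not an SDK feature; the connect function and error type in the usage comment are hypothetical placeholders for your own client code:

```python
def backoff_schedule(max_retries=5, base_delay=1.0, cap=30.0):
    """Delays (in seconds) between retries when stream creation is throttled.

    Doubles each attempt, capped so a long outage doesn't stall forever.
    """
    return [min(base_delay * (2 ** i), cap) for i in range(max_retries)]

# for delay in backoff_schedule():
#     try:
#         ws = open_stream()       # your own connect function (hypothetical)
#         break
#     except RateLimitError:       # your client's throttling error (hypothetical)
#         time.sleep(delay)
```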


Additional Resources