
Best Practices for Building Voice Agents

Introduction

AssemblyAI’s Universal-3 Pro Streaming is the most accurate real-time speech-to-text model designed for voice agents. It delivers formatted, immutable transcripts with sub-300ms latency, exceptional entity accuracy, native multilingual code switching, and a fully promptable interface — all optimized for conversational AI workflows.

Why Universal-3 Pro Streaming for Voice Agents?

Voice agents need speed, accuracy, and natural turn-taking. Universal-3 Pro Streaming is purpose-built for this:

Sub-300ms latency with formatted output

  • Immutable transcripts arrive fully formatted (punctuation, capitalization) — no waiting for a separate formatting step
  • Every final transcript is ready for immediate LLM processing

Exceptional entity accuracy

  • Credit card numbers, phone numbers, email addresses, physical addresses, and names are transcribed with high accuracy
  • Short utterances like “yes”, “no”, “mmhmm” are handled reliably

Punctuation-based turn detection

  • Turn boundaries are determined by terminal punctuation (. ? !) combined with silence thresholds
  • Configurable min_turn_silence and max_turn_silence parameters let you tune responsiveness vs. accuracy
  • No confidence-score guessing — the model understands when a sentence is complete

Fully promptable

  • Custom prompt parameter for transcription instructions
  • Dynamic prompting mid-session via UpdateConfiguration — adapt the model to each stage of the conversation
  • keyterms_prompt for boosting recognition of specific names, brands, and domain terms

Native multilingual support

  • Supports English, Spanish, French, German, Italian, and Portuguese
  • Automatic code-switching between languages within a single session
  • Language-specific prompting for improved accuracy

What Languages Does Universal-3 Pro Streaming Support?

Universal-3 Pro Streaming supports six languages with automatic code-switching:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese

The model handles code-switching natively — speakers can switch between supported languages mid-conversation without any configuration changes. Accuracy improves when you specify the expected language in the prompt. See Supported languages for the full language list and regional dialect reference.

To guide the model toward a specific language, prepend language information to the default prompt:

Transcribe Spanish. Transcribe verbatim. Rules:
Always include punctuation in output.
Use period/question mark ONLY for complete sentences.
Use comma for mid-sentence pauses.
Use no punctuation for incomplete trailing speech.
Filler words (um, uh, so, like) indicate speaker will continue.

For multilingual conversations:

Transcribe multilingual conversation in Spanish and English.
Transcribe verbatim. Rules:
Always include punctuation in output.
Use period/question mark ONLY for complete sentences.
Use comma for mid-sentence pauses.
Use no punctuation for incomplete trailing speech.
Filler words (um, uh, so, like) indicate speaker will continue.

How Do I Get Started?

Complete voice agent stack

AssemblyAI provides speech-to-text. For a complete voice agent, you need:

  1. Speech-to-Text (STT): AssemblyAI Universal-3 Pro Streaming
  2. Large Language Model (LLM): OpenAI, Anthropic, Google, etc.
  3. Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
  4. Orchestration: LiveKit, Pipecat, or custom build

Pre-built integrations

LiveKit Agents (recommended)

LiveKit provides the fastest path to a working voice agent with AssemblyAI. See Universal-3 Pro Streaming on LiveKit for a full guide.

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
    turn_detection="stt",
    min_endpointing_delay=0,
)

Pipecat by Daily

Pipecat is an open-source framework for conversational AI with maximum customizability. See Universal-3 Pro Streaming on Pipecat for a full guide.

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
    ),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
)

Direct WebSocket connection

For custom builds, connect directly to the WebSocket API:

import json
import pyaudio
import websocket
import threading
import time
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT_BASE = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

def on_message(ws, message):
    data = json.loads(message)

    if data.get("type") == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if end_of_turn:
            # Final transcript — send to LLM
            print(f"Final: {transcript}")
        else:
            # Partial — can start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif data.get("type") == "SpeechStarted":
        # User started speaking — handle barge-in
        print("Speech detected — interrupt agent if speaking")

ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)

How Does Turn Detection Work?

Universal-3 Pro Streaming uses a punctuation-based turn detection system controlled by two parameters:

Parameter        | Default | Description
min_turn_silence | 100 ms  | Silence before a speculative end-of-turn check fires.
max_turn_silence | 1000 ms | Maximum silence before forcing the turn to end.

How it works:

  1. User speaks → audio streams to AssemblyAI
  2. User pauses for min_turn_silence → model checks for terminal punctuation (. ? !)
  3. If terminal punctuation found → turn ends immediately with end_of_turn: true
  4. If no terminal punctuation → partial emitted with end_of_turn: false, turn continues
  5. If silence reaches max_turn_silence → turn forced to end regardless of punctuation

This is different from the legacy Universal-Streaming models, which used a confidence-based end_of_turn_confidence_threshold. Universal-3 Pro Streaming does not use that parameter — turn decisions are based on punctuation after silence thresholds.
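The decision steps above can be sketched as a small function. This is an illustration, not AssemblyAI's implementation: the parameter names mirror the API, but the punctuation check here is a deliberate simplification of what the model does server-side.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")

def should_end_turn(transcript: str, silence_ms: int,
                    min_turn_silence: int = 100,
                    max_turn_silence: int = 1000) -> bool:
    """Simplified sketch of the punctuation-based turn decision."""
    if silence_ms >= max_turn_silence:
        return True  # forced end regardless of punctuation
    if silence_ms >= min_turn_silence:
        # speculative check: end only on terminal punctuation
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False  # not enough silence yet; keep listening
```

Note how an unfinished utterance like "my card number is" survives the `min_turn_silence` check (no terminal punctuation) and only ends when silence reaches `max_turn_silence`.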

Configuration presets

# Fast — quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced — most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient — entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}

Entity splitting tradeoff

Lower silence values produce faster transcripts but can split entities across turns:

# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."     → turn ends (period found after 100ms pause)
"Smith."         → new turn
"At gmail.com."  → new turn

# With (min_turn_silence=400, max_turn_silence=2000)
"It's john.smith@gmail.com."  → single turn (properly formatted)

For voice agents, the downstream LLM can usually piece together split entities. But if your use case involves entity extraction or alphanumeric dictation, increase min_turn_silence and max_turn_silence during those portions of the conversation using dynamic configuration updates.
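As a sketch, assuming `ws` is your open streaming WebSocket, a dictation stage could temporarily raise the thresholds with an UpdateConfiguration message and restore them afterwards. The helper name and the specific values here are illustrative, not API defaults:

```python
import json

def dictation_config(min_turn_silence=400, max_turn_silence=3000):
    """Build an UpdateConfiguration payload for dictation-heavy stages.

    Illustrative helper; tune the values for your own flows.
    """
    return json.dumps({
        "type": "UpdateConfiguration",
        "min_turn_silence": min_turn_silence,
        "max_turn_silence": max_turn_silence,
    })

# During the dictation stage:
# ws.send(dictation_config())
# Afterwards, restore the recommended defaults:
# ws.send(json.dumps({"type": "UpdateConfiguration",
#                     "min_turn_silence": 100, "max_turn_silence": 1000}))
```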

How Do I Handle Barge-In and Interruptions?

SpeechStarted events

Universal-3 Pro Streaming emits SpeechStarted events when voice activity is detected. These events are the key signal for barge-in handling, i.e. when a user starts speaking while the agent is still talking:

{
  "type": "SpeechStarted",
  "timestamp": 14400,
  "confidence": 0.79
}

When you receive a SpeechStarted event:

  1. Stop TTS playback immediately
  2. Switch the agent back to listening mode
  3. Wait for the user’s full turn to complete before responding
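The three steps above can be sketched as a minimal state machine. Everything here is illustrative: `AgentState` is a hypothetical helper, and `tts` stands in for whatever playback interface your stack exposes.

```python
class AgentState:
    """Hypothetical barge-in handler; `tts` is any object with a .stop()."""

    def __init__(self, tts):
        self.tts = tts
        self.agent_speaking = False

    def on_event(self, event: dict):
        if event.get("type") == "SpeechStarted" and self.agent_speaking:
            self.tts.stop()              # 1. stop TTS playback immediately
            self.agent_speaking = False  # 2. switch back to listening mode
        elif event.get("type") == "Turn" and event.get("end_of_turn"):
            # 3. the user's full turn is complete; hand it to the LLM
            return event["transcript"]
        return None
```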

VAD threshold alignment

Universal-3 Pro Streaming includes an internal Silero VAD controlled by the vad_threshold parameter (default 0.3). If you’re also running a local VAD (common in LiveKit and Pipecat), align the thresholds to avoid a dead zone where one detects speech but the other doesn’t:

# Both thresholds aligned at 0.3
stt = assemblyai.STT(
    model="u3-rt-pro",
    vad_threshold=0.3,
)
vad = silero.VAD.load(
    activation_threshold=0.3,
)

If you’re in a noisy environment and getting false speech triggers, raise both thresholds together.

How Can I Use Prompting to Improve Accuracy?

The prompt parameter

Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for turn detection is applied automatically.

Beta feature

Prompting is a beta feature. We recommend starting without a custom prompt to establish baseline performance, then experimenting to optimize for your use case.

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "prompt": "Transcribe this audio: AI voice agent talking to a human for customer service. Mandatory: Transcribe verbatim with all spoken filler words, hesitations, repetitions, and false starts exactly as spoken."
}

Tips for effective prompts:

  • Specify the audio context: accent, domain, expected utterance types
  • Define punctuation rules: improves downstream LLM processing
  • Preserve speech patterns: instruct the model to keep filler words for more natural interactions
  • Specify language: prepend Transcribe <language>. for non-English or multilingual conversations

Keyterms prompting

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms — up to 100 terms per session:

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "keyterms_prompt": json.dumps([
        "AssemblyAI",
        "LiveKit",
        "Dr. Rodriguez",
        "Lisinopril",
        "iPhone 15 Pro",
    ])
}

Best practices for keyterms:

  • Include proper names, product names, technical terms, and domain-specific jargon
  • Include terms up to 50 characters each
  • Don’t include common English words, single letters, or generic phrases
  • Don’t exceed 100 terms total

For detailed guidance, see Keyterms prompting.
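The limits above are easy to enforce client-side before opening a session. The helper below is a local convenience for illustration, not part of the API:

```python
def validate_keyterms(terms):
    """Drop terms over 50 characters and reject lists over 100 terms.

    Hypothetical client-side guard mirroring the documented limits.
    """
    cleaned = [t for t in terms if 1 <= len(t) <= 50]
    if len(cleaned) > 100:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the 100-term limit")
    return cleaned
```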

How Do I Update Configuration Mid-Session?

You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using UpdateConfiguration. This is one of Universal-3 Pro Streaming’s most powerful features for voice agents.

Dynamic keyterms by conversation stage

As your voice agent moves through different stages, update keyterms to match what the user is likely to say:

# Caller identification stage
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"]
}))

# Medical intake stage
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]
}))

# Payment stage — also increase max_turn_silence for credit card dictation
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["Visa", "Mastercard", "American Express"],
    "max_turn_silence": 3000
}))

Dynamic prompting

You can also update the transcription prompt mid-session. This is especially powerful when paired with tool calls in your LLM:

  • If your agent asks a yes/no question, prompt the model to anticipate short responses
  • If your agent asks for a phone number or email, prompt it to expect those formats
  • If you present a list of options, boost those options in the prompt

# After asking "Would you like to confirm your appointment?"
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "prompt": "User is responding yes or no to a confirmation question. Expect short responses."
}))

# After asking "What's your phone number?"
ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "prompt": "User is dictating a phone number. Expect digits and formatting.",
    "max_turn_silence": 3000
}))

How Do I Use Speaker Diarization?

Streaming Diarization identifies and labels individual speakers in real time. Each Turn event includes a speaker_label field (e.g., "A", "B") indicating which speaker produced that transcript.

Enable it by adding speaker_labels: true to your connection parameters:

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "speaker_labels": True,
}

Speaker accuracy improves over the course of a session as the model accumulates embedding context.
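If you are consuming Turn events directly, assembling a labeled transcript from the speaker_label field is straightforward. A minimal sketch (the helper is hypothetical, not an SDK function):

```python
def label_transcript(turn_events):
    """Join final Turn events into a speaker-labeled transcript.

    speaker_label values ("A", "B", ...) come from the API; this
    formatting helper is illustrative.
    """
    lines = []
    for ev in turn_events:
        if ev.get("end_of_turn"):  # only final, formatted turns
            lines.append(f'[{ev.get("speaker_label", "?")}] {ev["transcript"]}')
    return "\n".join(lines)
```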

With LiveKit:

stt = assemblyai.STT(
    model="u3-rt-pro",
    speaker_labels=True,
)

With Pipecat (including custom formatting):

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="[{speaker}] {text}",
)

For more details, see Streaming Diarization and Multichannel.

How Do I Optimize for Latency?

Key optimizations

1. Use the right silence thresholds

Start with min_turn_silence=100 and max_turn_silence=1000. Only increase if you’re seeing entity splitting issues.

2. Eliminate additive delays in your orchestrator

In LiveKit with turn_detection="stt", set min_endpointing_delay=0 — LiveKit’s default 0.5s delay is additive on top of AssemblyAI’s own endpointing.

3. Use 16kHz sample rate

This balances audio quality and bandwidth. Higher sample rates don’t improve accuracy.

4. Align VAD thresholds

Mismatched VAD thresholds between your local VAD and AssemblyAI create a dead zone that delays interruption. Set both to 0.3.

5. Skip unnecessary features

Only enable speaker_labels if you need diarization. Only use keyterms_prompt if you have domain-specific terms. Each feature adds marginal processing overhead.

Latency breakdown

Stage                     | Typical latency | Notes
Audio to AssemblyAI       | ~50 ms          | Network dependent
Speech-to-text            | ~200-300 ms     | Sub-300ms P50
min_turn_silence check    | 100 ms+         | Configurable
max_turn_silence fallback | 1000 ms+        | Only if no terminal punctuation
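Putting these numbers together gives a back-of-the-envelope time-to-final-transcript. The figures below are assumptions taken from the table (network latency varies):

```python
NETWORK_MS = 50          # audio to AssemblyAI (table estimate)
STT_MS = 300             # speech-to-text, using the upper bound
MIN_TURN_SILENCE = 100   # recommended default
MAX_TURN_SILENCE = 1000  # recommended default

# Best case: terminal punctuation found at the first silence check
best_case = NETWORK_MS + STT_MS + MIN_TURN_SILENCE
# Worst case: no terminal punctuation, turn forced at max_turn_silence
worst_case = NETWORK_MS + STT_MS + MAX_TURN_SILENCE

print(best_case, worst_case)  # 450 1350
```

So under these assumptions a well-punctuated utterance reaches your LLM in roughly 450 ms, while a trailing-off utterance waits out the full fallback window.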

How Does the Message Sequence Work?

Universal-3 Pro Streaming sends messages in a specific sequence. Here’s what a typical conversation looks like:

1. Session begins

{
  "type": "Begin",
  "id": "session-id",
  "expires_at": 1759796682
}

2. Speech detected

{
  "type": "SpeechStarted",
  "timestamp": 1200,
  "confidence": 0.85
}

3. Partial transcript (during silence, no terminal punctuation)

{
  "type": "Turn",
  "turn_order": 0,
  "end_of_turn": false,
  "turn_is_formatted": false,
  "transcript": "Yeah my credit card number is--"
}

4. Final transcript (terminal punctuation found, or max_turn_silence reached)

{
  "type": "Turn",
  "turn_order": 0,
  "end_of_turn": true,
  "turn_is_formatted": true,
  "transcript": "Yeah, my credit card number is 8888-8888-8888-8888.",
  "speaker_label": "A"
}

For Universal-3 Pro Streaming, end_of_turn and turn_is_formatted always have the same value. You can reliably use end_of_turn: true to detect a formatted, final transcript.

5. Session termination

{
  "type": "Termination",
  "audio_duration_seconds": 45.2
}

For the complete message reference, see Message sequence.

How Can I Improve Accuracy?

Keyterms prompting

The single most effective way to improve accuracy on domain-specific terms. See How Can I Use Prompting to Improve Accuracy? above.

Dynamic configuration updates

Update keyterms and prompts mid-session based on conversation context. See How Do I Update Configuration Mid-Session? above.

Tune silence thresholds

If entities are splitting across turns, increase min_turn_silence (for punctuation-triggered splits) or max_turn_silence (for forced timeout splits). You can do this dynamically mid-session for specific conversation stages like entity dictation.

Noise handling

Universal-3 Pro Streaming handles background noise well out of the box. Avoid adding noise cancellation as a preprocessing step — the artifacts it introduces typically cause more harm than the background noise itself.

Scaling and Concurrency

Universal-3 Pro Streaming provides unlimited concurrent streams:

  • No hard caps on simultaneous connections
  • No overage fees for spike traffic
  • Automatic scaling from 5 to 50,000+ streams

Rate limits:

  • Free users: 5 new streams per minute
  • Pay-as-you-go: 100 new streams per minute
  • When using 70%+ of your limit, capacity automatically increases 10% every 60 seconds

These limits are designed to never interfere with legitimate applications. Your baseline limit is guaranteed and never decreases, so you can scale smoothly without artificial barriers.
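If a traffic spike does outrun the new-streams-per-minute limit, a standard client-side response is exponential backoff on stream creation. This is a generic retry pattern, not an SDK feature; the connect function and error type in the usage comment are hypothetical placeholders for your own client code:

```python
def backoff_schedule(max_retries=5, base_delay=1.0, cap=30.0):
    """Delays (in seconds) between retries when stream creation is throttled.

    Doubles each attempt, capped so a long outage doesn't stall forever.
    """
    return [min(base_delay * (2 ** i), cap) for i in range(max_retries)]

# for delay in backoff_schedule():
#     try:
#         ws = open_stream()       # your own connect function (hypothetical)
#         break
#     except RateLimitError:       # your client's throttling error (hypothetical)
#         time.sleep(delay)
```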


Additional Resources