Overview
This guide covers integrating AssemblyAI’s Universal 3.5 Pro Realtime speech-to-text model into a Pipecat voice agent. Everything here applies equally to Universal-3 Pro Streaming (u3-rt-pro) — both belong to the same U3 Pro family and share every parameter in this guide, so you can swap the model string without changing anything else.
Universal 3.5 Pro Realtime is our flagship next-generation streaming model for voice agents: multilingual and promptable, with conversation context and voice focus.
Available on Pipecat 1.4.0+. Set model="universal-3-5-pro".
Turn detection
Decide when the user is done speaking: the two Pipecat modes, defaults, and entity tuning.
Latency
Shorten the gap between the user finishing and the agent replying.
Accuracy
Prompting, key terms, conversation context, and noise handling.
Interruptions
Natural barge-in while the agent is speaking.
Pipecat AssemblyAI STT plugin
View Pipecat’s AssemblyAI STT plugin reference.
Quickstart
Get a working, talking agent in a few minutes, then optimize from there.
Install Pipecat
Install Pipecat with the AssemblyAI, LLM, and TTS extras you need, for example pip install "pipecat-ai[assemblyai,openai,cartesia]".
What’s included:
- assemblyai: AssemblyAI U3 Pro STT service
- openai: OpenAI LLM service (used in the example)
- cartesia: Cartesia TTS service (used in the example)
Build a minimal agent
The example below uses Pipecat-controlled turn detection (the default). Pay attention to the comments for switching to AssemblyAI’s built-in turn detection, and note that the assistant aggregator at the end of the pipeline is what enables automatic conversation context.
Run and test
Run the agent directly with local audio, then speak into your microphone after hearing the greeting. For WebRTC or Daily testing, see Running your agent.
Parameters reference
Universal 3.5 Pro Realtime parameters
These are the key parameters to tune. Set them inside AssemblyAISTTService.Settings(...). They apply to the whole U3 Pro family (universal-3-5-pro and u3-rt-pro).
- model: The streaming model. "universal-3-5-pro" is the recommended flagship model; the plugin currently defaults to "u3-rt-pro", so set model explicitly. Both belong to the U3 Pro family and share every parameter below.
- mode: Accuracy/latency preset: "min_latency", "balanced", or "max_accuracy". Sets sensible defaults for mode-dependent fields; any value you set explicitly still takes precedence. The server defaults to "balanced". Construction-time only. U3 Pro family only. See Optimizing accuracy and latency.
- keyterms_prompt: List of terms to boost recognition for. Used on its own, your terms are appended to the default prompt automatically. Can’t be set in the same request as prompt; see Key terms to combine boosting with a custom prompt.
- prompt: Contextual prompt, a natural-language description of what the audio is about (domain, scenario, or full details). Can’t be set in the same request as keyterms_prompt; fold the terms into the prompt text instead (see Key terms). Prompting is currently a beta feature; see Prompting for more information.
- agent_context: Context carryover seed, your agent’s most recent spoken reply, up to ~1500 characters, used to transcribe the next user turn more accurately. Set it at construction time to seed an opening greeting; later turns are fed automatically. U3 Pro family only. See Conversation context.
- previous_context_n_turns: How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover entirely (including the automatic agent_context feed). Construction-time only; leave unset for the server default (3). U3 Pro family only.
- min_turn_silence: Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. (Formerly min_end_of_turn_silence_when_confident, deprecated but still supported with a warning.)
- max_turn_silence: Maximum silence before the turn is forced to end, regardless of punctuation. Auto-synced to min_turn_silence in Pipecat mode; respected as configured in AssemblyAI’s built-in turn detection mode.
- vad_threshold: AssemblyAI’s internal VAD threshold (0.0–1.0) for classifying audio frames as silence. Align with your local VAD’s activation threshold to avoid a “dead zone” where AssemblyAI transcribes speech your VAD hasn’t detected yet.
- voice_focus: Server-side noise suppression that isolates the primary speaker. "near-field" for close-talking mics, "far-field" for distant capture. Construction-time only. U3 Pro family only. See Voice focus.
- voice_focus_threshold: How aggressively voice_focus suppresses background audio. 0.0–1.0; higher is more aggressive. Only takes effect when voice_focus is set. Construction-time only. U3 Pro family only.
- continuous_partials: Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When enabled (default on both the API and this plugin), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. See Continuous partials for details.
- interruption_delay: How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.
- Language detection output: Universal 3.5 Pro Realtime code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages.
- speaker_labels: Enable speaker diarization. See Speaker diarization.
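The ranges and constraints listed above can be checked before a session is constructed. The following is a minimal sketch in plain Python, assuming the settings are held in an ordinary dict. validate_settings is a hypothetical helper, not part of the Pipecat plugin, and it encodes only limits stated in this guide.

```python
# Hypothetical pre-flight check; the plugin and server do their own validation.
MODES = {"min_latency", "balanced", "max_accuracy"}
VOICE_FOCUS = {"near-field", "far-field"}

# Inclusive numeric ranges taken from the parameter descriptions above.
RANGES = {
    "previous_context_n_turns": (0, 100),
    "vad_threshold": (0.0, 1.0),
    "voice_focus_threshold": (0.0, 1.0),
    "interruption_delay": (0, 1000),
}


def validate_settings(settings: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "mode" in settings and settings["mode"] not in MODES:
        problems.append("mode must be one of " + ", ".join(sorted(MODES)))
    if "voice_focus" in settings and settings["voice_focus"] not in VOICE_FOCUS:
        problems.append("voice_focus must be near-field or far-field")
    if settings.get("prompt") and settings.get("keyterms_prompt"):
        problems.append("prompt and keyterms_prompt cannot be combined")
    if "voice_focus_threshold" in settings and "voice_focus" not in settings:
        problems.append("voice_focus_threshold has no effect without voice_focus")
    for name, (low, high) in RANGES.items():
        if name in settings and not low <= settings[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    return problems
```

Running a check like this at startup surfaces conflicts such as prompt plus keyterms_prompt before the first request is made.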
General parameters
These apply across models and Pipecat setups. api_key, vad_force_turn_endpoint, should_interrupt, and speaker_format are passed directly to AssemblyAISTTService(...), not inside Settings.
- api_key: Your AssemblyAI API key.
- vad_force_turn_endpoint: True for Pipecat mode (VAD + Smart Turn controls turns); False for AssemblyAI’s built-in turn detection (u3-rt-pro / universal-3-5-pro only). See Turn detection.
- should_interrupt: Whether the user starting to speak interrupts the bot. Only applies in AssemblyAI’s built-in turn detection mode (vad_force_turn_endpoint=False).
- speaker_format: Template string for formatting speaker labels (e.g., "[{speaker}] {text}"). Used with speaker_labels.
- sample_rate: The sample rate of the audio stream.
- encoding: The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
Legacy parameters
These apply to the universal-streaming-english and universal-streaming-multilingual models, but do not affect Universal 3.5 Pro Realtime or u3-rt-pro:
- end_of_turn_confidence_threshold: Confidence threshold for end-of-turn detection. The U3 Pro family uses punctuation-based turn detection instead, so this parameter has no effect.
- format_turns: Whether to return formatted final transcripts. The U3 Pro family always returns formatted transcripts, so this parameter no longer applies.
Turn detection
In Pipecat, you choose which component decides when the user is done speaking with the vad_force_turn_endpoint flag on AssemblyAISTTService. The U3 Pro family uses a punctuation-based end-of-turn system: after a period of silence, the model checks for terminal punctuation (. ? !) rather than a confidence score. For more on how this works, see Configuring turn detection.
The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.
Pipecat mode (default, recommended)
When to use: Most voice agent applications requiring responsive interruptions.
- VAD + the Smart Turn analyzer control when the user is done speaking.
- A ForceEndpoint message is sent to AssemblyAI on VAD silence detection.
- max_turn_silence is automatically synchronized with min_turn_silence.
- Best for low-latency, responsive voice agents.
AssemblyAI’s built-in turn detection
When to use: When you want AssemblyAI’s punctuation-based turn detection to control turn endings, configured through the settings below.
- User speaks → audio streams to AssemblyAI.
- User pauses for min_turn_silence (e.g., 100 ms) → the model checks for terminal punctuation.
- If terminal punctuation (. ? !) is found → the turn ends immediately.
- If not → a partial is emitted and the turn continues waiting.
- If silence reaches max_turn_silence (e.g., 1000 ms) → the turn is forced to end regardless.
In this mode, the service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame, and SpeechStarted events drive fast barge-in. Only available with u3-rt-pro / universal-3-5-pro (other models require Pipecat mode).
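The punctuation-based decision described in the steps above can be written as a small pure function. This is an illustrative sketch of the rule only, not AssemblyAI’s server implementation; check_turn and its return labels are hypothetical names, and the 100 ms and 1000 ms defaults are the example values used in this section.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def check_turn(silence_ms: int, transcript: str,
               min_turn_silence: int = 100,
               max_turn_silence: int = 1000) -> str:
    """Classify a turn after `silence_ms` of continuous silence."""
    if silence_ms >= max_turn_silence:
        # Forced end: punctuation no longer matters.
        return "forced_end"
    if silence_ms >= min_turn_silence:
        # Speculative check: end only if the text looks finished.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end"
        return "partial"
    # Not enough silence yet to run the speculative check.
    return "listening"
```

For example, check_turn(150, "That's all.") ends the turn, while check_turn(150, "I'd like to") only yields a partial and keeps waiting until max_turn_silence is reached.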
Entity splitting tradeoff
Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.
min_turn_silence too low
The speculative check fires too early, splitting entities on punctuation:
max_turn_silence too low
The forced turn-end cuts off the user mid-thought:
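Both failure modes can be reproduced with a toy simulation. The sketch below assumes speech arrives as (text, pause_after_ms) chunks and applies the punctuation rule from this guide; split_turns and the sample utterances are hypothetical and only illustrate the effect of the two thresholds.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def split_turns(chunks, min_turn_silence, max_turn_silence):
    """Group (text, pause_after_ms) chunks into turns."""
    turns, current = [], []
    for text, pause_ms in chunks:
        current.append(text)
        punctuated = text.rstrip().endswith(TERMINAL_PUNCTUATION)
        forced = pause_ms >= max_turn_silence
        speculative = pause_ms >= min_turn_silence and punctuated
        if forced or speculative:
            turns.append(" ".join(current))
            current = []
    if current:
        # Trailing speech that never reached an end-of-turn condition.
        turns.append(" ".join(current))
    return turns


# Hypothetical dictation with a short pause inside the number.
number = [("My number is 555.", 150), ("0142.", 1500)]
# Hypothetical request with a long mid-sentence pause.
booking = [("I'd like to book a table for", 700), ("four people on Friday.", 1500)]
```

With min_turn_silence=100 the number is split into two turns, while min_turn_silence=400 keeps it in one. With max_turn_silence=500 the booking request is cut after "for", while max_turn_silence=1000 keeps the whole sentence together.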
Universal 3.5 Pro Realtime’s formatting is significantly better when it has full context in a single turn: email addresses, phone numbers, credit card numbers, and physical addresses all benefit. If your use case involves alphanumeric dictation, raise max_turn_silence during those portions of the conversation (e.g., to 2000–4000 ms) using dynamic configuration, then lower it again afterward. In Pipecat mode, raise min_turn_silence (which max_turn_silence follows) for the same effect.
Latency
A voice agent feels responsive when the gap between the user finishing and the agent replying is short. Start with the mode preset ("min_latency", "balanced", or "max_accuracy"), the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there.
mode is set at construction time (it can’t be changed mid-session) and influences the defaults of the levers below. Any value you set explicitly still wins. Leave it unset to use the server’s default preset. See Optimizing accuracy and latency.
From there, fine-tune the individual levers:
- End-of-turn timing. min_turn_silence (speculative check) and max_turn_silence (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities; see Turn detection.
- Time to first partial. interruption_delay controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of 300 ms on top of the configured value.
- Sample rate. Use 16 kHz (sample_rate=16000). Higher rates don’t improve accuracy and only add bandwidth.
- Continuous partials. continuous_partials (on by default) emits a partial every ~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
- Skip client-side preprocessing. Don’t run your own noise cancellation before audio reaches the model, because the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side Voice Focus instead.
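To put a number on the sample-rate point, the raw audio payload can be computed directly. This is a back-of-the-envelope sketch that assumes mono audio and ignores transport framing overhead; audio_bytes_per_second is a hypothetical helper, and the bytes-per-sample values follow from the encoding definitions (16-bit PCM and 8-bit mu-law).

```python
# Bytes per sample for the two encodings this guide lists.
BYTES_PER_SAMPLE = {"pcm_s16le": 2, "pcm_mulaw": 1}


def audio_bytes_per_second(sample_rate: int,
                           encoding: str = "pcm_s16le",
                           channels: int = 1) -> int:
    """Raw audio payload per second, before any transport overhead."""
    return sample_rate * BYTES_PER_SAMPLE[encoding] * channels


print(audio_bytes_per_second(16000))               # 32000
print(audio_bytes_per_second(48000))               # 96000
print(audio_bytes_per_second(8000, "pcm_mulaw"))   # 8000
```

Streaming at 48 kHz triples the payload relative to 16 kHz without an accuracy benefit, which is why 16 kHz is the suggested rate.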
Latency breakdown
| Stage | Typical | Controlled by |
|---|---|---|
| Network round trip | ~50 ms | — |
| Speech-to-text | ~200–300 ms | model |
| First partial (TTFT) | configured interruption_delay + ~300 ms server min | interruption_delay |
| End of turn (terminal punctuation found) | min_turn_silence (default 100 ms) | min_turn_silence |
| End of turn (no punctuation, forced) | up to max_turn_silence | max_turn_silence |
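The two configurable rows of the table reduce to simple arithmetic. The sketch below is a rough estimate under the figures stated in this guide (a 300 ms server minimum for the first partial, and silence thresholds for end of turn); the function names are hypothetical and network and model time are left out.

```python
SERVER_MIN_FIRST_PARTIAL_MS = 300  # server-side minimum stated in this guide


def effective_first_partial_ms(interruption_delay: int) -> int:
    """Approximate time to first partial for a configured interruption_delay."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000")
    return interruption_delay + SERVER_MIN_FIRST_PARTIAL_MS


def end_of_turn_wait_ms(ends_with_punctuation: bool,
                        min_turn_silence: int = 100,
                        max_turn_silence: int = 1000) -> int:
    """Silence needed to end a turn; the unpunctuated case is the upper bound."""
    return min_turn_silence if ends_with_punctuation else max_turn_silence
```

With these figures, interruption_delay=0 gives roughly 300 ms to the first partial and interruption_delay=500 gives roughly 800 ms, matching the parameter description.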
Accuracy
Universal 3.5 Pro Realtime is accurate out of the box. When you need more (domain vocabulary, proper nouns, noisy audio), reach for these levers. For entity-heavy dictation, also tune turn detection (see Entity splitting tradeoff), and note that the high-level mode preset shifts the overall accuracy/latency balance (use max_accuracy to favor quality).
Prompting
Universal 3.5 Pro Realtime supports a prompt parameter for contextual prompting: a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.
Key terms
Use keyterms_prompt to boost recognition of specific names, brands, or domain terms. On its own, your terms are appended to the default prompt automatically, so you get boosting and prompting together.
You can’t pass prompt and keyterms_prompt in the same request; doing so raises a validation error. You don’t have to give up term boosting to use a contextual prompt, though. Either:
- Pass keyterms_prompt on its own (your terms are appended to the default prompt automatically), or
- Fold the terms into a custom prompt, e.g. end it with "Make sure to boost the words Xiomara, Saoirse, Pipecat in the audio."
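The two options above can be captured in one small helper that never emits both fields. This is a sketch under the rules stated in this section; build_prompt_fields is a hypothetical function, and the boosting sentence is the wording suggested above rather than a required format.

```python
def build_prompt_fields(prompt=None, keyterms=None) -> dict:
    """Return either a prompt field or a keyterms_prompt field, never both."""
    keyterms = list(keyterms or [])
    if prompt and keyterms:
        # Fold the terms into the custom prompt instead of sending both fields.
        boost = f"Make sure to boost the words {', '.join(keyterms)} in the audio."
        return {"prompt": f"{prompt.rstrip()} {boost}"}
    if prompt:
        return {"prompt": prompt}
    if keyterms:
        # Terms alone: the service appends them to its default prompt.
        return {"keyterms_prompt": keyterms}
    return {}
```

The returned dict can then be merged into whatever settings object your pipeline builds, so the mutual-exclusion rule is enforced in one place.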
Conversation context
Give the model both sides of the dialog so it transcribes the next user turn more accurately. Universal 3.5 Pro Realtime keeps a short, per-session memory of the conversation from two sources:
- The agent half: what your agent just said.
- The user half: prior STT-finalized user turns.
For example, after the agent asks "What's your email address?", the model can produce "user@assemblyai.com" instead of "user at assemblyai dot com". This has the biggest impact on short replies ("yes", "7pm", single names) and spelled-out entities. See Conversation context for the full reference.
In Pipecat, conversation context is automatic — no event wiring required.
As long as your pipeline includes the standard LLM context aggregator (the
assistant_aggregator from LLMContextAggregatorPair), Pipecat broadcasts an
LLMContextAssistantTurnFrame when each bot turn completes, and
AssemblyAISTTService feeds that reply to the model as agent_context
automatically. Just use a U3 Pro family model on pipecat-ai 1.4.0+.
| Parameter | Type | Description |
|---|---|---|
| agent_context | str | Your agent’s most recent spoken reply, up to ~1500 characters. Set it at construction time to seed an opening greeting; subsequent replies are fed automatically. |
| previous_context_n_turns | int | How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover entirely. Construction-time only; server default is 3. |
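As a rough mental model of the two parameters in the table, carryover behaves like a bounded window of recent entries. The sketch below is not the service’s implementation: ContextWindow is a hypothetical class, and it assumes that agent and user entries share one window, that the limit is exactly 1500 characters, and that clipping keeps the start of the text.

```python
from collections import deque

AGENT_CONTEXT_LIMIT = 1500  # the guide says "~1500"; the exact value is assumed


class ContextWindow:
    """Toy model of per-session conversation carryover."""

    def __init__(self, previous_context_n_turns: int = 3, agent_context: str = ""):
        if not 0 <= previous_context_n_turns <= 100:
            raise ValueError("previous_context_n_turns must be between 0 and 100")
        # maxlen=0 silently discards every entry, mirroring "0 disables carryover".
        self.entries = deque(maxlen=previous_context_n_turns)
        if agent_context:
            self.add("agent", agent_context)  # seed, e.g. the opening greeting

    def add(self, speaker: str, text: str) -> None:
        if speaker == "agent":
            text = text[:AGENT_CONTEXT_LIMIT]
        self.entries.append((speaker, text))
```

Once the window is full, each new entry pushes out the oldest one, so only the most recent exchanges inform the next transcription.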
Seeding the opening greeting
The automatic feed kicks in once your agent completes its first turn. To give the model context for the user’s very first reply (the answer to your greeting), set agent_context at construction time.
Manual control with update_agent_context()
If your pipeline doesn’t use the standard LLM context aggregator, or you want explicit control over what the model sees, push the agent’s reply yourself with update_agent_context(). This is a live update; no reconnect is required.
agent_context, previous_context_n_turns, and update_agent_context() are supported only on the U3 Pro family (universal-3-5-pro, u3-rt-pro). Values are clipped to ~1500 characters and re-seeded automatically on reconnect. Setting previous_context_n_turns=0 disables the automatic feed as well.
Voice focus
Voice Focus isolates the primary speaker and suppresses background noise (chatter, keyboard clicks, fan hum, room echo) server-side, before audio reaches the model. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.
| Parameter | Type | Description |
|---|---|---|
| voice_focus | str | "near-field" for headsets, handsets, and other close-talking mics; "far-field" for conference rooms, laptop mics, and other distant capture. |
| voice_focus_threshold | float | Optional. 0.0–1.0; higher values suppress background audio more aggressively. |
Interruption handling
Barge-in (the user interrupting while the agent is speaking) is handled by Pipecat, and the signals that drive it depend on your turn detection mode.
- Pipecat mode (vad_force_turn_endpoint=True). Pipecat’s local VAD and the Smart Turn analyzer detect the user starting to speak and interrupt the bot’s TTS. AssemblyAI also emits SpeechStarted events as a backstop.
- AssemblyAI’s built-in turn detection (vad_force_turn_endpoint=False). The service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame and uses AssemblyAI’s SpeechStarted events for fast barge-in. Set should_interrupt=False (constructor argument) to disable barge-in entirely in this mode.
To keep short backchannels ("mhm", "yeah", "okay") from interrupting the agent, keep your VAD threshold aligned with vad_threshold and lean on Pipecat’s Smart Turn analyzer, which evaluates whether speech is a genuine turn rather than a filler.
Dynamic configuration
Update settings mid-conversation by queueing an STTUpdateSettingsFrame with a settings delta, adapting to the conversation stage as it unfolds. See stt-assemblyai.py for a complete working example.
| Conversation stage | Adjustment |
|---|---|
| Caller identification (names, account IDs) | Boost terms with keyterms_prompt |
| Entity dictation (email, phone, address) | Raise max_turn_silence to ~2000–4000 ms, then lower it again afterward |
| After each agent reply | Automatic — or push agent_context via update_agent_context() |
| Faster barge-in | Lower interruption_delay |
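The stage table can be kept as data so that each stage maps to one settings delta. This is a plain-Python sketch of the bookkeeping only: the stage names, the concrete values, and apply_delta are hypothetical, and in a real pipeline the delta would be delivered through an STTUpdateSettingsFrame as described above.

```python
# Hypothetical per-stage deltas; values are picked from the ranges in this guide.
STAGE_DELTAS = {
    "caller_identification": {"keyterms_prompt": ["Xiomara", "Saoirse"]},
    "entity_dictation": {"max_turn_silence": 3000},
    "default": {"max_turn_silence": 1000},
    "faster_barge_in": {"interruption_delay": 0},
}


def apply_delta(current: dict, stage: str) -> dict:
    """Return a new settings dict with the stage's delta merged in."""
    return {**current, **STAGE_DELTAS[stage]}


base = {"min_turn_silence": 100, "max_turn_silence": 1000}
dictation = apply_delta(base, "entity_dictation")
restored = apply_delta(dictation, "default")
```

Returning a new dict rather than mutating the current one makes it easy to compare the before and after states and to send only the keys that actually changed.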
Speaker diarization
Identify different speakers in multi-party conversations.
Basic diarization
Speaker labels ("A", "B", "C") are included in final transcripts.
With custom formatting
Format transcripts with speaker labels for LLM context:
| Style | Format string |
|---|---|
| XML | <{speaker}>{text}</{speaker}> |
| Markdown | **{speaker}**: {text} |
| Bracket | [{speaker}] {text} |
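To preview what each format string in the table produces, the templates can be rendered with Python’s built-in str.format. This is a sketch for experimentation only; format_turn is a hypothetical helper, and whether the plugin substitutes the placeholders in exactly this way is an assumption.

```python
# Format strings copied from the table above.
SPEAKER_FORMATS = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "**{speaker}**: {text}",
    "bracket": "[{speaker}] {text}",
}


def format_turn(style: str, speaker: str, text: str) -> str:
    """Render one transcript turn with the chosen speaker format."""
    return SPEAKER_FORMATS[style].format(speaker=speaker, text=text)


print(format_turn("xml", "A", "Hello there."))       # <A>Hello there.</A>
print(format_turn("markdown", "A", "Hello there."))  # **A**: Hello there.
print(format_turn("bracket", "B", "Hi."))            # [B] Hi.
```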
Running your agent
Development mode (local audio)
Production with Daily
For production deployments, use the Daily transport for WebRTC-based real-time audio/video. Your agent joins a Daily room as a participant and handles audio I/O through Daily’s infrastructure.
Telephony with Telnyx
When bridging phone calls through Pipecat (e.g., via Telnyx), the audio is 8 kHz, not 16 kHz. Match the transport sample rates, and on the STT side use sample_rate=8000 and encoding="pcm_mulaw" for 8 kHz telephony.
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| universal-3-5-pro not recognized | pipecat-ai older than 1.4.0 | Upgrade: pip install -U "pipecat-ai[assemblyai]" |
| Turn over-segmentation | min_turn_silence too low | Increase from 100 ms to 200–500 ms |
| Entities split across turns | max_turn_silence too low (AssemblyAI mode) | Increase max_turn_silence (e.g., 1500–3500 ms); in Pipecat mode, raise min_turn_silence |
| Latency on non-terminal utterances | max_turn_silence too high | Lower max_turn_silence |
| Conversation context has no effect | Non-U3-Pro model, or previous_context_n_turns=0 | Use a U3 Pro family model and leave previous_context_n_turns unset (or > 0) |
| Mid-session setting change drops audio | Reconnect on a non-agent_context setting change | Expected — only agent_context updates live; use update_agent_context() for context |
| Mis-heard names, brands, or jargon | No vocabulary hints | Add keyterms_prompt, or supply prompt/agent_context for context |
| Poor accuracy in noisy audio | Background noise or room echo | Enable voice_focus (near-field or far-field) |
Migrating from another STT provider
To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below.
How are you detecting end-of-turn today?
| Today | Recommended on AssemblyAI |
|---|---|
| Your STT provider’s own end-of-turn model | AssemblyAI’s built-in turn detection: vad_force_turn_endpoint=False with min_turn_silence=100, max_turn_silence=1000. |
| Silence / VAD only, with your own turn logic | Pipecat mode (vad_force_turn_endpoint=True, default). VAD + Smart Turn decide turns; AssemblyAI returns finals ASAP. |
| You want the framework to own turn-taking | Pipecat mode (default) — Pipecat’s Smart Turn analyzer makes the turn decision. |
Which model and settings are you migrating from?
| What you pass today | AssemblyAI equivalent |
|---|---|
| Current model (Deepgram, ElevenLabs, etc.) | model="universal-3-5-pro" (recommended flagship) or "u3-rt-pro" |
| Overall accuracy/latency tuning | mode="min_latency" / "balanced" / "max_accuracy" — a one-line starting point before fine-tuning |
| Endpointing / silence thresholds | min_turn_silence (speculative end-of-turn) and max_turn_silence (forced end) |
| Custom vocabulary / keywords | keyterms_prompt=[...]; broader domain context → prompt |
| Provider-side conversation context | Automatic — include the LLM context aggregator; seed greetings via agent_context |
| Formatting / punctuation toggles | On by default — formatted transcripts always (format_turns does not apply) |
| Telephony / SIP routing | sample_rate=8000 and encoding="pcm_mulaw" for 8 kHz telephony |
| Client-side noise cancellation | Drop it; use server-side Voice Focus instead |
Speech model comparison
Interested in using a different model?
| Feature | U3 Pro family (universal-3-5-pro, u3-rt-pro) | universal-streaming-english | universal-streaming-multilingual |
|---|---|---|---|
| Turn Detection Modes | | | |
| Pipecat mode (VAD + Smart Turn) | ✅ | ✅ | ✅ |
| AssemblyAI turn detection mode | ✅ | ❌ | ❌ |
| Turn Detection Parameters | | | |
| min_turn_silence | ✅ | ✅ | ✅ |
| max_turn_silence | ✅ | ✅ | ✅ |
| end_of_turn_confidence_threshold | ❌ | ✅ (1.0) | ✅ (1.0) |
| continuous_partials | ✅ | ❌ | ❌ |
| interruption_delay | ✅ | ❌ | ❌ |
| Advanced Features | | | |
| Keyterms boosting | ✅ | ✅ | ✅ |
| Custom prompting (beta) | ✅ | ❌ | ❌ |
| Conversation context (carryover) | ✅ | ❌ | ❌ |
| Voice Focus | ✅ | ❌ | ❌ |
| Speaker diarization | ✅ | ✅ | ✅ |
| Dynamic parameter updates | ✅ | ✅ | ✅ |
| Language Support | | | |
| Multilingual code switching | ✅ | ❌ | ✅ |
| Language detection | ✅ | ❌ | ✅ |
- ✅ Fully supported and recommended
- ❌ Not supported / Not used
The U3 Pro family is recommended for all new voice agent implementations.
The universal-streaming models are maintained for backward compatibility but
lack the optimizations and features specifically designed for real-time
conversational AI.