Overview

This guide covers integrating AssemblyAI’s Universal 3.5 Pro Realtime speech-to-text model into a Pipecat voice agent. Everything here applies equally to Universal-3 Pro Streaming (u3-rt-pro) — both belong to the same U3 Pro family and share every parameter in this guide, so you can swap the model string without changing anything else.
Universal 3.5 Pro Realtime is our flagship next-generation streaming model for voice agents: multilingual and promptable, with conversation context and Voice Focus. It is available on Pipecat 1.4.0+; set model="universal-3-5-pro".
AssemblyAI provides the speech-to-text and, optionally, the turn detection in your Pipecat pipeline. Once you have an agent running, tune it for what matters most to your use case:

Turn detection

Decide when the user is done speaking — the two Pipecat modes, defaults, and entity tuning.

Latency

Shorten the gap between the user finishing and the agent replying.

Accuracy

Prompting, key terms, conversation context, and noise handling.

Interruptions

Natural barge-in while the agent is speaking.

Pipecat AssemblyAI STT plugin

View Pipecat’s AssemblyAI STT plugin reference.

Quickstart

Get a working, talking agent in a few minutes, then optimize from there.
1. Install Pipecat

Install Pipecat with the AssemblyAI, LLM, and TTS extras you need:
pip install "pipecat-ai[assemblyai,openai,cartesia]" python-dotenv
What’s included:
  • assemblyai: AssemblyAI U3 Pro STT service
  • openai: OpenAI LLM service (used in the example)
  • cartesia: Cartesia TTS service (used in the example)
The example uses OpenAI and Cartesia, but you can use any LLM or TTS supported by Pipecat — just swap the extras (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).
Universal 3.5 Pro Realtime, automatic conversation context, and Voice Focus require pipecat-ai 1.4.0+. Older versions won’t recognize the universal-3-5-pro model.
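A startup check can catch an outdated install before the model string is rejected. The sketch below uses only the standard library. The helper names (parse_release, meets_minimum, installed_pipecat_version) are hypothetical, and the parser assumes a plain dotted release string such as 1.4.0, so a pre-release like 1.4.0rc1 is treated the same as 1.4.0.

```python
from importlib import metadata

MIN_PIPECAT = (1, 4, 0)  # Minimum pipecat-ai release stated in this guide


def parse_release(version: str) -> tuple:
    """Turn '1.4.0' into (1, 4, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MIN_PIPECAT) -> bool:
    """True when the release is at least the minimum this guide requires."""
    return parse_release(version) >= minimum


def installed_pipecat_version():
    """Return the installed pipecat-ai version string, or None if absent."""
    try:
        return metadata.version("pipecat-ai")
    except metadata.PackageNotFoundError:
        return None
```

Call installed_pipecat_version() once at startup and pass the result to meets_minimum() before constructing the STT service.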
2. Set your API keys

Set your API keys in a .env file:
ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key
You can obtain an AssemblyAI API key by signing up here and navigating to the API Keys tab of the dashboard.
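A missing key usually surfaces later as an opaque authentication error, so it can help to fail fast. This is a minimal sketch, not part of Pipecat: missing_keys is a hypothetical helper, and the key list simply mirrors the three variables used in this quickstart.

```python
import os

# The three variables this quickstart expects in the .env file.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_keys(env=None) -> list:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Run it right after load_dotenv() and raise an error listing the returned names if the list is not empty.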
3. Build a minimal agent

The example below uses Pipecat-controlled turn detection (the default). Pay attention to the comments for switching to AssemblyAI’s built-in turn detection, and note that the assistant aggregator at the end of the pipeline is what enables automatic conversation context.
import os

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import PipelineParams, PipelineWorker
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.workers.runner import WorkerRunner

load_dotenv()

transport_params = {
    "daily": lambda: DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    "webrtc": lambda: TransportParams(audio_in_enabled=True, audio_out_enabled=True),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    stt = AssemblyAISTTService(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        settings=AssemblyAISTTService.Settings(
            model="universal-3-5-pro",
            min_turn_silence=100,
            # max_turn_silence is auto-synced to min_turn_silence in Pipecat mode.
            # vad_threshold=0.3,            # Align with your local VAD's threshold
            # continuous_partials=True,     # Default — steady ~3s partials during long turns
            # interruption_delay=0,         # Optional: faster first partial (~300ms effective)
        ),
        vad_force_turn_endpoint=True,  # Pipecat mode (default).
        # Set False to use AssemblyAI's built-in turn detection (u3-rt-pro / universal-3-5-pro only):
        # vad_force_turn_endpoint=False,
    )

    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"])
    tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"])

    context = LLMContext()
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
    )

    pipeline = Pipeline(
        [
            transport.input(),     # Transport user input
            stt,                   # STT
            user_aggregator,       # User responses
            llm,                   # LLM
            tts,                   # TTS
            transport.output(),    # Transport bot output
            assistant_aggregator,  # Assistant responses → automatic conversation context
        ]
    )

    worker = PipelineWorker(pipeline, params=PipelineParams(enable_metrics=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message(
            {"role": "system", "content": "You are a helpful voice assistant. Keep replies brief and speakable."}
        )
        await worker.queue_frames([LLMRunFrame()])

    runner = WorkerRunner(handle_sigint=runner_args.handle_sigint)
    await runner.add_workers(worker)
    await runner.run()


async def bot(runner_args: RunnerArguments):
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()
Two complete, runnable examples live in the Pipecat repo: voice-assemblyai.py (Pipecat turn detection) and voice-assemblyai-turn-detection.py (AssemblyAI’s built-in turn detection).
4. Run and test

Run the agent directly with local audio:
python your_agent.py
Speak into your microphone after hearing the greeting. For WebRTC or Daily testing, see Running your agent.

Parameters reference

Universal 3.5 Pro Realtime parameters

These are the key parameters to tune. Set them inside AssemblyAISTTService.Settings(...). They apply to the whole U3 Pro family (universal-3-5-pro and u3-rt-pro).
model
str
default:"u3-rt-pro"
The streaming model. "universal-3-5-pro" is the recommended flagship model; the plugin currently defaults to "u3-rt-pro", so set model explicitly. Both belong to the U3 Pro family and share every parameter below.
mode
str
Accuracy/latency preset: "min_latency", "balanced", or "max_accuracy". Sets sensible defaults for mode-dependent fields; any value you set explicitly still takes precedence. The server defaults to "balanced". Construction-time only. U3 Pro family only. See Optimizing accuracy and latency.
keyterms_prompt
list[str]
List of terms to boost recognition for. Used on its own, your terms are appended to the default prompt automatically. Can’t be set in the same request as prompt — see Key terms to combine boosting with a custom prompt.
prompt
str
Contextual prompt — a natural-language description of what the audio is about (domain, scenario, or full details). Can’t be set in the same request as keyterms_prompt; fold the terms into the prompt text instead (see Key terms). Prompting is currently a beta feature: see Prompting for more information.
agent_context
str
Context carryover seed — your agent’s most recent spoken reply, up to ~1500 characters, used to transcribe the next user turn more accurately. Set it at construction time to seed an opening greeting; later turns are fed automatically. U3 Pro family only. See Conversation context.
previous_context_n_turns
int
default:"3"
How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover entirely (including the automatic agent_context feed). Construction-time only; leave unset for the server default (3). U3 Pro family only.
min_turn_silence
int
default:"100"
Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. (Formerly min_end_of_turn_silence_when_confident, deprecated but still supported with a warning.)
max_turn_silence
int
default:"1000"
Maximum silence before the turn is forced to end, regardless of punctuation. Auto-synced to min_turn_silence in Pipecat mode; respected as configured in AssemblyAI’s built-in turn detection mode.
vad_threshold
float
default:"0.3"
AssemblyAI’s internal VAD threshold (0.0–1.0) for classifying audio frames as silence. Align it with your local VAD’s activation threshold to avoid a “dead zone” where AssemblyAI transcribes speech your VAD hasn’t detected yet.
voice_focus
str
Server-side noise suppression that isolates the primary speaker. "near-field" for close-talking mics, "far-field" for distant capture. Construction-time only. U3 Pro family only. See Voice focus.
voice_focus_threshold
float
How aggressively voice_focus suppresses background audio. Range 0.0–1.0; higher is more aggressive. Only takes effect when voice_focus is set. Construction-time only. U3 Pro family only.
continuous_partials
bool
default:"True"
Whether to emit additional partial transcripts during long turns. When enabled (the default on both the API and this plugin), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. See Continuous partials for details.
interruption_delay
int
default:"500"
How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.
language_detection
bool
Universal 3.5 Pro Realtime code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages.
speaker_labels
bool
default:"False"
Enable speaker diarization. See Speaker diarization.

General parameters

These apply across models and Pipecat setups. api_key, vad_force_turn_endpoint, should_interrupt, and speaker_format are passed directly to AssemblyAISTTService(...), not inside Settings.
api_key
str
required
Your AssemblyAI API key.
vad_force_turn_endpoint
bool
default:"True"
True for Pipecat mode (VAD + Smart Turn controls turns); False for AssemblyAI’s built-in turn detection (u3-rt-pro / universal-3-5-pro only). See Turn detection.
should_interrupt
bool
default:"True"
Whether the user starting to speak interrupts the bot. Only applies in AssemblyAI’s built-in turn detection mode (vad_force_turn_endpoint=False).
speaker_format
str
Template string for formatting speaker labels (e.g., "[{speaker}] {text}"). Used with speaker_labels.
sample_rate
int
default:"16000"
The sample rate of the audio stream.
encoding
str
default:"pcm_s16le"
The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

Legacy parameters

These apply to the universal-streaming-english and universal-streaming-multilingual models, but do not affect Universal 3.5 Pro Realtime or u3-rt-pro:
end_of_turn_confidence_threshold
float
Confidence threshold for end-of-turn detection. The U3 Pro family uses punctuation-based turn detection instead, so this parameter has no effect.
format_turns
bool
default:"True"
Whether to return formatted final transcripts. The U3 Pro family always returns formatted transcripts, so this parameter no longer applies.

Turn detection

In Pipecat, you choose which component decides when the user is done speaking with the vad_force_turn_endpoint flag on AssemblyAISTTService. The U3 Pro family uses a punctuation-based end-of-turn system: after a period of silence, the model checks for terminal punctuation (. ? !) rather than a confidence score. For more on how this works, see Configuring turn detection.
The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.

Pipecat mode (default)

When to use: Most voice agent applications requiring responsive interruptions.
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
    ),
    vad_force_turn_endpoint=True,  # Default (Pipecat mode)
)
How it works:
  • VAD + the Smart Turn analyzer control when the user is done speaking.
  • A ForceEndpoint message is sent to AssemblyAI on VAD silence detection.
  • max_turn_silence is automatically synchronized with min_turn_silence.
  • Best for low-latency, responsive voice agents.

AssemblyAI’s built-in turn detection

When to use: When you want AssemblyAI’s punctuation-based turn detection to control turn endings, configured through the settings below.
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Now respected independently
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection
)
How it works:
  1. User speaks → audio streams to AssemblyAI.
  2. User pauses for min_turn_silence (e.g., 100ms) → the model checks for terminal punctuation.
  3. If terminal punctuation (. ? !) is found → the turn ends immediately.
  4. If not → a partial is emitted and the turn continues waiting.
  5. If silence reaches max_turn_silence (e.g., 1000ms) → the turn is forced to end regardless.
In this mode all timing parameters are respected as configured, the service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame, and SpeechStarted events drive fast barge-in. Only available with u3-rt-pro / universal-3-5-pro (other models require Pipecat mode).
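The five steps above can be modeled as a small decision function. This is a toy sketch for reasoning about the two timers, not the service's implementation: turn_decision is a hypothetical name, and it assumes the transcript text already carries whatever punctuation the model has produced at the moment of the check.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(transcript: str, silence_ms: int,
                  min_turn_silence: int = 100,
                  max_turn_silence: int = 1000) -> str:
    """Classify the current pause as 'listening', 'partial', or 'end_of_turn'."""
    if silence_ms >= max_turn_silence:
        # Step 5: the turn is forced to end regardless of punctuation.
        return "end_of_turn"
    if silence_ms < min_turn_silence:
        # Steps 1-2: the pause is too short for the speculative check to fire.
        return "listening"
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        # Step 3: terminal punctuation found, so the turn ends immediately.
        return "end_of_turn"
    # Step 4: no terminal punctuation, so a partial is emitted and the turn continues.
    return "partial"
```

Running the entity examples from this guide through the function shows why a short min_turn_silence ends "It's John." after a 100 ms pause, while an unfinished sentence waits for max_turn_silence.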

Entity splitting tradeoff

Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.

min_turn_silence too low

The speculative check fires too early, splitting entities on punctuation:
# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."                    → FINAL (100ms pause, check fires, period found → turn ends)
"Smith."                        → FINAL
"At gmail.com."                 → FINAL

# With (min_turn_silence=400, max_turn_silence=1000)
"It's john.smith@gmail.com."    → FINAL (single turn, properly formatted)

max_turn_silence too low

The forced turn-end cuts off the user mid-thought:
# With (min_turn_silence=100, max_turn_silence=1000)
"I wanted to check on my order from..."  → FINAL (1000ms silence, forced end)
"last Tuesday, order number 4829."       → FINAL (new turn)

# With (min_turn_silence=100, max_turn_silence=2000)
"I wanted to check on my order from last Tuesday, order number 4829."  → FINAL (single turn)
Universal 3.5 Pro Realtime’s formatting is significantly better when it has full context in a single turn: email addresses, phone numbers, credit card numbers, and physical addresses all benefit. If your use case involves alphanumeric dictation, raise max_turn_silence during those portions of the conversation (e.g., to 2000–4000 ms) using dynamic configuration, then lower it again afterward. In Pipecat mode, raise min_turn_silence (which max_turn_silence follows) for the same effect.

Latency

A voice agent feels responsive when the gap between the user finishing and the agent replying is short. Start with the mode preset — the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        mode="balanced",  # "min_latency" (fastest) · "balanced" · "max_accuracy" (best quality)
    ),
)
mode is set at construction time (it can’t be changed mid-session) and influences the defaults of the levers below. Any value you set explicitly still wins. Leave it unset to use the server’s default preset. See Optimizing accuracy and latency. From there, fine-tune the individual levers:
  • End-of-turn timing. min_turn_silence (speculative check) and max_turn_silence (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities — see Turn detection.
  • Time to first partial. interruption_delay controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of 300ms on top of the configured value.
  • Sample rate. Use 16 kHz (sample_rate=16000). Higher rates don’t improve accuracy and only add bandwidth.
  • Continuous partials. continuous_partials (on by default) emits a partial every ~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
  • Skip client-side preprocessing. Don’t run your own noise cancellation before audio reaches the model — the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side Voice Focus instead.

Latency breakdown

| Stage | Typical | Controlled by |
| --- | --- | --- |
| Network round trip | ~50 ms | |
| Speech-to-text | ~200–300 ms | model |
| First partial (TTFT) | configured interruption_delay + ~300 ms server min | interruption_delay |
| End of turn (terminal punctuation found) | min_turn_silence (default 100 ms) | min_turn_silence |
| End of turn (no punctuation, forced) | up to max_turn_silence | max_turn_silence |
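The two timing rows that depend on your own settings can be estimated with simple arithmetic. The sketch below is an illustration under the figures quoted in this guide (a 300 ms server minimum and a 0–1000 ms range for interruption_delay). Both function names are hypothetical, and the results are rough estimates that exclude network and model time.

```python
SERVER_MIN_FIRST_PARTIAL_MS = 300  # Server-side minimum quoted in this guide


def first_partial_ms(interruption_delay: int = 500) -> int:
    """Approximate delay before the first partial, in milliseconds."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + SERVER_MIN_FIRST_PARTIAL_MS


def end_of_turn_silence_ms(ends_with_punctuation: bool,
                           min_turn_silence: int = 100,
                           max_turn_silence: int = 1000) -> int:
    """Silence needed to end the turn: the speculative check if terminal
    punctuation is found, otherwise the forced upper bound."""
    return min_turn_silence if ends_with_punctuation else max_turn_silence
```

With the defaults, first_partial_ms() gives 800 and first_partial_ms(0) gives 300, matching the effective values listed for interruption_delay.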

Accuracy

Universal 3.5 Pro Realtime is accurate out of the box. When you need more — domain vocabulary, proper nouns, noisy audio — reach for these levers. For entity-heavy dictation, also tune turn detection (see Entity splitting tradeoff), and note that the high-level mode preset shifts the overall accuracy/latency balance (use max_accuracy to favor quality).

Prompting

Beta feature: Prompting is considered a beta feature for Universal 3.5 Pro Realtime. While it can be a powerful tool for improving accuracy in certain use cases, we recommend starting without a prompt to establish baseline performance first. Once you have tested the baseline, add context to optimize further for your use case (e.g., the language mix to expect, or the use case or domain).
Universal 3.5 Pro Realtime supports a prompt parameter for contextual prompting — a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        prompt="Customer support call about an internet service outage.",
    ),
)

Key terms

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms. On its own, your terms are appended to the default prompt automatically — so you get boosting and prompting together:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)
You can’t pass prompt and keyterms_prompt in the same request — doing so raises a validation error. You don’t have to give up term boosting to use a contextual prompt, though. Either:
  • Pass keyterms_prompt on its own — your terms are appended to the default prompt automatically, or
  • Fold the terms into a custom prompt, e.g. end it with "Make sure to boost the words Xiomara, Saoirse, Pipecat in the audio."
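One way to keep the two options from colliding is to build the keyword arguments in a single place. The helper below is a hypothetical sketch, not part of the plugin. It returns either prompt or keyterms_prompt, never both, and folds the terms into the prompt using the example sentence from this guide when both are supplied.

```python
def prompt_settings(prompt=None, keyterms=None) -> dict:
    """Build keyword arguments that never contain both prompt and keyterms_prompt."""
    terms = [t.strip() for t in (keyterms or []) if t.strip()]
    if prompt and terms:
        # Both supplied: fold the terms into the prompt text.
        sentence = "Make sure to boost the words " + ", ".join(terms) + " in the audio."
        return {"prompt": f"{prompt.rstrip()} {sentence}"}
    if prompt:
        return {"prompt": prompt}
    if terms:
        # Terms only: they are appended to the default prompt by the service.
        return {"keyterms_prompt": terms}
    return {}
```

The result can then be unpacked into the settings, for example AssemblyAISTTService.Settings(model="universal-3-5-pro", **prompt_settings(prompt=..., keyterms=[...])).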

Conversation context

Give the model both sides of the dialog so it transcribes the next user turn more accurately. Universal 3.5 Pro Realtime keeps a short, per-session memory of the conversation from two sources:
  • The agent half — what your agent just said.
  • The user half — prior STT-finalized user turns.
With the agent’s question in context, the model can anticipate the answer, sharpen entity recognition, and disambiguate similar-sounding words. For example, after your agent asks "What's your email address?", the model can produce "user@assemblyai.com" instead of "user at assemblyai dot com". This has the biggest impact on short replies ("yes", "7pm", single names) and spelled-out entities. See Conversation context for the full reference.
In Pipecat, conversation context is automatic — no event wiring required. As long as your pipeline includes the standard LLM context aggregator (the assistant_aggregator from LLMContextAggregatorPair), Pipecat broadcasts an LLMContextAssistantTurnFrame when each bot turn completes, and AssemblyAISTTService feeds that reply to the model as agent_context automatically. Just use a U3 Pro family model on pipecat-ai 1.4.0+.
| Parameter | Type | Description |
| --- | --- | --- |
| agent_context | str | Your agent’s most recent spoken reply, up to ~1500 characters. Set it at construction time to seed an opening greeting; subsequent replies are fed automatically. |
| previous_context_n_turns | int | How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover entirely. Construction-time only; server default is 3. |

Seeding the opening greeting

The automatic feed kicks in once your agent completes its first turn. To give the model context for the user’s very first reply (the answer to your greeting), set agent_context at construction time:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        # Seed the opening line; later turns are fed automatically by the aggregator.
        agent_context="Hi! Thanks for calling Acme. What's the email on your account?",
        # previous_context_n_turns=3,  # Default. Set 0 to disable carryover entirely.
    ),
)

Manual control with update_agent_context()

If your pipeline doesn’t use the standard LLM context aggregator, or you want explicit control over what the model sees, push the agent’s reply yourself. This is a live update — no reconnect required:
# Whenever your agent finishes speaking:
await stt.update_agent_context("Your account is past due. Would you like to pay now?")
agent_context, previous_context_n_turns, and update_agent_context() are supported only on the U3 Pro family (universal-3-5-pro, u3-rt-pro). Values are clipped to ~1500 characters and re-seeded automatically on reconnect. Setting previous_context_n_turns=0 disables the automatic feed as well.
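If your agent produces long replies, you may prefer to decide yourself which part survives the ~1500-character clip. The sketch below is a client-side helper with a hypothetical name. It keeps the end of the reply, on the assumption that the closing question matters most for the next user turn. How the service itself clips is not specified here.

```python
AGENT_CONTEXT_LIMIT = 1500  # Approximate limit quoted in this guide


def trim_agent_context(reply: str, limit: int = AGENT_CONTEXT_LIMIT) -> str:
    """Collapse whitespace and keep at most `limit` characters from the end."""
    text = " ".join(reply.split())
    if len(text) <= limit:
        return text
    tail = text[-limit:]
    # If the cut landed mid-word, drop the leading fragment so the
    # context starts on a word boundary.
    first_space = tail.find(" ")
    if first_space != -1 and text[-limit - 1] != " ":
        tail = tail[first_space + 1:]
    return tail
```

Pass the result to stt.update_agent_context(...) or use it as the agent_context seed.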

Voice focus

Voice Focus isolates the primary speaker and suppresses background noise — chatter, keyboard clicks, fan hum, room echo — server-side, before audio reaches the model. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.
| Parameter | Type | Description |
| --- | --- | --- |
| voice_focus | str | "near-field" for headsets, handsets, and other close-talking mics; "far-field" for conference rooms, laptop mics, and other distant capture. |
| voice_focus_threshold | float | Optional. 0.0–1.0; higher values suppress background audio more aggressively. |
Both are construction-time parameters on the U3 Pro family. See Voice Focus for details.
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        voice_focus="far-field",    # "near-field" for close-talking mics
        voice_focus_threshold=0.5,  # Optional: 0.0–1.0, higher = more aggressive
    ),
)

Interruption handling

Barge-in — the user interrupting while the agent is speaking — is handled by Pipecat, and the signals that drive it depend on your turn detection mode.
  • Pipecat mode (vad_force_turn_endpoint=True). Pipecat’s local VAD and the Smart Turn analyzer detect the user starting to speak and interrupt the bot’s TTS. AssemblyAI also emits SpeechStarted events as a backstop.
  • AssemblyAI’s built-in turn detection (vad_force_turn_endpoint=False). The service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame and uses AssemblyAI’s SpeechStarted events for fast barge-in. Set should_interrupt=False (constructor argument) to disable barge-in entirely in this mode.
{"type": "SpeechStarted", "timestamp": 14400, "confidence": 0.79}
On detection, Pipecat stops TTS playback and switches to listening. To reduce false interruptions from short backchannels ("mhm", "yeah", "okay"), keep your VAD threshold aligned with vad_threshold and lean on Pipecat’s Smart Turn analyzer, which evaluates whether speech is a genuine turn rather than a filler.
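Pipecat consumes these events for you, but when debugging false interruptions it can help to log them. The sketch below parses the SpeechStarted message shape shown above with the standard library. The SpeechStarted dataclass and parse_speech_started are hypothetical names, and the field set is assumed from that single example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechStarted:
    timestamp: int
    confidence: float


def parse_speech_started(raw: str) -> Optional[SpeechStarted]:
    """Return a SpeechStarted record, or None for anything else."""
    try:
        message = json.loads(raw)
        if message.get("type") != "SpeechStarted":
            return None
        return SpeechStarted(int(message["timestamp"]), float(message["confidence"]))
    except (AttributeError, KeyError, TypeError, ValueError):
        # Malformed JSON, a non-object payload, or missing fields.
        return None
```

Logging the confidence values alongside your local VAD decisions makes it easier to see whether the two thresholds are aligned.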

Dynamic configuration

Update settings mid-conversation by queueing an STTUpdateSettingsFrame with a settings delta — adapt to the conversation stage as it unfolds. See stt-assemblyai.py for a complete working example.
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Update keyterms during the conversation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            keyterms_prompt=["NewName", "NewCompany"],
        )
    )
)

# Widen the silence window during entity dictation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            min_turn_silence=200,
            max_turn_silence=3000,  # Respected in AssemblyAI's built-in turn detection mode
        )
    )
)
agent_context is the only setting applied live. Changing any other setting via STTUpdateSettingsFrame reconnects the AssemblyAI session to apply it (a brief interruption). To push conversation context without a reconnect, use the dedicated stt.update_agent_context(...) method — see Conversation context.
| Conversation stage | Adjustment |
| --- | --- |
| Caller identification (names, account IDs) | Boost terms with keyterms_prompt |
| Entity dictation (email, phone, address) | Raise max_turn_silence to ~2000–4000 ms, then lower it again afterward |
| After each agent reply | Automatic, or push agent_context via update_agent_context() |
| Faster barge-in | Lower interruption_delay |
For more information, see Updating configuration mid-stream.
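The stage table can be captured as a small lookup that returns the keyword arguments for each settings delta. This is a sketch under assumptions: the stage names and the settings_delta helper are hypothetical, the values 3000 and 400 are illustrative picks, and the restore values are the defaults listed in the parameters reference.

```python
def settings_delta(stage: str, *, pipecat_mode: bool = True, keyterms=None) -> dict:
    """Return keyword arguments for a settings delta at a conversation stage."""
    if stage == "caller_identification":
        return {"keyterms_prompt": list(keyterms or [])}
    if stage == "entity_dictation":
        if pipecat_mode:
            # In Pipecat mode max_turn_silence follows min_turn_silence.
            return {"min_turn_silence": 400}
        return {"max_turn_silence": 3000}
    if stage == "after_dictation":
        # Restore the documented defaults once dictation is over.
        return {"min_turn_silence": 100, "max_turn_silence": 1000}
    if stage == "faster_barge_in":
        return {"interruption_delay": 0}
    raise ValueError(f"unknown stage: {stage}")
```

Each dictionary can be unpacked into AssemblyAISTTService.Settings(**delta) and wrapped in an STTUpdateSettingsFrame. Since every change other than agent_context reconnects the session, it is worth batching related fields into one delta.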

Speaker diarization

Identify different speakers in multi-party conversations.

Basic diarization

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
)
Speaker labels (e.g., "A", "B", "C") are included in final transcripts.

With custom formatting

Format transcripts with speaker labels for LLM context:
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
    speaker_format="<{speaker}>{text}</{speaker}>",
)
Format options:
| Style | Format string |
| --- | --- |
| XML | <{speaker}>{text}</{speaker}> |
| Markdown | **{speaker}**: {text} |
| Bracket | [{speaker}] {text} |
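To preview what each template produces before wiring it into the agent, you can render it locally. The sketch assumes the plugin fills the {speaker} and {text} placeholders in the same way as Python's str.format; the render helper and the FORMATS dictionary are hypothetical.

```python
FORMATS = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "**{speaker}**: {text}",
    "bracket": "[{speaker}] {text}",
}


def render(template: str, speaker: str, text: str) -> str:
    """Fill a speaker_format-style template with a label and transcript."""
    return template.format(speaker=speaker, text=text)


for style, template in FORMATS.items():
    print(style, "->", render(template, "A", "Hello there."))
```

The XML style makes speaker boundaries unambiguous for an LLM, while the bracket style uses fewer characters per turn.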

Running your agent

Development mode (local audio)

python your_agent.py
Speak into your microphone after hearing the greeting.

Production with Daily

For production deployments, use the Daily transport for WebRTC-based real-time audio/video. Your agent joins a Daily room as a participant and handles audio I/O through Daily’s infrastructure.

Telephony with Telnyx

When bridging phone calls through Pipecat (e.g., via Telnyx), the audio is 8 kHz, not 16 kHz. Match the transport sample rates:
transport = TelnyxTransport(
    # ...
    audio_in_sample_rate=8000,
    audio_out_sample_rate=8000,
)

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| universal-3-5-pro not recognized | pipecat-ai older than 1.4.0 | Upgrade: pip install -U "pipecat-ai[assemblyai]" |
| Turn over-segmentation | min_turn_silence too low | Increase from 100 to 200–500 |
| Entities split across turns | max_turn_silence too low (AssemblyAI mode) | Increase max_turn_silence (e.g., 1500–3500); in Pipecat mode, raise min_turn_silence |
| Latency on non-terminal utterances | max_turn_silence too high | Lower max_turn_silence |
| Conversation context has no effect | Non-U3-Pro model, or previous_context_n_turns=0 | Use a U3 Pro family model and leave previous_context_n_turns unset (or > 0) |
| Mid-session setting change drops audio | Reconnect on a non-agent_context setting change | Expected: only agent_context updates live; use update_agent_context() for context |
| Mis-heard names, brands, or jargon | No vocabulary hints | Add keyterms_prompt, or supply prompt/agent_context for context |
| Poor accuracy in noisy audio | Background noise or room echo | Enable voice_focus (near-field or far-field) |

Migrating from another STT provider

To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below.

How are you detecting end-of-turn today?

| Today | Recommended on AssemblyAI |
| --- | --- |
| Your STT provider’s own end-of-turn model | AssemblyAI’s built-in turn detection: vad_force_turn_endpoint=False with min_turn_silence=100, max_turn_silence=1000. |
| Silence / VAD only, with your own turn logic | Pipecat mode (vad_force_turn_endpoint=True, default). VAD + Smart Turn decide turns; AssemblyAI returns finals ASAP. |
| You want the framework to own turn-taking | Pipecat mode (default): Pipecat’s Smart Turn analyzer makes the turn decision. |

Which model and settings are you migrating from?

| What you pass today | AssemblyAI equivalent |
| --- | --- |
| Current model (Deepgram, ElevenLabs, etc.) | model="universal-3-5-pro" (recommended flagship) or "u3-rt-pro" |
| Overall accuracy/latency tuning | mode="min_latency" / "balanced" / "max_accuracy": a one-line starting point before fine-tuning |
| Endpointing / silence thresholds | min_turn_silence (speculative end-of-turn) and max_turn_silence (forced end) |
| Custom vocabulary / keywords | keyterms_prompt=[...]; broader domain context → prompt |
| Provider-side conversation context | Automatic: include the LLM context aggregator; seed greetings via agent_context |
| Formatting / punctuation toggles | On by default: formatted transcripts always (format_turns does not apply) |
| Telephony / SIP routing | sample_rate=8000 and encoding="pcm_mulaw" for 8 kHz telephony |
| Client-side noise cancellation | Drop it; use server-side Voice Focus instead |
Migrating a production deployment? Talk to our team.

Speech model comparison

Interested in using a different model?
| Feature | U3 Pro family (universal-3-5-pro, u3-rt-pro) | universal-streaming-english | universal-streaming-multilingual |
Turn Detection Modes
Pipecat mode (VAD + Smart Turn)
AssemblyAI turn detection mode
Turn Detection Parameters
min_turn_silence
max_turn_silence
| end_of_turn_confidence_threshold | ❌ | ✅ (1.0) | ✅ (1.0) |
continuous_partials
interruption_delay
Advanced Features
Keyterms boosting
Custom prompting (beta)
Conversation context (carryover)
Voice Focus
Speaker diarization
Dynamic parameter updates
Language Support
Multilingual code switching
Language detection
Legend:
  • ✅ Fully supported and recommended
  • ❌ Not supported / Not used
The U3 Pro family is recommended for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.
The end_of_turn_confidence_threshold parameter is not used with the U3 Pro family (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.