Universal-3 Pro Streaming on Pipecat
This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming (u3-rt-pro) speech-to-text model into Pipecat voice agents.
Universal-3 Pro Streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.
Universal-3 Pro Streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names, all with sub-300 ms time-to-complete-transcript latency.
Complete working examples are available in the Pipecat repository:
You can run any example directly as long as your API keys are saved in a .env file:
The vad_force_turn_endpoint parameter controls which turn detection mode is
used. It defaults to True (Pipecat mode), which sends a ForceEndpoint
message to AssemblyAI when the local VAD detects silence. Set it to False to
use AssemblyAI’s built-in turn detection instead. Choosing the right mode is
critical for balancing responsiveness and turn accuracy in your voice agent.
Install Pipecat with all required dependencies:
What’s included:
assemblyai: AssemblyAI U3-Pro STT service
openai: OpenAI LLM service (used in the examples)
cartesia: Cartesia TTS service (used in the examples)
The examples use OpenAI and Cartesia, but you can use any LLM or TTS you want
that’s supported by Pipecat. Just swap out the extras in the install command
(e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).
Set your API keys in a .env file:
You can obtain an AssemblyAI API key by signing up here.
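Before starting an agent, it can help to fail fast when a key is missing. The following is a minimal sketch and is not part of Pipecat: the variable names ASSEMBLYAI_API_KEY, OPENAI_API_KEY, and CARTESIA_API_KEY and the missing_keys helper are assumptions made for illustration, so check them against the example you are running.

```python
from typing import Iterable, List, Mapping

# Assumed variable names; confirm them against the example you are running.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_keys(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]
```

After loading your .env file, you could call missing_keys(os.environ) and stop with a clear error message if the returned list is not empty.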
Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI’s U3-Pro model.
Pipecat mode (vad_force_turn_endpoint=True, the default)
When to use: Most voice agent applications requiring responsive interruptions.
How it works:
ForceEndpoint message sent to AssemblyAI on VAD silence detection
max_turn_silence automatically synchronized with min_turn_silence
AssemblyAI’s built-in turn detection (STT mode, vad_force_turn_endpoint=False)
When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the settings. See Configuring turn detection to understand how it works.
How it works:
UserStartedSpeakingFrame / UserStoppedSpeakingFrame
SpeechStarted events for fast barge-in
u3-rt-pro (other models require Pipecat mode)
AssemblyAI’s built-in turn detection uses the STT model’s understanding of speech patterns to determine turn boundaries, rather than relying on local VAD silence detection.
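The difference between the two modes can be summarized in a few lines of plain Python. This is an illustrative sketch only, not Pipecat's implementation: the TurnSettings class and both helper functions are hypothetical, and the {"type": "ForceEndpoint"} JSON shape is an assumption about the wire format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnSettings:
    """Hypothetical container for the turn-related options in this guide."""

    min_turn_silence: int  # milliseconds
    vad_force_turn_endpoint: bool = True  # True = Pipecat mode (the default)
    max_turn_silence: Optional[int] = None  # milliseconds


def effective_max_turn_silence(settings: TurnSettings) -> Optional[int]:
    """Pipecat mode syncs max_turn_silence to min_turn_silence; STT mode respects it."""
    if settings.vad_force_turn_endpoint:
        return settings.min_turn_silence
    return settings.max_turn_silence


def on_vad_silence(settings: TurnSettings) -> Optional[str]:
    """Return the message to send when local VAD detects silence, if any."""
    if settings.vad_force_turn_endpoint:
        # Assumed message shape; Pipecat mode forces the turn to end.
        return json.dumps({"type": "ForceEndpoint"})
    # STT mode: AssemblyAI's built-in turn detection decides when the turn ends.
    return None
```

In this sketch, Pipecat mode ignores any max_turn_silence you supply, while STT mode sends nothing on local VAD silence and leaves the decision to the model.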
Improve recognition of specific words or names:
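The parameter reference in this guide notes that keyterms_prompt cannot be combined with prompt. As a rough sketch of that constraint, you could validate the two options before building your STT settings. The recognition_hints helper below is hypothetical and is not part of Pipecat or AssemblyAI's SDK; only the keyterms_prompt and prompt names come from this guide.

```python
from typing import Any, Dict, Optional, Sequence


def recognition_hints(
    keyterms: Optional[Sequence[str]] = None,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a settings fragment, rejecting the unsupported combination."""
    if keyterms and prompt:
        raise ValueError("keyterms_prompt and prompt cannot be used together")
    hints: Dict[str, Any] = {}
    if keyterms:
        # Terms to boost, such as product or personal names.
        hints["keyterms_prompt"] = list(keyterms)
    if prompt:
        # Free-form transcription instructions (a beta feature).
        hints["prompt"] = prompt
    return hints
```

Calling recognition_hints(keyterms=["AssemblyAI", "Pipecat"]) returns a dictionary holding only keyterms_prompt, and passing both arguments raises a ValueError before any connection is attempted.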
Change configuration mid-conversation without reconnection. See stt-assemblyai.py for a complete working example.
Identify different speakers in multi-party conversations.
Speaker labels (e.g., "A", "B", "C") are included in final transcripts and logged.
Format transcripts with speaker labels for LLM context:
Format options:
For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.
The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro
Streaming).
Milliseconds of silence before ending a turn when the model is confident. Set to
100 for best latency. (Formerly min_end_of_turn_silence_when_confident,
which is deprecated but still supported with a warning.)
Maximum silence before a turn is forced to end. Auto-synced with min_turn_silence in Pipecat mode; respected in AssemblyAI’s built-in turn detection (STT mode).
List of terms to boost recognition for. Cannot be used with prompt.
Enable speaker diarization.
Custom transcription instructions. Cannot be used with keyterms_prompt.
Prompting is currently a beta feature: see
Prompting for more information.
Your AssemblyAI API key.
True for Pipecat mode; False for AssemblyAI’s built-in turn detection (STT
mode).
Template string for formatting speaker labels (e.g., "[{speaker}] {text}").
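To see what a template such as "[{speaker}] {text}" produces, here is a minimal sketch using Python's built-in str.format. It assumes the template is filled with named speaker and text fields; the format_with_speaker helper is hypothetical, and Pipecat's own substitution logic may differ.

```python
def format_with_speaker(template: str, speaker: str, text: str) -> str:
    """Fill a speaker-label template with named `speaker` and `text` fields."""
    return template.format(speaker=speaker, text=text)


# Example: a final transcript from speaker "A".
line = format_with_speaker("[{speaker}] {text}", "A", "Hello there.")
print(line)  # [A] Hello there.
```

Keeping the label inside the text passed to the LLM lets the model tell speakers apart without any change to the rest of the pipeline.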
Speak into your microphone after hearing the greeting.
Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.
Interested in using a different model?
Legend:
u3-rt-pro is the recommended model for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.
The end_of_turn_confidence_threshold parameter is not used with
u3-rt-pro (it won’t affect behavior). For universal-streaming models, Pipecat
automatically sets it to 1.0 in Pipecat mode to disable semantic turn
detection and ensure fast responses. You don’t need to configure this
parameter manually.
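The rule described above can be written as a small decision function. This is a sketch of the behavior as this guide describes it, not Pipecat's source code; the resolve_confidence_threshold helper is hypothetical, and None is used here to mean "not set by Pipecat".

```python
from typing import Optional


def resolve_confidence_threshold(
    speech_model: str, vad_force_turn_endpoint: bool
) -> Optional[float]:
    """Return the threshold applied automatically, or None if none is set."""
    if speech_model == "u3-rt-pro":
        # The parameter is not used with u3-rt-pro.
        return None
    if vad_force_turn_endpoint:
        # Pipecat mode with a universal-streaming model: 1.0 disables
        # semantic turn detection so local VAD drives turn endings.
        return 1.0
    # STT mode with a universal-streaming model: nothing is overridden.
    return None
```

Either way, the value is derived from the model and the mode, which is why you do not need to set it yourself.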