Universal-3 Pro Streaming API

Overview

This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS, with no LiveKit, Pipecat, or other orchestrator in the loop.

Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages. This guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.

If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead. Semantic interruption handling is built in there.

Quickstart

A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:

```python
import json
import websocket
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT = (
    "wss://streaming.assemblyai.com/v3/ws?" + urlencode(CONNECTION_PARAMS)
)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Begin":
        print(f"Session started: {data.get('id')}")

    elif msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)
        if end_of_turn:
            # Final transcript - send to your LLM
            print(f"Final: {transcript}")
        else:
            # Partial - optionally start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif msg_type == "SpeechStarted":
        # User started speaking - interrupt the agent's TTS if it's playing
        print("Speech detected, interrupt agent if speaking")

    elif msg_type == "Termination":
        print("Session ended")


ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)
ws.run_forever()
```

For the full message protocol, including all event fields, audio framing, and termination, see the Universal-3 Pro message sequence reference.
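The quickstart only receives events; it never sends audio. As a minimal sketch of the sending side, the helper below slices raw PCM into fixed-duration frames. The 16-bit mono format, the 50 ms frame size, and the name pcm_frames are assumptions made for illustration, so check the message sequence reference for the framing the service actually expects.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # Assumption: 16-bit mono PCM


def pcm_frames(
    pcm: bytes, sample_rate: int = 16000, frame_ms: int = 50
) -> Iterator[bytes]:
    """Yield consecutive frames of roughly frame_ms milliseconds each.

    The last frame may be shorter if the buffer does not divide evenly.
    """
    frame_bytes = (sample_rate * frame_ms // 1000) * BYTES_PER_SAMPLE
    if frame_bytes <= 0:
        raise ValueError("sample_rate and frame_ms must give a non-empty frame")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each frame could then be sent from an on_open callback with ws.send(frame, opcode=websocket.ABNF.OPCODE_BINARY), assuming the endpoint takes audio as binary WebSocket messages.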

Turn detection

Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| min_turn_silence | 100 ms | Silence before a speculative end-of-turn check fires. |
| max_turn_silence | 1000 ms | Maximum silence before forcing the turn to end. |

Lower values end turns sooner and return final transcripts faster, at the cost of occasionally splitting an entity (for example, a number dictated with pauses) across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
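One way to picture how the two parameters interact is the toy decision function below. It is a mental model only, not the service's implementation: the rule that the speculative check looks for terminal punctuation, the TERMINAL_PUNCTUATION tuple, and the function name turn_should_end are all assumptions drawn from the "punctuation-based" description above.

```python
# Assumption: which characters count as "terminal" is a guess for illustration.
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_should_end(
    transcript: str,
    silence_ms: int,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
) -> bool:
    """Toy model of a punctuation-based end-of-turn decision."""
    if silence_ms >= max_turn_silence:
        # Hard limit reached: the turn ends regardless of punctuation.
        return True
    if silence_ms >= min_turn_silence:
        # Speculative check: end only if the text looks finished.
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False
```

Under this model, an unfinished phrase such as "My number is 555" survives short pauses and only ends once the silence reaches max_turn_silence, which is why a lower maximum makes split entities more likely.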

Interruption handling

While the agent is speaking, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.

The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is shorter than a minimum word count or if every token is a known backchannel. Stop filtering once the agent has finished speaking (the example below allows a 1-second grace window).

```python
import json
import string
import time

# "yes" / "no" deliberately omitted - in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2  # Utterances below this are treated as filler
FILTER_GRACE_S = 1.0  # Keep filtering for 1s after agent stops speaking


# These flags are owned by your TTS layer. Set agent_speaking while audio is
# playing, and stamp last_speaking_at with time.monotonic() when playback stops.
agent_speaking = False
last_speaking_at = float("-inf")  # Agent has not spoken yet, so no grace window


def _is_all_backchannel(text: str) -> bool:
    # Lowercase and strip punctuation so "Uh-huh." matches "uhhuh".
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def _should_suppress_interrupt(text: str) -> bool:
    global last_speaking_at
    now = time.monotonic()
    if agent_speaking:
        last_speaking_at = now
    elif now - last_speaking_at > FILTER_GRACE_S:
        # Agent has been silent past the grace window - never suppress.
        return False

    word_count = len(text.split())
    return word_count < MIN_WORDS or _is_all_backchannel(text)


def handle_user_turn(transcript: str) -> None:
    """Placeholder: stop TTS playback if needed, then send to your LLM."""


def handle_partial(transcript: str) -> None:
    """Placeholder: e.g. stop TTS playback or start pre-emptive generation."""


# Pass on_message to websocket.WebSocketApp as in the quickstart.
def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if _should_suppress_interrupt(transcript):
            # Backchannel during agent speech - drop it.
            return

        if end_of_turn:
            handle_user_turn(transcript)  # send to LLM
        else:
            handle_partial(transcript)

    elif msg_type == "SpeechStarted":
        if agent_speaking:
            # Don't interrupt yet - wait for the Turn event,
            # which is gated by _should_suppress_interrupt above.
            return
        # Otherwise: normal barge-in path.
```

How it works:

  1. While the agent is speaking (plus a 1-second grace window after speech ends), each Turn event is checked.
  2. _should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
  3. Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
  4. SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly. This prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.

The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case. MIN_WORDS = 2 is a reasonable default. Raise it, or extend the set, if you see frequent two-word filler that the set does not fully cover (“I see”, “got it”) slipping through. Pairs made entirely of listed tokens (“uh okay”, “yeah right”) are already dropped by the backchannel check.
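To see the filter's behavior on concrete transcripts, here is a self-contained sketch of the same logic with the state held in an object and the clock passed in, which makes it easy to unit test. The class name BargeInFilter, its method names, and the explicit now argument are hypothetical and not part of any AssemblyAI API. Unlike the example above, it counts words after stripping punctuation.

```python
import string

BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm", "uh", "uhhuh", "huh", "um", "umm", "uhm",
    "er", "erm", "hmm", "hm", "ah", "oh", "yeah", "yep", "yup",
    "okay", "ok", "right", "alright", "gotcha",
})
_PUNCT_STRIP = str.maketrans("", "", string.punctuation)


class BargeInFilter:
    """Hypothetical wrapper around the suppression rule described above."""

    def __init__(self, min_words: int = 2, grace_s: float = 1.0) -> None:
        self.min_words = min_words
        self.grace_s = grace_s
        self.agent_speaking = False
        self._last_speaking_at = float("-inf")

    def set_agent_speaking(self, speaking: bool, now: float) -> None:
        # Stamp the moment playback stops so the grace window starts there.
        if self.agent_speaking and not speaking:
            self._last_speaking_at = now
        self.agent_speaking = speaking

    def should_suppress(self, text: str, now: float) -> bool:
        if self.agent_speaking:
            self._last_speaking_at = now
        elif now - self._last_speaking_at > self.grace_s:
            return False  # Outside agent speech and past the grace window
        tokens = text.lower().translate(_PUNCT_STRIP).split()
        too_short = len(tokens) < self.min_words
        return too_short or all(tok in BACKCHANNELS for tok in tokens)


f = BargeInFilter()
f.set_agent_speaking(True, now=0.0)
print(f.should_suppress("mhm", now=1.0))                      # True
print(f.should_suppress("Yeah, okay.", now=1.5))              # True
print(f.should_suppress("yeah I'd like the suite", now=2.0))  # False
f.set_agent_speaking(False, now=5.0)
print(f.should_suppress("okay", now=5.5))                     # True (grace window)
print(f.should_suppress("okay", now=7.0))                     # False
```

Note that with min_words = 2, any one-word utterance during agent speech is dropped, including a bare "stop". If single-word commands matter in your domain, one option is to set min_words to 1 and rely on the backchannel set alone.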

If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling. See LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.

Recommended configuration

Three presets covering most voice-agent use cases:

```python
# Fast - quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced - most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient - entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}
```
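A preset only takes effect once it is encoded into the connection URL. The sketch below shows one way to do that, reusing the endpoint and query-string approach from the quickstart. The PRESETS mapping, the build_endpoint helper, and its sanity check are hypothetical conveniences, not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"

# Same values as the three presets above, keyed by name.
PRESETS = {
    "fast": {"min_turn_silence": 100, "max_turn_silence": 800},
    "balanced": {"min_turn_silence": 100, "max_turn_silence": 1000},
    "patient": {"min_turn_silence": 200, "max_turn_silence": 2000},
}


def build_endpoint(preset: str = "balanced", sample_rate: int = 16000) -> str:
    """Return the WebSocket URL for a named preset (hypothetical helper)."""
    params = dict(PRESETS[preset])  # Copy so the preset itself is not mutated
    if params["min_turn_silence"] > params["max_turn_silence"]:
        raise ValueError("min_turn_silence must not exceed max_turn_silence")
    params["speech_model"] = "u3-rt-pro"
    params["sample_rate"] = sample_rate
    return f"{BASE_URL}?{urlencode(params)}"


print(build_endpoint("patient"))
```

Keeping the presets in one mapping makes it straightforward to switch profiles per call type without touching the connection code.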

For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.

Related

  • Universal-3 Pro Streaming overview
  • Universal-3 Pro message sequence
  • LiveKit integration
  • Pipecat integration
  • Voice agent best practices