Universal-3 Pro Streaming API

Overview

This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS — no LiveKit, Pipecat, or other orchestrator in the loop.

Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages — this guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.

If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead — semantic interruption handling is built in there.

Quickstart

A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:

```python
import json
import websocket
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT = (
    "wss://streaming.assemblyai.com/v3/ws?" + urlencode(CONNECTION_PARAMS)
)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Begin":
        print(f"Session started: {data.get('id')}")

    elif msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)
        if end_of_turn:
            # Final transcript — send to your LLM
            print(f"Final: {transcript}")
        else:
            # Partial — optionally start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif msg_type == "SpeechStarted":
        # User started speaking — interrupt the agent's TTS if it's playing
        print("Speech detected — interrupt agent if speaking")

    elif msg_type == "Termination":
        print("Session ended")


ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)
ws.run_forever()
```

For the full message protocol — including all event fields, audio framing, and termination — see the Universal-3 Pro message sequence reference.
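The quickstart above only consumes events; the send side streams raw audio to the same socket as binary WebSocket frames. A minimal sketch of that half, assuming 16 kHz, 16-bit mono PCM in roughly 50 ms chunks and a JSON `Terminate` message to end the session (verify the exact framing against the message sequence reference; `stream_audio` and `pcm_chunks` are illustrative names, not part of the API):

```python
import json

# websocket-client's binary frame opcode (websocket.ABNF.OPCODE_BINARY)
OPCODE_BINARY = 0x2

SAMPLE_RATE = 16000
CHUNK_MS = 50
# 16-bit mono PCM: 2 bytes per sample
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000


def stream_audio(ws, pcm_chunks):
    """Send raw PCM16 chunks as binary frames, then request termination."""
    for chunk in pcm_chunks:
        ws.send(chunk, OPCODE_BINARY)
    ws.send(json.dumps({"type": "Terminate"}))
```

In practice you would call `stream_audio` from the `on_open` callback of the `WebSocketApp` in the quickstart, feeding it chunks from your microphone or telephony source.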

Turn detection

Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `min_turn_silence` | 100 ms | Silence before a speculative end-of-turn check fires. |
| `max_turn_silence` | 1000 ms | Maximum silence before forcing the turn to end. |

Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.

Interruption handling

While the agent is speaking, users often produce backchannel utterances — “mhm”, “yeah”, “um”, “okay” — that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.

The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.

```python
import json
import string
import time
import websocket
from urllib.parse import urlencode


# "yes" / "no" deliberately omitted — in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2         # Utterances below this are treated as filler
FILTER_GRACE_S = 1.0  # Keep filtering for 1s after agent stops speaking


# These flags are owned by your TTS layer.
agent_speaking = False
last_speaking_at = 0.0


def _is_all_backchannel(text: str) -> bool:
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def _should_suppress_interrupt(text: str) -> bool:
    global last_speaking_at
    now = time.monotonic()
    if agent_speaking:
        last_speaking_at = now
    elif now - last_speaking_at > FILTER_GRACE_S:
        return False

    word_count = len(text.split())
    return word_count < MIN_WORDS or _is_all_backchannel(text)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if _should_suppress_interrupt(transcript):
            # Backchannel during agent speech — drop it.
            return

        if end_of_turn:
            handle_user_turn(transcript)  # send to LLM
        else:
            handle_partial(transcript)

    elif msg_type == "SpeechStarted":
        if agent_speaking:
            # Don't interrupt yet — wait for the Turn event,
            # which is gated by _should_suppress_interrupt above.
            return
        # Otherwise: normal barge-in path.
```

How it works:

  1. While the agent is speaking (plus a 1-second grace window after speech ends), each Turn event is checked.
  2. _should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
  3. Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
  4. SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly — this prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.

The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case. MIN_WORDS = 2 is a reasonable default — raise it if you see frequent two-word filler (“uh okay”, “yeah right”) slipping through.
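To sanity-check your thresholds, the two suppression conditions can be exercised directly on sample utterances. A standalone sketch (with a trimmed BACKCHANNELS set and an illustrative is_filler helper that omits the agent-speaking state):

```python
import string

BACKCHANNELS = frozenset({"mhm", "uh", "um", "yeah", "okay", "ok", "right"})
_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2


def _is_all_backchannel(text: str) -> bool:
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def is_filler(text: str) -> bool:
    # The two conditions from _should_suppress_interrupt, minus the
    # agent_speaking / grace-window state.
    return len(text.split()) < MIN_WORDS or _is_all_backchannel(text)


print(is_filler("Mhm."))                     # True — single backchannel
print(is_filler("yeah, okay"))               # True — every token is filler
print(is_filler("yeah I'd like the suite"))  # False — real content passes
```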

If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling — see LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.

Three presets covering most voice-agent use cases:

```python
# Fast — quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced — most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient — entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}
```
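Applying a preset is the same urlencode step as in the quickstart. A sketch using the balanced preset plus the required sample_rate:

```python
from urllib.parse import urlencode

SAMPLE_RATE = 16000

balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Merge the session-level sample_rate with the chosen preset.
API_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws?" + urlencode(
    {"sample_rate": SAMPLE_RATE, **balanced_params}
)
print(API_ENDPOINT)
```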

For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.