Universal-3 Pro Streaming API

Overview

This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS — no LiveKit, Pipecat, or other orchestrator in the loop.

Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages — this guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.

If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead — semantic interruption handling is built in there.

Quickstart

A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:

```python
import json
import websocket
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT = (
    "wss://streaming.assemblyai.com/v3/ws?" + urlencode(CONNECTION_PARAMS)
)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Begin":
        print(f"Session started: {data.get('id')}")

    elif msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)
        if end_of_turn:
            # Final transcript — send to your LLM
            print(f"Final: {transcript}")
        else:
            # Partial — optionally start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif msg_type == "SpeechStarted":
        # User started speaking — interrupt the agent's TTS if it's playing
        print("Speech detected — interrupt agent if speaking")

    elif msg_type == "Termination":
        print("Session ended")


ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)
ws.run_forever()
```

For the full message protocol — including all event fields, audio framing, and termination — see the Universal-3 Pro message sequence reference.
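The quickstart above only consumes events; the send side streams raw audio to the same socket as binary WebSocket frames. A minimal sketch of that half, assuming 16 kHz, 16-bit mono PCM in roughly 50 ms chunks and a JSON `Terminate` message to end the session (verify the exact framing against the message sequence reference; `stream_audio` and `pcm_chunks` are illustrative names, not part of the API):

```python
import json

# websocket-client's binary frame opcode (websocket.ABNF.OPCODE_BINARY)
OPCODE_BINARY = 0x2

SAMPLE_RATE = 16000
CHUNK_MS = 50
# 16-bit mono PCM: 2 bytes per sample
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000


def stream_audio(ws, pcm_chunks):
    """Send raw PCM16 chunks as binary frames, then request termination."""
    for chunk in pcm_chunks:
        ws.send(chunk, OPCODE_BINARY)
    ws.send(json.dumps({"type": "Terminate"}))
```

In practice you would call `stream_audio` from the `on_open` callback of the `WebSocketApp` in the quickstart, feeding it chunks from your microphone or telephony source.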

Turn detection

Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `min_turn_silence` | 100 ms | Silence before a speculative end-of-turn check fires. |
| `max_turn_silence` | 1000 ms | Maximum silence before forcing the turn to end. |

Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.

Interruption handling

While the agent is speaking, users often produce backchannel utterances — “mhm”, “yeah”, “um”, “okay” — that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.

The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.

```python
import json
import string
import time
import websocket
from urllib.parse import urlencode


# "yes" / "no" deliberately omitted — in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2         # Utterances below this are treated as filler
FILTER_GRACE_S = 1.0  # Keep filtering for 1s after agent stops speaking


# These flags are owned by your TTS layer.
agent_speaking = False
last_speaking_at = 0.0


def _is_all_backchannel(text: str) -> bool:
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def _should_suppress_interrupt(text: str) -> bool:
    global last_speaking_at
    now = time.monotonic()
    if agent_speaking:
        last_speaking_at = now
    elif now - last_speaking_at > FILTER_GRACE_S:
        return False

    word_count = len(text.split())
    return word_count < MIN_WORDS or _is_all_backchannel(text)


def on_message(ws, message):
    data = json.loads(message)
    msg_type = data.get("type")

    if msg_type == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if _should_suppress_interrupt(transcript):
            # Backchannel during agent speech — drop it.
            return

        if end_of_turn:
            handle_user_turn(transcript)  # send to LLM
        else:
            handle_partial(transcript)

    elif msg_type == "SpeechStarted":
        if agent_speaking:
            # Don't interrupt yet — wait for the Turn event,
            # which is gated by _should_suppress_interrupt above.
            return
        # Otherwise: normal barge-in path.
```

How it works:

  1. While the agent is speaking (plus a 1-second grace window after speech ends), each Turn event is checked.
  2. _should_suppress_interrupt returns True when the transcript has fewer than MIN_WORDS tokens or when every token is a known backchannel. Either condition drops the event.
  3. Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
  4. SpeechStarted is gated through the same Turn-level check rather than firing barge-in directly — this prevents a race where a backchannel triggers SpeechStarted before the gating logic sees the transcript.

The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case. MIN_WORDS = 2 is a reasonable default — raise it if you see frequent two-word filler (“uh okay”, “yeah right”) slipping through.
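To sanity-check your thresholds, the two suppression conditions can be exercised directly on sample utterances. A standalone sketch (with a trimmed BACKCHANNELS set and an illustrative is_filler helper that omits the agent-speaking state):

```python
import string

BACKCHANNELS = frozenset({"mhm", "uh", "um", "yeah", "okay", "ok", "right"})
_PUNCT_STRIP = str.maketrans("", "", string.punctuation)
MIN_WORDS = 2


def _is_all_backchannel(text: str) -> bool:
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def is_filler(text: str) -> bool:
    # The two conditions from _should_suppress_interrupt, minus the
    # agent_speaking / grace-window state.
    return len(text.split()) < MIN_WORDS or _is_all_backchannel(text)


print(is_filler("Mhm."))                     # True — single backchannel
print(is_filler("yeah, okay"))               # True — every token is filler
print(is_filler("yeah I'd like the suite"))  # False — real content passes
```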

If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling — see LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.

Three presets covering most voice-agent use cases:

```python
# Fast — quick confirmations, IVR, yes/no questions
fast_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 800,
}

# Balanced — most voice agent conversations (recommended)
balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Patient — entity dictation, complex instructions
patient_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 200,
    "max_turn_silence": 2000,
}
```
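Applying a preset is the same urlencode step as in the quickstart. A sketch using the balanced preset plus the required sample_rate:

```python
from urllib.parse import urlencode

SAMPLE_RATE = 16000

balanced_params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

# Merge the session-level sample_rate with the chosen preset.
API_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws?" + urlencode(
    {"sample_rate": SAMPLE_RATE, **balanced_params}
)
print(API_ENDPOINT)
```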

For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.