Universal-3 Pro Streaming

Set up and configure Universal-3 Pro Streaming for real-time streaming transcription.

Universal-3 Pro Streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.

This model is well suited to voice agents, agent assist, and other streaming use cases that don’t require a partial transcript for every subword. An early partial is emitted after 750ms of continuous speech, followed by silence-based partials as the speaker pauses (see Partials behavior for details). Universal-3 Pro Streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names, with sub-300ms latency to a complete transcript.

Already using AssemblyAI streaming?

If you’re an existing AssemblyAI streaming user, you can quickly test Universal-3 Pro Streaming by switching the speech_model parameter to "u3-rt-pro" in your connection parameters. No other code changes are required — just update the model and start streaming.

Streaming is billed per session

Universal-3 Pro Streaming is billed on the total duration that your WebSocket connection stays open, not on the amount of audio you send. Always send a Terminate message when you’re done with a stream — sessions that aren’t closed auto-close after 3 hours and are billed for the full duration. See Billing and pricing for details.
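
Because billing follows connection time, it helps to make the Terminate message hard to forget. The following is a minimal sketch, not an official helper: `terminating` is a hypothetical name, and `ws` is assumed to be any object with `send` and `close` methods (for example, a connection from websocket-client).

```python
import json
from contextlib import contextmanager


@contextmanager
def terminating(ws):
    """Yield ws, then always send Terminate and close, even on errors.

    `ws` is assumed to expose send(str) and close(); this helper is
    illustrative and not part of any SDK.
    """
    try:
        yield ws
    finally:
        try:
            # Tell the server the session is over so it stops counting time.
            ws.send(json.dumps({"type": "Terminate"}))
        finally:
            # Closing immediately means you may not see the Termination
            # message; wait briefly before closing if you need it.
            ws.close()
```

Wrapping your streaming loop in `with terminating(ws):` means an exception or early return still ends the session instead of leaving it open until the 3-hour auto-close.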

Quickstart

Get started with Universal-3 Pro Streaming using the code below. This example streams audio from your microphone and prints transcription results in real time — no custom prompt is needed, since Universal-3 Pro automatically applies a default prompt optimized for turn detection.

1. Install the required libraries:

   $ pip install websocket-client pyaudio

2. Create a new file main.py and paste the code below. Replace <YOUR_API_KEY> with your API key.

3. Run with python main.py and speak into your microphone.

import pyaudio
import websocket
import json
import threading
import time
from urllib.parse import urlencode

YOUR_API_KEY = "<YOUR_API_KEY>"

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
}
API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

FRAMES_PER_BUFFER = 800
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()

def on_open(ws):
    print("WebSocket connection opened.")
    def stream_audio():
        global stream
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                print(f"Error streaming audio: {e}")
                break

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = True
    audio_thread.start()

def on_message(ws, message):
    try:
        data = json.loads(message)
        msg_type = data.get("type")

        if msg_type == "Begin":
            print(f"Session began: ID={data.get('id')}")
        elif msg_type == "Turn":
            transcript = data.get("transcript", "")
            end_of_turn = data.get("end_of_turn", False)
            if end_of_turn:
                print(f"\r{' ' * 80}\r{transcript}")
            else:
                print(f"\r{transcript}", end="")
        elif msg_type == "Termination":
            print(f"\nSession terminated: {data.get('audio_duration_seconds', 0)}s of audio")
    except Exception as e:
        print(f"Error handling message: {e}")

def on_error(ws, error):
    print(f"\nWebSocket Error: {error}")
    stop_event.set()

def on_close(ws, close_status_code, close_msg):
    print(f"\nWebSocket Disconnected: Status={close_status_code}")
    global stream, audio
    stop_event.set()
    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
    if audio:
        audio.terminate()

def run():
    global audio, stream, ws_app

    audio = pyaudio.PyAudio()
    stream = audio.open(
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
        channels=CHANNELS,
        format=FORMAT,
        rate=SAMPLE_RATE,
    )
    print("Speak into your microphone. Press Ctrl+C to stop.")

    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nStopping...")
        stop_event.set()
        if ws_app and ws_app.sock and ws_app.sock.connected:
            ws_app.send(json.dumps({"type": "Terminate"}))
            time.sleep(2)
        if ws_app:
            ws_app.close()
        ws_thread.join(timeout=2.0)

if __name__ == "__main__":
    run()

Prompting

Universal-3 Pro supports custom prompts and keyterms prompting to improve transcription accuracy for your use case. For detailed guidance on crafting effective prompts, default prompt behavior, and keyterms prompting, see the Prompting Guide (Streaming).

You can also boost recognition of specific terms using the keyterms_prompt parameter. See Keyterms prompting for details.

Configuring turn detection

Universal-3 Pro uses a punctuation-based turn detection system controlled by two parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| min_turn_silence | 100 ms | Silence duration before a speculative end-of-turn (EOT) check fires. |
| max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end. |

When silence reaches min_turn_silence, the model transcribes the audio and checks for terminal punctuation (. ? !):

  • Terminal punctuation found — the turn ends and is emitted as a final transcript (end_of_turn: true).
  • No terminal punctuation — a partial transcript is emitted (end_of_turn: false) and the turn continues waiting.
    • If silence continues to max_turn_silence, the turn is forced to end as a final transcript (end_of_turn: true) regardless of punctuation.
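
The decision rule above can be sketched as a small pure function. This is a simplified client-side model for reasoning about the behavior, not the server's implementation; `turn_decision` and its return labels are hypothetical names, and the defaults are taken from the table above.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(silence_ms, transcript,
                  min_turn_silence=100, max_turn_silence=1000):
    """Return 'wait', 'partial', or 'final' for a given silence duration.

    Simplified model: the real service runs the check once per silence
    period; here we only classify a single (silence, transcript) pair.
    """
    if silence_ms < min_turn_silence:
        # Not enough silence yet for the speculative EOT check.
        return "wait"
    if silence_ms >= max_turn_silence:
        # Turn is forced to end regardless of punctuation.
        return "final"
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        # Terminal punctuation found: end_of_turn would be true.
        return "final"
    # No terminal punctuation: a partial is emitted and the turn continues.
    return "partial"
```

For instance, `turn_decision(150, "My number is")` yields `"partial"`, while the same silence after `"That's all."` yields `"final"`.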

This differs from Universal-Streaming English and Multilingual, which use a confidence-based end-of-turn system controlled by end_of_turn_confidence_threshold.

Instead, Universal-3 Pro makes turn decisions based on ending punctuation after min_turn_silence has elapsed. Because of this, end_of_turn_confidence_threshold has no impact.

end_of_turn and turn_is_formatted

Because formatting is built into the end-of-turn system in Universal-3 Pro Streaming, there is only ever one end-of-turn transcript per turn, and it is always formatted. As a result, end_of_turn and turn_is_formatted always have the same value for this model, and you can rely on end_of_turn: true to detect a formatted, final end-of-turn transcript.

For example, to configure both parameters:

{
  "speech_model": "u3-rt-pro",
  "min_turn_silence": 100,
  "max_turn_silence": 1000
}

Partials behavior

Partials are Turn events where end_of_turn is false. They are produced in three ways:

  • Early partial — emitted after 750ms of continuous speech by default, providing a fast transcript signal for barge-in and speculative inference without waiting for the speaker to pause. You can tune this timing with the interruption_delay parameter (see Tuning early partial timing below). If the first attempt returns empty, it retries at 1500ms, 2250ms, and so on. Only one early partial is emitted per turn, but additional partials can be produced when the speaker pauses.
  • Silence-based partials — produced whenever min_turn_silence is met, but the ending punctuation doesn’t signal the end of a turn. Each period of silence can produce at most one partial.
  • Continuous partials — emitted approximately every 3 seconds while speech continues, regardless of silence. Each continuous partial covers the full transcript for the current turn so far. Enable with the continuous_partials connection parameter.

There can be multiple partial transcripts per turn. If silence exceeds min_turn_silence, but speech resumes before max_turn_silence, the partial is emitted and the EOT check resets until the next period of silence.

If you’re running eager LLM inference on partial transcripts, we recommend setting min_turn_silence to 100.
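
If you act on partials, it can help to separate "speculative" from "committed" handling on the client. The sketch below is one possible shape, assuming only the Turn message fields used in the quickstart (type, transcript, end_of_turn); `TurnTracker` and its callbacks are hypothetical names, not part of any SDK.

```python
import json


class TurnTracker:
    """Route Turn events to partial or final callbacks.

    on_partial: called with each new non-final transcript
                (e.g. to start speculative LLM inference).
    on_final:   called once per turn with the end-of-turn transcript.
    """

    def __init__(self, on_partial, on_final):
        self.on_partial = on_partial
        self.on_final = on_final
        self.latest_partial = ""

    def handle(self, raw_message):
        data = json.loads(raw_message)
        if data.get("type") != "Turn":
            return  # Begin, Termination, etc. are ignored here.
        transcript = data.get("transcript", "")
        if data.get("end_of_turn", False):
            # Final transcript: reset state for the next turn.
            self.latest_partial = ""
            self.on_final(transcript)
        elif transcript and transcript != self.latest_partial:
            # New partial text: skip empty or repeated partials.
            self.latest_partial = transcript
            self.on_partial(transcript)
```

In the quickstart, you could call `tracker.handle(message)` from `on_message` and cancel any speculative work when a newer partial or the final transcript arrives.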

Entity splitting (accuracy) vs Model Latency trade-off

Setting min_turn_silence too low can split entities like phone numbers and emails. We have found LLM steps fix this for voice agents, but we recommend testing carefully with your use case.

Continuous partials

For long, uninterrupted turns — such as a caller reading out a credit card number or giving a detailed explanation — silence-based partials may not fire often enough for your downstream consumers (LLMs, UI, eager inference) to keep up. Enable continuous_partials to receive a steady stream of non-final transcripts every ~3 seconds while speech continues.

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "continuous_partials": True,
}

The first partial is still emitted at 750ms (or your configured interruption_delay). Continuous partials are non-final (end_of_turn: false) and each one covers the full transcript for the current turn so far. The final transcript is emitted as normal when the turn ends.

Tuning early partial timing

The interruption_delay parameter controls how soon the first partial transcript is emitted during a turn, directly affecting your time to first token (TTFT). This is the primary lever for tuning barge-in responsiveness and speculative LLM inference timing.

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| interruption_delay | 500 ms | 0–1000 ms | How soon the first partial is emitted. Lower values produce faster TTFT; higher values are more confident. |

The server adds a minimum turn duration of 250ms on top of your configured value, so the effective timing is:

  • interruption_delay: 0 → ~250ms effective (fastest possible first partial)
  • interruption_delay: 500 → ~750ms effective (default)
  • interruption_delay: 1000 → ~1250ms effective (most confident, slowest TTFT)

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "interruption_delay": 0,
}
API_ENDPOINT = (
    f"wss://streaming.assemblyai.com/v3/ws"
    f"?{urlencode(CONNECTION_PARAMS)}"
)

ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": YOUR_API_KEY},
)
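
The effective-timing arithmetic described above can be captured in a small helper. This is only a sketch: `effective_first_partial_ms` is a hypothetical name, and it takes the server-side minimum turn duration as an argument instead of hard-coding it, so you can plug in the value documented for your deployment.

```python
def effective_first_partial_ms(interruption_delay, server_min_turn_ms):
    """Approximate time from speech start to the first partial, in ms.

    interruption_delay: your configured value (documented range 0-1000 ms).
    server_min_turn_ms: the minimum turn duration the server adds on top.
    """
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + server_min_turn_ms
```

A helper like this is mainly useful for validating configuration before connecting and for budgeting TTFT in latency estimates.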

You can also update interruption_delay mid-session via UpdateConfiguration — for example, lower it when the agent is speaking (for faster barge-in) and raise it when waiting for a user response:

ws.send(json.dumps({
    "type": "UpdateConfiguration",
    "interruption_delay": 200,
}))
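
One way to wire this into an agent loop is to derive the message from whether the agent is currently speaking. The sketch below is illustrative: `interruption_update` is a hypothetical helper, and the two delay values (200 from the example above, 500 as the documented default) are placeholders to tune for your application.

```python
import json


def interruption_update(agent_speaking,
                        speaking_delay_ms=200,
                        listening_delay_ms=500):
    """Build an UpdateConfiguration message for the current agent state.

    A lower delay while the agent speaks favors fast barge-in; a higher
    delay while listening favors more confident first partials.
    """
    delay = speaking_delay_ms if agent_speaking else listening_delay_ms
    return json.dumps({
        "type": "UpdateConfiguration",
        "interruption_delay": delay,
    })
```

You would then call `ws.send(interruption_update(True))` when text-to-speech playback starts and `ws.send(interruption_update(False))` when it ends.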

When to adjust interruption_delay:

  • Lower values (0–200ms) — Use when TTFT is critical and you want the earliest possible signal for speculative LLM inference or barge-in detection. The first partial may be less complete since less audio has been buffered.
  • Default (500ms) — Balanced for most voice agent use cases. The first partial arrives with enough audio context to be useful without excessive delay.
  • Higher values (500–1000ms) — Use when you prefer fewer, more confident partials and don’t need aggressive barge-in responsiveness. Reduces unnecessary early partials in scenarios where users tend to speak in longer turns.

See Updating configuration mid-stream for more on dynamic mid-session adjustment.

Formatting and turn detection

Because the model punctuates according to how something is said, its output pairs well with punctuation-based turn detection. For example, the same word can be transcribed differently based purely on vocal tone:

  • "Pizza." — Statement
  • "Pizza?" — Questioning tone
  • "Pizza---" — Trailing off

The punctuation quality has been excellent when paired with custom turn detection models.

From testing, mid-turn emission looks like this — where each line is an additional partial leading up to the final end-of-turn transcript:

"Yeah my credit card number is--"
"One moment---"
"Its 8888-8888-8888-8888" ← end_of_turn: true

Each partial is emitted during a silence period within the turn. The final line with terminal punctuation triggers the end of turn.

Forcing a turn endpoint

You can force the current turn to end immediately by sending a ForceEndpoint message:

{
  "type": "ForceEndpoint"
}

This is useful when your application knows the user has finished speaking based on external signals (e.g., a button press).
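
As an illustration of the button-press case, a push-to-talk control could send ForceEndpoint when the button is released. `PushToTalk` is a hypothetical class, and `send` is assumed to be any callable that writes a text message to the open WebSocket (for example, `ws.send`).

```python
import json


class PushToTalk:
    """Send ForceEndpoint when the user releases the talk button."""

    def __init__(self, send):
        self.send = send      # callable that sends a text frame
        self.active = False   # True while the button is held down

    def press(self):
        self.active = True

    def release(self):
        # Only force an endpoint if a press was actually in progress.
        if self.active:
            self.active = False
            self.send(json.dumps({"type": "ForceEndpoint"}))
```

Guarding on `active` avoids sending a stray ForceEndpoint if the release event fires without a matching press.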

Specifying the transcription language

Universal-3 Pro Streaming does not support the language_code connection parameter — it is silently ignored. The language_detection parameter only controls whether language metadata (such as language_code and language_confidence) is returned on Turn events; it does not affect which language the model transcribes.

To guide the transcription language, use the prompt parameter as described below.

Providing language information ahead of time in the prompt helps the model with transcription tasks. For example, if the model is told to transcribe Spanish, audio could be transcribed “si”, but if told English, it could be transcribed “C”.

Although prompting is a beta feature, we’ve found good results when you build on the default prompt. So far, prepending language information as Transcribe <language>. to the default prompt has improved the output. Our team is still running evaluations to determine the best way to attach this context, and we will update this section as results come in. For example:

Transcribe Spanish. Transcribe verbatim with standard punctuation. Include filler words and incomplete utterances.

If you expect multiple languages, list all of them, for example: Transcribe multilingual conversation in English, Spanish, and German.
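
The prompt patterns above can be generated with a small helper. This is a sketch under assumptions: `language_prompt` is a hypothetical function, `DEFAULT_PROMPT` is copied from the example above rather than from an authoritative source (check the Prompting Guide for the current default), and keeping the default prompt after the multilingual sentence is also an assumption.

```python
# Assumed default prompt, taken from the example in this section.
DEFAULT_PROMPT = (
    "Transcribe verbatim with standard punctuation. "
    "Include filler words and incomplete utterances."
)


def language_prompt(languages, base_prompt=DEFAULT_PROMPT):
    """Prepend language information to a base prompt."""
    if not languages:
        return base_prompt
    if len(languages) == 1:
        prefix = f"Transcribe {languages[0]}."
    else:
        if len(languages) == 2:
            joined = " and ".join(languages)
        else:
            # Oxford-comma list: "English, Spanish, and German"
            joined = ", ".join(languages[:-1]) + f", and {languages[-1]}"
        prefix = f"Transcribe multilingual conversation in {joined}."
    return f"{prefix} {base_prompt}"
```

Calling `language_prompt(["Spanish"])` reproduces the single-language example shown above; pass the result as the prompt connection parameter.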

Supported languages and regional dialects

Universal-3 Pro Streaming supports 6 languages with out-of-the-box recognition of regional dialects and local speech variants. See the Supported languages page for the full language list and dialect reference.

Updating configuration mid-stream

You can update configuration during an active streaming session using UpdateConfiguration. This applies changes without needing to reconnect. The recommended approach is to dynamically update keyterms_prompt based on the current stage of your voice agent flow — if you expect certain answers or terminology at a specific stage, proactively add those as keyterms so the model recognizes them accurately.

# ws is your open WebSocket connection
# Replace or establish a new set of keyterms
ws.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Universal-3"]}')

# Remove keyterms and reset context biasing
ws.send('{"type": "UpdateConfiguration", "keyterms_prompt": []}')

For example, if your voice agent is currently asking for the caller’s name and date of birth, send the expected terms for that stage:

# Caller identification stage
ws.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"]}')

Then, when the conversation moves to a different stage (e.g., medical intake), update with the relevant terms:

# Medical intake stage
ws.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]}')

You can also update prompt, max_turn_silence, min_turn_silence, interruption_delay, or any combination at the same time:

{
  "type": "UpdateConfiguration",
  "keyterms_prompt": ["account number", "routing number"],
  "max_turn_silence": 5000,
  "min_turn_silence": 200
}

Common reasons to update configuration mid-stream:

  • keyterms_prompt — Dynamically add terms relevant to the current stage of your voice agent flow. This is the most effective way to improve recognition accuracy mid-stream. See Keyterms prompting for details.
  • prompt — Pass updated behavioral or formatting instructions into the STT stream.
  • max_turn_silence — Increase for moments where you’d expect a longer pause, such as when a caller is reading out a credit card number, ID number, or address. Decrease it again afterward to resume snappier turn detection.
  • min_turn_silence — Tune how quickly speculative EOT checks fire. Lower values produce faster partials for eager LLM inference, while higher values reduce entity splitting for utterances with numbers or proper nouns.
  • interruption_delay — Tune how quickly the first partial is emitted. Lower values (e.g. 0) produce faster TTFT for aggressive barge-in detection; higher values (e.g. 500–1000) produce more confident first partials. See Tuning early partial timing for details.
  • continuous_partials — Toggle steady-cadence partial emission on or off mid-session. Useful when switching between interaction modes where you need more frequent feedback for some turns but not others.

ws.send('{"type": "UpdateConfiguration", "continuous_partials": true}')
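
To keep stage-specific keyterms in one place, you could map each stage of your agent flow to its terms and build the message from that map. The sketch below is illustrative only: `STAGE_KEYTERMS`, the stage names, and `keyterms_update` are hypothetical, and the terms are the ones used in the examples above.

```python
import json

# Hypothetical stage names mapped to the keyterms expected at that stage.
STAGE_KEYTERMS = {
    "caller_identification": [
        "Kelly Byrne-Donoghue", "date of birth", "January", "February",
    ],
    "medical_intake": [
        "cardiology", "echocardiogram", "Dr. Patel", "metoprolol",
    ],
}


def keyterms_update(stage):
    """Build an UpdateConfiguration message for the given stage.

    Unknown stages produce an empty list, which clears the keyterms.
    """
    terms = STAGE_KEYTERMS.get(stage, [])
    return json.dumps({
        "type": "UpdateConfiguration",
        "keyterms_prompt": terms,
    })
```

Send the result whenever your agent transitions between stages, for example `ws.send(keyterms_update("medical_intake"))`.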

Keep alive

KeepAlive messages are not required. By default, sessions remain open until explicitly terminated or until the 3-hour maximum session duration is reached.

KeepAlive is only relevant if you have configured the inactivity_timeout connection parameter, which closes the session after a period of no audio or messages being sent. If you are using inactivity_timeout and want to keep the session open during periods where no audio is being sent, send a KeepAlive message to reset the inactivity timer:

{ "type": "KeepAlive" }
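
If you do use inactivity_timeout, a background timer is one way to send KeepAlive messages during quiet periods. This is a minimal sketch: `start_keepalive` is a hypothetical helper, `send` is assumed to be a callable that writes a text message to the open WebSocket, and the interval is yours to choose (it should be comfortably shorter than your configured timeout).

```python
import json
import threading


def start_keepalive(send, interval_seconds, stop_event):
    """Send a KeepAlive message every interval_seconds until stopped.

    Returns the background thread so the caller can join it.
    """
    def loop():
        # Event.wait returns False on timeout and True once the event
        # is set, so the loop exits promptly when stop_event is set.
        while not stop_event.wait(interval_seconds):
            send(json.dumps({"type": "KeepAlive"}))

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Set `stop_event` before sending Terminate so that no KeepAlive is sent on a session you are closing.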