
Universal-3 Pro Streaming

Set up and configure Universal-3 Pro Streaming for real-time streaming transcription.

Universal-3 Pro Streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.

This model is fantastic for voice agents, agent assist, and all streaming use cases that don’t require partial transcriptions for every single subword — partials are only produced during periods of silence, with at most one partial per silence period (see Partials behavior for details). Universal-3 Pro Streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names — all with sub-300ms time to complete transcript latency.

Already using AssemblyAI streaming?

If you’re an existing AssemblyAI streaming user, you can quickly test Universal-3 Pro Streaming by switching the speech_model parameter to "u3-rt-pro" in your connection parameters. No other code changes are required — just update the model and start streaming.
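As a minimal sketch (assuming you connect with query parameters against the v3 streaming endpoint, as in the quickstart below), the switch looks like this:

```python
from urllib.parse import urlencode

params = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",  # switch your existing model value to u3-rt-pro
}
endpoint = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"
```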

Quickstart

Get started with Universal-3 Pro Streaming using the code below. This example streams audio from your microphone and prints transcription results in real time — no custom prompt is needed, since Universal-3 Pro automatically applies a default prompt optimized for turn detection.

1. Install the required libraries:

   pip install websocket-client pyaudio

2. Create a new file main.py and paste the code below. Replace <YOUR_API_KEY> with your API key.

3. Run with python main.py and speak into your microphone.

import pyaudio
import websocket
import json
import threading
import time
from urllib.parse import urlencode

YOUR_API_KEY = "<YOUR_API_KEY>"

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
}
API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

FRAMES_PER_BUFFER = 800
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()

def on_open(ws):
    print("WebSocket connection opened.")

    def stream_audio():
        global stream
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                print(f"Error streaming audio: {e}")
                break

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = True
    audio_thread.start()

def on_message(ws, message):
    try:
        data = json.loads(message)
        msg_type = data.get("type")

        if msg_type == "Begin":
            print(f"Session began: ID={data.get('id')}")
        elif msg_type == "Turn":
            transcript = data.get("transcript", "")
            end_of_turn = data.get("end_of_turn", False)
            if end_of_turn:
                print(f"\r{' ' * 80}\r{transcript}")
            else:
                print(f"\r{transcript}", end="")
        elif msg_type == "Termination":
            print(f"\nSession terminated: {data.get('audio_duration_seconds', 0)}s of audio")
    except Exception as e:
        print(f"Error handling message: {e}")

def on_error(ws, error):
    print(f"\nWebSocket Error: {error}")
    stop_event.set()

def on_close(ws, close_status_code, close_msg):
    print(f"\nWebSocket Disconnected: Status={close_status_code}")
    global stream, audio
    stop_event.set()
    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
    if audio:
        audio.terminate()

def run():
    global audio, stream, ws_app

    audio = pyaudio.PyAudio()
    stream = audio.open(
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
        channels=CHANNELS,
        format=FORMAT,
        rate=SAMPLE_RATE,
    )
    print("Speak into your microphone. Press Ctrl+C to stop.")

    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nStopping...")
        stop_event.set()
        if ws_app and ws_app.sock and ws_app.sock.connected:
            ws_app.send(json.dumps({"type": "Terminate"}))
            time.sleep(2)
        if ws_app:
            ws_app.close()
        ws_thread.join(timeout=2.0)

if __name__ == "__main__":
    run()

Prompting

Universal-3 Pro supports custom prompts and keyterms prompting to improve transcription accuracy for your use case. For detailed guidance on crafting effective prompts, default prompt behavior, and keyterms prompting, see the Prompting Guide (Streaming).

You can also boost recognition of specific terms using the keyterms_prompt parameter. See Keyterms prompting for details.

Configuring turn detection

Universal-3 Pro uses a punctuation-based turn detection system controlled by two parameters:

Parameter        | Default | Description
-----------------|---------|------------
min_turn_silence | 100 ms  | Silence duration before a speculative end-of-turn (EOT) check fires.
max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end.

When silence reaches min_turn_silence, the model transcribes the audio and checks for terminal punctuation (. ? !):

  • Terminal punctuation found — the turn ends and is emitted as a final transcript (end_of_turn: true).
  • No terminal punctuation — a partial transcript is emitted (end_of_turn: false) and the turn continues waiting.
    • If silence continues to max_turn_silence, the turn is forced to end as a final transcript (end_of_turn: true) regardless of punctuation.
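The decision flow above can be sketched as a small function. This is an illustrative simplification of the server-side behavior, not the actual implementation:

```python
def eot_decision(silence_ms, transcript, min_turn_silence=100, max_turn_silence=1000):
    """Simplified model of the punctuation-based end-of-turn check."""
    if silence_ms < min_turn_silence:
        return None  # not enough silence yet; no event is emitted
    if transcript.rstrip().endswith((".", "?", "!")):
        return "final"  # terminal punctuation found: end_of_turn: true
    if silence_ms >= max_turn_silence:
        return "final"  # forced end of turn regardless of punctuation
    return "partial"    # end_of_turn: false; the turn continues waiting
```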

This differs from Universal-Streaming English and Multilingual, which use a confidence-based end-of-turn system controlled by end_of_turn_confidence_threshold.

Instead, Universal-3 Pro makes turn decisions based on ending punctuation after min_turn_silence has elapsed. Because of this, end_of_turn_confidence_threshold has no impact.

end_of_turn and turn_is_formatted

Because formatting is built into the end-of-turn system in Universal-3 Pro streaming, there is only ever one end-of-turn transcript per turn and it is always formatted. This means end_of_turn and turn_is_formatted always have the same value for Universal-3 Pro streaming. You can reliably use end_of_turn: true to detect a formatted, final end-of-turn transcript.
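A handler can therefore key off end_of_turn alone. A minimal sketch, assuming messages arrive as JSON strings in the v3 event shape:

```python
import json

def final_transcript(message: str):
    """Return the formatted final transcript for an end-of-turn Turn event,
    or None for partials and other event types."""
    data = json.loads(message)
    if data.get("type") == "Turn" and data.get("end_of_turn"):
        return data.get("transcript", "")
    return None
```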

For example, to configure both turn detection parameters:

{
  "speech_model": "u3-rt-pro",
  "min_turn_silence": 100,
  "max_turn_silence": 1000
}

Partials behavior

Partials are Turn events where end_of_turn is false. They are produced whenever min_turn_silence is met, but the ending punctuation doesn’t signal the end of a turn.

There can be multiple partial transcripts per turn, but each period of silence can produce at most one partial. If silence exceeds min_turn_silence, but speech resumes before max_turn_silence, the partial is emitted and the EOT check resets until the next period of silence.

If you’re running eager LLM inference on partial transcripts, we recommend setting min_turn_silence to 100.
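One way to wire this up is to dispatch on end_of_turn inside your message handler. The two callbacks below are hypothetical stand-ins for your own LLM calls:

```python
import json

events = []  # collected (stage, text) pairs; stand-ins for real LLM calls

def start_eager_inference(text):
    # Hypothetical: kick off a speculative LLM call on the partial.
    events.append(("eager", text))

def commit_response(text):
    # Hypothetical: commit the agent's reply on the final transcript.
    events.append(("final", text))

def on_turn_event(message: str):
    data = json.loads(message)
    if data.get("type") != "Turn":
        return
    if data.get("end_of_turn"):
        commit_response(data.get("transcript", ""))
    else:
        # At most one partial per silence period, so speculative calls stay bounded.
        start_eager_inference(data.get("transcript", ""))
```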

Entity splitting (accuracy) vs. model latency trade-off

Setting min_turn_silence too low can split entities like phone numbers and emails. We have found LLM steps fix this for voice agents, but we recommend testing carefully with your use case.

Formatting and turn detection

Because the model applies punctuation and formatting intelligently, it pairs well with formatting-based turn detection. For example, the model can punctuate the same word differently based purely on vocal tone:

  • "Pizza." — Statement
  • "Pizza?" — Questioning tone
  • "Pizza---" — Trailing off

The punctuation quality has been excellent when paired with custom turn detection models.

From testing, mid-turn emission looks like this — where each line is an additional partial leading up to the final end-of-turn transcript:

"Yeah my credit card number is--"
"One moment---"
"Its 8888-8888-8888-8888" ← end_of_turn: true

Each partial is emitted during a silence period within the turn. The final line with terminal punctuation triggers the end of turn.

Forcing a turn endpoint

You can force the current turn to end immediately by sending a ForceEndpoint message:

{
  "type": "ForceEndpoint"
}

This is useful when your application knows the user has finished speaking based on external signals (e.g., a button press).
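Using websocket-client as in the quickstart, a small helper can send this message; ws here is any connected object with a send() method, such as the quickstart's WebSocketApp:

```python
import json

def force_endpoint(ws):
    """Force the current turn to end immediately (e.g., on a button press)."""
    ws.send(json.dumps({"type": "ForceEndpoint"}))
```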

Specifying the transcription language

Universal-3 Pro Streaming does not support the language_code connection parameter — it is silently ignored. The language_detection parameter only controls whether language metadata (such as language_code and language_confidence) is returned on Turn events; it does not affect which language the model transcribes.

To guide the transcription language, use the prompt parameter as described below.

Providing language information ahead of time in the prompt helps the model resolve ambiguous audio. For example, the same sound could be transcribed as “sí” if the model is told to expect Spanish, but as “C” if told to expect English.

Although prompting is a beta feature, we've found good results when you build on the default prompt rather than replacing it. Our team is running evaluations to determine the best method for attaching language context to the prompt, and we will update this section as we learn more. So far, prepending language information as Transcribe <language>. to the default prompt improves the output:

Transcribe Spanish. Transcribe verbatim. Rules:
Always include punctuation in output.
Use period/question mark ONLY for complete sentences.
Use comma for mid-sentence pauses.
Use no punctuation for incomplete trailing speech.
Filler words (um, uh, so, like) indicate speaker will continue.

If the audio contains multiple languages, list all of them in the prefix, e.g. Transcribe multilingual conversation in English, Spanish, and German.
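A small helper can build this prefix and prepend it to the default prompt text shown above (the rules string below is copied from that default; the function name is our own):

```python
# The rules portion of the default prompt, as shown above.
DEFAULT_RULES = (
    "Transcribe verbatim. Rules:\n"
    "Always include punctuation in output.\n"
    "Use period/question mark ONLY for complete sentences.\n"
    "Use comma for mid-sentence pauses.\n"
    "Use no punctuation for incomplete trailing speech.\n"
    "Filler words (um, uh, so, like) indicate speaker will continue."
)

def language_prompt(languages):
    """Prepend language information to the default prompt rules."""
    if len(languages) == 1:
        prefix = f"Transcribe {languages[0]}."
    else:
        joined = ", ".join(languages[:-1]) + f", and {languages[-1]}"
        prefix = f"Transcribe multilingual conversation in {joined}."
    return f"{prefix} {DEFAULT_RULES}"
```

Pass the result as the prompt connection parameter when opening the stream.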

Supported languages and regional dialects

Universal-3 Pro Streaming supports 6 languages with out-of-the-box recognition of regional dialects and local speech variants. See the Supported languages page for the full language list and dialect reference.

Updating configuration mid-stream

You can update configuration during an active streaming session using UpdateConfiguration. This applies changes without needing to reconnect. The recommended approach is to dynamically update keyterms_prompt based on the current stage of your voice agent flow — if you expect certain answers or terminology at a specific stage, proactively add those as keyterms so the model recognizes them accurately.

# Replace or establish new set of keyterms
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Universal-3"]}')

# Remove keyterms and reset context biasing
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": []}')

For example, if your voice agent is currently asking for the caller’s name and date of birth, send the expected terms for that stage:

# Caller identification stage
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"]}')

Then, when the conversation moves to a different stage (e.g., medical intake), update with the relevant terms:

# Medical intake stage
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]}')

You can also update prompt, max_turn_silence, min_turn_silence, or any combination at the same time:

{
  "type": "UpdateConfiguration",
  "keyterms_prompt": ["account number", "routing number"],
  "max_turn_silence": 5000,
  "min_turn_silence": 200
}

Common reasons to update configuration mid-stream:

  • keyterms_prompt — Dynamically add terms relevant to the current stage of your voice agent flow. This is the most effective way to improve recognition accuracy mid-stream. See Keyterms prompting for details.
  • prompt — Pass updated behavioral or formatting instructions into the STT stream.
  • max_turn_silence — Increase for moments where you’d expect a longer pause, such as when a caller is reading out a credit card number, ID number, or address. Decrease it again afterward to resume snappier turn detection.
  • min_turn_silence — Tune how quickly speculative EOT checks fire. Lower values produce faster partials for eager LLM inference, while higher values reduce entity splitting for utterances with numbers or proper nouns.
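A small helper can wrap these updates; ws is any connected object with a send() method, such as the quickstart's WebSocketApp, and the helper name is our own:

```python
import json

def update_configuration(ws, **changes):
    """Send an UpdateConfiguration message with any combination of
    updatable fields (keyterms_prompt, prompt, min_turn_silence,
    max_turn_silence)."""
    ws.send(json.dumps({"type": "UpdateConfiguration", **changes}))
```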