Universal-3 Pro (Streaming)

Set up and configure Universal-3 Pro (Streaming) for real-time streaming transcription.

Universal-3-Pro: Public beta

Universal-3-Pro for streaming is currently in public beta. We are actively scaling infrastructure and refining the model. You can start building and testing with it today, but be aware that behavior may change as we continue to improve the experience.

Universal-3 Pro for streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.

This model is well suited to voice agents, agent assist, and streaming use cases that don’t require partial transcriptions for every single subword. Partials are only produced during periods of silence, with at most one partial per silence period (see Partials behavior for details). Universal-3 Pro streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, phone numbers, email addresses, physical addresses, and names, all with sub-300 ms time-to-complete-transcript latency.

Already using AssemblyAI streaming?

If you’re an existing AssemblyAI streaming user, you can quickly test Universal-3 Pro by switching the speech_model parameter to "u3-rt-pro" in your connection parameters. No other code changes are required — just update the model and start streaming.

Quickstart

Get started with Universal-3 Pro streaming using the code below. This example streams audio from your microphone and prints transcription results in real time — no custom prompt is needed, since Universal-3-Pro automatically applies a default prompt optimized for turn detection.

1. Install the required libraries:

pip install "assemblyai>=1.0.0" pyaudio

2. Create a new file main.py and paste the code below. Replace <YOUR_API_KEY> with your API key.

3. Run with python main.py and speak into your microphone.

import logging
from typing import Type

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)

api_key = "<YOUR_API_KEY>"

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def on_begin(self: Type[StreamingClient], event: BeginEvent):
    print(f"Session started: {event.id}")

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    print(f"{event.transcript} ({event.end_of_turn})")

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    print(
        f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
    )

def on_error(self: Type[StreamingClient], error: StreamingError):
    print(f"Error occurred: {error}")

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=api_key,
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            speech_model="u3-rt-pro",
        )
    )

    try:
        client.stream(
            aai.extras.MicrophoneStream(sample_rate=16000)
        )
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    main()

Start with no prompt

We strongly recommend testing with no prompt first. When you omit the prompt parameter, Universal-3-Pro automatically applies a built-in default prompt optimized for turn detection and streaming accuracy — delivering 88% turn detection accuracy out of the box.

If you do build a custom prompt, start from the default prompt and tweak it for your use case rather than writing one from scratch. See the Prompting Guide (Streaming) if you’d like to build your prompt yourself.

Remember, prompts are primarily instructional, so adding a large amount of context may not make a significant impact on accuracy and could reduce instruction-following coherence. Feel free to layer in additional instructions that you see in the Prompting Guide (Streaming).

Universal-3-Pro also supports keyterms_prompt for boosting specific terms. See Keyterms prompting for details.

Keyterms prompting

Use the keyterms_prompt parameter to boost recognition of specific names, brands, or domain terms. Behind the scenes, keyterms_prompt relies on the default prompt and appends your boosted words to it. Pass an array of terms you want the model to prioritize:

keyterms_prompt=["Keanu Reeves", "AssemblyAI", "Universal-2"],

You can set keyterms_prompt at connection time or update it mid-stream as the conversation progresses. For full details, see Keyterms prompting.
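As a sketch, setting keyterms at connection time mirrors the quickstart’s StreamingParameters; the terms below are placeholders for your own domain vocabulary:

```python
from assemblyai.streaming.v3 import StreamingParameters

# Connection parameters with keyterms boosting enabled.
params = StreamingParameters(
    sample_rate=16000,
    speech_model="u3-rt-pro",
    keyterms_prompt=["Keanu Reeves", "AssemblyAI", "Universal-2"],
)

# Then connect as in the quickstart: client.connect(params)
```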

Prompt and Keyterms Prompt

The prompt and keyterms_prompt parameters cannot be used in the same request; choose one or the other based on your use case. When you use keyterms_prompt, your boosted words are appended to the default prompt automatically.

Default prompt

When no prompt is provided in the connection parameters, the following default prompt is used. If you prefer the out-of-the-box performance but want to slightly tweak or add instructions and context, you can use it as a starting point:

Transcribe verbatim. Rules:
1) Always include punctuation in output.
2) Use period/question mark ONLY for complete sentences.
3) Use comma for mid-sentence pauses.
4) Use no punctuation for incomplete trailing speech.
5) Filler words (um, uh, so, like) indicate speaker will continue.

You can override the default prompt by providing your own prompt value. For detailed guidance on crafting effective prompts for streaming, see the Prompting Guide (Streaming).
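One way to tweak rather than replace the default prompt is to append an extra rule to the text above before passing it as your prompt value. A sketch, where the added rule is purely illustrative and not from the docs:

```python
# The default prompt, verbatim from the section above.
DEFAULT_PROMPT = (
    "Transcribe verbatim. Rules:\n"
    "1) Always include punctuation in output.\n"
    "2) Use period/question mark ONLY for complete sentences.\n"
    "3) Use comma for mid-sentence pauses.\n"
    "4) Use no punctuation for incomplete trailing speech.\n"
    "5) Filler words (um, uh, so, like) indicate speaker will continue."
)

# Illustrative extra rule: layer instructions on top rather than starting over.
custom_prompt = DEFAULT_PROMPT + "\n6) Preserve spoken digits exactly as digits."

# Pass custom_prompt as the prompt value in your connection parameters.
```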

Configuring turn detection

Universal-3-Pro uses a punctuation-based turn detection system controlled by two parameters:

  • min_end_of_turn_silence_when_confident (default: 100 ms): silence duration before a speculative end-of-turn (EOT) check fires.
  • max_turn_silence (default: 1200 ms): maximum silence before a turn is forced to end.

When silence reaches min_end_of_turn_silence_when_confident, the model transcribes the audio and checks for terminal punctuation (. ? !):

  • Terminal punctuation found — the turn ends and is emitted as a final transcript (end_of_turn: true).
  • No terminal punctuation — a partial transcript is emitted (end_of_turn: false) and the turn continues waiting.
    • If silence continues to max_turn_silence, the turn is forced to end as a final transcript (end_of_turn: true) regardless of punctuation.

This differs from Universal-Streaming English and Multilingual, which use a confidence-based end-of-turn system controlled by end_of_turn_confidence_threshold.

Instead, Universal-3-Pro makes turn decisions based on ending punctuation after min_end_of_turn_silence_when_confident has elapsed. Because of this, end_of_turn_confidence_threshold has no impact.

end_of_turn and turn_is_formatted

Because formatting is built into the end-of-turn system in Universal-3-Pro streaming, there is only ever one end-of-turn transcript per turn and it is always formatted. This means end_of_turn and turn_is_formatted always have the same value for Universal-3-Pro streaming. You can reliably use end_of_turn: true to detect a formatted, final end-of-turn transcript.

For example, to configure both parameters:

{
  "speech_model": "u3-rt-pro",
  "min_end_of_turn_silence_when_confident": 100,
  "max_turn_silence": 1200
}
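Because end_of_turn: true always coincides with a formatted transcript on u3-rt-pro, a turn handler only needs that one flag. A minimal, self-contained sketch, where the Turn dataclass is a stand-in for the SDK’s TurnEvent:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    # Stand-in for the TurnEvent fields used here.
    transcript: str
    end_of_turn: bool

def classify(event: Turn) -> str:
    # On u3-rt-pro, end_of_turn == turn_is_formatted, so this single check
    # reliably identifies the formatted, final transcript for the turn.
    return "final" if event.end_of_turn else "partial"
```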

Partials behavior

Partials are Turn events where end_of_turn is false. They are produced whenever min_end_of_turn_silence_when_confident is met, but the ending punctuation doesn’t signal the end of a turn.

There can be multiple partial transcripts per turn, but each period of silence can produce at most one partial. If silence exceeds min_end_of_turn_silence_when_confident, but speech resumes before max_turn_silence, the partial is emitted and the EOT check resets until the next period of silence.

If you’re running eager LLM inference on partial transcripts, we recommend setting min_end_of_turn_silence_when_confident to 100.
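An eager-inference pattern might speculate on each partial and commit only on the final. A hypothetical sketch, where the speculation and commit steps are placeholders for your own LLM calls:

```python
class EagerTurnHandler:
    """Hypothetical helper: speculate on partials, commit on the final."""

    def __init__(self):
        self.speculations = []  # one entry per partial (per silence period)
        self.final = None       # the formatted end-of-turn transcript

    def on_turn(self, transcript: str, end_of_turn: bool) -> None:
        if end_of_turn:
            # Commit: replace any speculative LLM output with the final result.
            self.final = transcript
        else:
            # Feed the partial to eager LLM inference while the turn continues.
            self.speculations.append(transcript)
```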

Entity splitting (accuracy) vs. model latency trade-off

Setting min_end_of_turn_silence_when_confident too low can split entities such as phone numbers and email addresses across turns. We have found that a downstream LLM step can repair split entities in voice agent pipelines, but we recommend testing carefully with your use case.

Formatting and turn detection

Because the model applies punctuation and formatting intelligently, its output pairs well with formatting-based turn detection. For example, the model can distinguish the following based purely on vocal tone:

  • "Pizza." — Statement
  • "Pizza?" — Questioning tone
  • "Pizza---" — Trailing off

The punctuation quality has been excellent when paired with custom turn detection models.
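A custom turn-detection step on top of this output can be as simple as inspecting the terminal character. A rough sketch, where the labels and trailing-dash handling are assumptions for illustration:

```python
def classify_ending(transcript: str) -> str:
    # Map the model's terminal punctuation to a rough intent label.
    t = transcript.rstrip()
    if t.endswith("?"):
        return "question"
    if t.endswith(".") or t.endswith("!"):
        return "statement"
    # No terminal punctuation (or trailing dashes): speaker is trailing off.
    return "trailing off"
```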

From testing, mid-turn emission looks like this — where each line is an additional partial leading up to the final end-of-turn transcript:

"Yeah my credit card number is--"
"One moment---"
"It's 8888-8888-8888-8888." ← end_of_turn: true

Each partial is emitted during a silence period within the turn. The final line with terminal punctuation triggers the end of turn.

Forcing a turn endpoint

You can force the current turn to end immediately by sending a ForceEndpoint message:

{
  "type": "ForceEndpoint"
}

This is useful when your application knows the user has finished speaking based on external signals (e.g., a button press).
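In code, this can be as small as sending that payload over your open streaming connection. A sketch, where ws is assumed to be an already-connected WebSocket object with a send method:

```python
import json

def force_endpoint(ws) -> None:
    # End the current turn immediately, e.g. on a UI button press.
    ws.send(json.dumps({"type": "ForceEndpoint"}))
```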

Updating configuration mid-stream

You can update configuration during an active streaming session using UpdateConfiguration. This applies changes without needing to reconnect. The recommended approach is to dynamically update keyterms_prompt based on the current stage of your voice agent flow — if you expect certain answers or terminology at a specific stage, proactively add those as keyterms so the model recognizes them accurately.

# Replace or establish new set of keyterms
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Universal-3"]}')

# Remove keyterms and reset context biasing
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": []}')

For example, if your voice agent is currently asking for the caller’s name and date of birth, send the expected terms for that stage:

# Caller identification stage
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"]}')

Then, when the conversation moves to a different stage (e.g., medical intake), update with the relevant terms:

# Medical intake stage
websocket.send('{"type": "UpdateConfiguration", "keyterms_prompt": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"]}')

You can also update prompt, max_turn_silence, min_end_of_turn_silence_when_confident, or any combination at the same time:

{
  "type": "UpdateConfiguration",
  "keyterms_prompt": ["account number", "routing number"],
  "max_turn_silence": 5000,
  "min_end_of_turn_silence_when_confident": 200
}

Common reasons to update configuration mid-stream:

  • keyterms_prompt — Dynamically add terms relevant to the current stage of your voice agent flow. This is the most effective way to improve recognition accuracy mid-stream. See Keyterms prompting for details.
  • prompt — Pass updated behavioral or formatting instructions into the STT stream.
  • max_turn_silence — Increase for moments where you’d expect a longer pause, such as when a caller is reading out a credit card number, ID number, or address. Decrease it again afterward to resume snappier turn detection.
  • min_end_of_turn_silence_when_confident — Tune how quickly speculative EOT checks fire. Lower values produce faster partials for eager LLM inference, while higher values reduce entity splitting for utterances with numbers or proper nouns.
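The stage-based keyterms pattern above can be wrapped in a small helper. A sketch, where the stage names and terms are illustrative and the resulting message is sent over your open streaming WebSocket:

```python
import json

# Illustrative stage → keyterms map for a voice agent flow.
STAGE_KEYTERMS = {
    "identification": ["Kelly Byrne-Donoghue", "date of birth", "January", "February"],
    "medical_intake": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"],
}

def keyterms_update(stage: str) -> str:
    # Build the UpdateConfiguration payload for the current stage. An unknown
    # stage yields an empty list, which clears keyterms and resets biasing.
    terms = STAGE_KEYTERMS.get(stage, [])
    return json.dumps({"type": "UpdateConfiguration", "keyterms_prompt": terms})

# On each stage transition: websocket.send(keyterms_update("medical_intake"))
```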