Universal-3 Pro Streaming on LiveKit

Overview

This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming speech-to-text model into a LiveKit voice agent using the Agents framework.

When not explicitly provided, the default endpointing parameters for Universal-3 Pro Streaming differ between the LiveKit plugin and AssemblyAI’s API used directly:

  • LiveKit AssemblyAI plugin defaults:
    • min_turn_silence=100
    • max_turn_silence=100
  • AssemblyAI API defaults:
    • min_turn_silence=100
    • max_turn_silence=1000

However, you can always override these by passing your own preferred values explicitly.

Misconfiguring these parameters is the most common cause of poor performance. Read the Turn detection section below for the recommended values per turn detection mode.

Support for Universal-3 Pro Streaming requires livekit-agents version 1.4.4 or later.

Turn detection

In LiveKit, how your agent detects the end of a user’s turn is controlled by the turn_detection parameter inside TurnHandlingOptions, which is passed to AgentSession via the turn_handling argument.

Universal-3 Pro Streaming uses a punctuation-based turn detection system, which checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score.

This means the min_turn_silence and max_turn_silence parameters you pass to AssemblyAI directly control when transcripts are emitted and when turns end. For more details on how this works, see Configuring turn detection.

Default parameter differences

Universal-3 Pro Streaming’s endpointing is controlled by two AssemblyAI API parameters — min_turn_silence and max_turn_silence — that you pass to the STT plugin. These are separate from LiveKit’s endpointing.min_delay and endpointing.max_delay (set inside TurnHandlingOptions).

| Parameter | AssemblyAI API default | LiveKit plugin default | Description |
| --- | --- | --- | --- |
| min_turn_silence | 100 ms | 100 ms | Silence before a speculative end-of-turn check. If terminal punctuation (. ? !) is found, the turn ends. If not, a partial is emitted and the turn continues. |
| max_turn_silence | 1000 ms | 100 ms | Maximum silence before forcing the turn to end, regardless of punctuation. |

The LiveKit plugin defaults are optimized for third-party turn detection models, where you want transcripts handed off as fast as possible. When using turn_detection="stt", you should explicitly set max_turn_silence=1000 if you’d like to mimic the behavior of streaming directly to the API without LiveKit.

Tuning endpointing parameters

The defaults above apply only when no parameters are explicitly provided, and you will likely need to experiment with different values depending on your use case:

  • Increase min_turn_silence — when brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking.
  • Increase max_turn_silence — when the forced turn end is cutting off users mid-thought or splitting entities like phone numbers across turns, a higher value lets the model wait longer before forcing the turn to end when the model is unsure.

See the Entity splitting tradeoff section for examples.

STT-based turn detection (turn_detection="stt")

With turn_detection="stt", AssemblyAI’s built-in punctuation-based turn detection determines when the user has finished speaking. AssemblyAI’s end_of_turn signals are then used directly by LiveKit to commit the turn.

In this mode, we recommend explicitly setting min_turn_silence=100 and max_turn_silence=1000. These are AssemblyAI’s API defaults and provide a good balance of responsiveness and accuracy.

The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100, which might be too aggressive for STT-based turn detection.

Recommended starting parameters (set on assemblyai.STT(), not on AgentSession):

| Parameter | Default | Description |
| --- | --- | --- |
| min_turn_silence | 100 ms | Silence duration before a speculative end-of-turn (EOT) check fires. |
| max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end. |

How it works:

  1. User speaks → audio streams to AssemblyAI
  2. User pauses for 100ms → AssemblyAI checks for terminal punctuation
  3. If terminal punctuation (. ? !) → turn ends immediately
  4. If no terminal punctuation → partial emitted, turn continues waiting
  5. If silence reaches 1000ms → turn is forced to end regardless of punctuation
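The decision flow above can be sketched in plain Python. This is an illustrative model of the documented behavior, not the plugin's actual implementation; the default arguments match the recommended STT-mode settings:

```python
TERMINAL_PUNCTUATION = (".", "?", "!")

def end_of_turn(transcript: str, silence_ms: int,
                min_turn_silence: int = 100,
                max_turn_silence: int = 1000) -> bool:
    """Illustrative model of punctuation-based endpointing."""
    if silence_ms >= max_turn_silence:
        # Forced end, regardless of punctuation
        return True
    if silence_ms >= min_turn_silence:
        # Speculative check: end the turn only on terminal punctuation;
        # otherwise a partial is emitted and the turn continues
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False  # still within the minimum silence window

print(end_of_turn("What's your order number?", 100))  # True: punctuation found
print(end_of_turn("I wanted to check on", 300))       # False: partial emitted
print(end_of_turn("I wanted to check on", 1000))      # True: forced end
```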

Endpointing min_delay is additive in STT mode

LiveKit’s endpointing min_delay (default 0.5 seconds) is applied on top of AssemblyAI’s own endpointing. In STT mode, this delay starts after the STT end-of-speech signal, meaning it adds up to 500ms of extra latency by default.

Set endpointing={"min_delay": 0} inside TurnHandlingOptions to avoid this. AssemblyAI’s own endpointing parameters (min_turn_silence and max_turn_silence) already control the timing, so an additional delay on the LiveKit side is unnecessary latency.

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},  # Avoid additive delay in STT mode
    ),
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,   # Silence (ms) before a speculative end-of-turn check
        max_turn_silence=1000,  # Max silence (ms) before forcing the turn to end
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
)

LiveKit turn detection (with MultilingualModel())

As a third-party turn detection model, LiveKit’s turn detector runs on top of STT output to make turn decisions. AssemblyAI’s role is then just to provide transcripts as quickly as possible, while the turn detection model decides when the user is actually done speaking.

Use MultilingualModel() rather than EnglishModel(), as Universal-3 Pro Streaming supports English, Spanish, German, French, Portuguese, and Italian. MultilingualModel() covers all of these languages.

The LiveKit plugin defaults of min_turn_silence=100 and max_turn_silence=100 work well here, as max_turn_silence is brought down to match min_turn_silence so that transcripts are handed off to the turn detection model as fast as possible.

MultilingualModel parameters (set inside TurnHandlingOptions’ endpointing, not on the STT plugin):

| Parameter | Default | Description |
| --- | --- | --- |
| endpointing.min_delay | 0.5 s | Time to wait before committing a turn when the model predicts a likely boundary. |
| endpointing.max_delay | 3.0 s | Maximum time to wait when the model predicts the user will continue speaking. Has no effect without a turn detector model. |

How it works:

  1. User speaks → audio streams to AssemblyAI
  2. User pauses for 100ms → AssemblyAI immediately emits the transcript (partial and final are the same at these settings)
  3. LiveKit’s MultilingualModel() evaluates the transcript in conversational context
  4. If the model predicts a likely turn boundary → waits min_delay (0.5s) then commits the turn
  5. If the model predicts the user will continue → waits up to max_delay (3.0s) for more speech

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
        endpointing={
            "min_delay": 0.5,  # Time (s) to wait before committing a turn when the model is confident
            "max_delay": 3.0,  # Max time (s) to wait when the model is not confident
        },
    ),
    stt=assemblyai.STT(
        model="u3-rt-pro",
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
)

Other turn detection modes

  • vad:

    • Detect end of turn from speech and silence data alone using Silero VAD.
    • Turn boundaries are determined purely by voice activity without semantic context.
    • AssemblyAI’s turn detection parameters still control when transcripts are emitted, but it is recommended to leave them at the plugin defaults (min_turn_silence=100, max_turn_silence=100) so transcripts arrive as quickly as possible.
  • manual:

    • Disable automatic turn detection entirely.
    • You control turns explicitly using session.commit_user_turn(), session.clear_user_turn(), and session.interrupt().
    • See the manual turn control docs for details.

Entity splitting tradeoff

Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.

min_turn_silence too low

  • Speculative check fires too early, splitting entities on punctuation.

  • Example: User spells out an email address with brief pauses between parts. The speculative check fires at 100ms of silence, and the model adds terminal punctuation to each segment, ending the turn prematurely.

# With (min_turn_silence=100, max_turn_silence=1000)
"It's John." → FINAL (100ms pause, check fires, period found → turn ends)
"Smith." → FINAL
"At gmail.com." → FINAL
# With (min_turn_silence=400, max_turn_silence=1000)
"It's john.smith@gmail.com." → FINAL (single turn, properly formatted)

max_turn_silence too low

  • Forced turn-end cuts off user mid-thought.

  • Example: User pauses longer than 1 second to think mid-sentence. The forced end fires at 1000ms, splitting the utterance into two turns regardless of punctuation.

# With (min_turn_silence=100, max_turn_silence=1000)
"I wanted to check on my order from—" → FINAL (1000ms silence, forced end)
"last Tuesday, order number 4829." → FINAL (new turn)
# With (min_turn_silence=100, max_turn_silence=2000)
"I wanted to check on my order from last Tuesday, order number 4829." → FINAL (single turn)

Universal-3 Pro Streaming’s formatting is significantly better when it has full context in a single turn — email addresses, phone numbers, credit card numbers, and physical addresses all benefit from this.

LLMs downstream can usually piece together split entities, but if your use case involves alphanumeric dictation or entity extraction, consider increasing min_turn_silence and max_turn_silence during those portions of the conversation. You can update configuration mid-stream to raise max_turn_silence temporarily (e.g., to 2000–4000 ms) when expecting entity input, then lower it again afterward.

Even when using third-party turn detection, you may want to increase min_turn_silence or max_turn_silence if users are likely to speak slowly or dictate entities. While this adds latency, it improves accuracy by giving the model more audio context before emitting a transcript and keeping the full entity complete within the same turn.

VAD configuration

With turn_detection="stt", AssemblyAI also sends SpeechStarted events that LiveKit uses for barge-in/interruption handling. SpeechStarted is only emitted when the model produces a transcript.

Silero VAD is not strictly required in this mode, but it is still recommended as Silero runs locally and it can be faster than waiting for AssemblyAI’s SpeechStarted signal. LiveKit respects whichever signal arrives first, so Silero provides faster interruption while AssemblyAI’s signal serves as a reliable backup.

With MultilingualModel(), Silero VAD is required, as it is the only source of START_OF_SPEECH events for interruption in this mode. AssemblyAI’s SpeechStarted event is not used.

Threshold alignment

LiveKit’s Silero VAD defaults to an activation_threshold of 0.5. AssemblyAI’s vad_threshold defaults to 0.3. For best performance, we recommend setting both to 0.3.

Both should be adjusted together to the same value to ensure accurate transcription and consistent barge-in thresholds.

When the thresholds are mismatched, you get a dead zone: if Silero is at 0.5 and AssemblyAI is at 0.3, AssemblyAI will be actively transcribing speech that LiveKit hasn’t detected yet, delaying interruption. Keeping them aligned eliminates this.
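The dead zone can be illustrated with a toy probability check (a sketch of the described interaction, not either library's actual VAD internals):

```python
def detectors_active(speech_prob: float,
                     assemblyai_threshold: float = 0.3,
                     silero_threshold: float = 0.5) -> tuple[bool, bool]:
    """Return (assemblyai_detects, silero_detects) for a given speech probability."""
    return speech_prob >= assemblyai_threshold, speech_prob >= silero_threshold

# Mismatched thresholds: probability 0.4 lands in the dead zone --
# AssemblyAI is transcribing, but LiveKit's Silero VAD hasn't fired yet.
print(detectors_active(0.4))                        # (True, False)
# Aligned at 0.3: both detectors agree.
print(detectors_active(0.4, silero_threshold=0.3))  # (True, True)
```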

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        vad_threshold=0.3,  # AssemblyAI's internal VAD onset
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,  # Match AssemblyAI's threshold
    ),
)

If you’re in a noisy environment and receiving false speech triggers, raise both stt.vad_threshold and vad.activation_threshold together.

Interruption handling

In voice agent conversations, users often produce backchannel utterances — “mhm”, “yeah”, “um”, “okay” — while the agent is speaking. These short fillers can trigger LiveKit’s interruption logic, causing the agent to stop mid-sentence even though the user didn’t intend to interrupt.

Two complementary filters address this problem. When combined, they provide strong guardrails for interruption and barge-in handling. A working reference implementation with both filters is available on GitHub.

Upstream backchannel filter

The backchannel filter intercepts STT events at the stt_node level — before they reach LiveKit’s audio_recognition. It checks each transcript event for known disfluencies and backchannels (“um”, “mhm”, “yeah”, “okay”, etc.) and drops the event entirely. Because the event never enters the pipeline, none of the downstream orchestration — interrupt gates, end-of-turn detection, and preemptive LLM generation — ever reacts to it.

The filter is implemented as a mixin class that wraps Agent.stt_node:

from __future__ import annotations

import logging
import string
import time
from collections.abc import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, stt
from livekit.agents.voice import ModelSettings


# "yes" / "no" deliberately omitted — in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_TRANSCRIPT_TYPES = {
    stt.SpeechEventType.INTERIM_TRANSCRIPT,
    stt.SpeechEventType.PREFLIGHT_TRANSCRIPT,
    stt.SpeechEventType.FINAL_TRANSCRIPT,
}

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)

log = logging.getLogger("backchannel_stt_filter")


def _is_all_backchannel(text: str) -> bool:
    """Return True only when every token is a known backchannel."""
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


class BackchannelSTTFilterMixin:
    """Drop backchannel-only transcripts while the agent is speaking."""

    _FILTER_GRACE_S: float = 1.0
    _last_speaking_at: float = 0.0

    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ):
        async for ev in Agent.default.stt_node(self, audio, model_settings):
            if self._should_drop(ev):
                text = ev.alternatives[0].text if ev.alternatives else ""
                log.info(
                    "event_filtered transcript=%r ev_type=%s agent_state=%s",
                    text, ev.type, self.session.agent_state,
                )
                continue
            yield ev

    def _should_drop(self, ev: stt.SpeechEvent) -> bool:
        now = time.monotonic()
        if self.session.agent_state == "speaking":
            self._last_speaking_at = now
        elif now - self._last_speaking_at > self._FILTER_GRACE_S:
            return False

        if ev.type not in _TRANSCRIPT_TYPES:
            return False

        text = ev.alternatives[0].text if ev.alternatives else ""
        return _is_all_backchannel(text)

How it works:

  1. While the agent is speaking (plus a 1-second grace window after speech ends), the filter inspects each STT transcript event
  2. It strips punctuation and checks whether every token in the transcript matches the BACKCHANNELS set
  3. Pure-filler transcripts like “mhm” or “yeah okay” are dropped — they never reach LiveKit’s pipeline
  4. Utterances with any non-filler token (e.g., “yeah I want the suite”) always pass through

The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case.
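As a standalone illustration of the token-level check (the same logic as `_is_all_backchannel` above, repackaged so you can experiment with domain customizations without the LiveKit dependencies):

```python
import string

BACKCHANNELS = frozenset({
    "mhm", "uh", "um", "hmm", "yeah", "yep", "okay", "ok", "right", "gotcha",
})

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)

def is_all_backchannel(text: str, backchannels: frozenset = BACKCHANNELS) -> bool:
    """True only when every token is a known backchannel."""
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in backchannels for tok in tokens)

print(is_all_backchannel("Mhm, yeah okay."))        # True  -> dropped
print(is_all_backchannel("yeah I want the suite"))  # False -> passes through
print(is_all_backchannel("yes"))                    # False -> "yes" is not in the set
```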

Short-utterance buffer clearing

LiveKit accumulates committed FINAL transcripts in a private buffer (_audio_transcript) across the user’s uncommitted turn. The _interrupt_by_audio_activity method checks the running word count against min_words to decide whether to pause TTS.

Without intervention, two consecutive short fillers like “yeah” + “um” sum to two words and trip the interrupt gate — even though each utterance on its own is below threshold.
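A minimal model of the word-count gate shows why accumulation matters (illustrative only, not LiveKit's internals):

```python
def would_interrupt(buffer: str, min_words: int = 2) -> bool:
    """Model of the interrupt gate: fire once the accumulated words reach min_words."""
    return len(buffer.split()) >= min_words

# Without clearing, two short fillers accumulate past the gate:
buffer = ""
for utterance in ["yeah", "um"]:
    buffer = (buffer + " " + utterance).strip()
print(would_interrupt(buffer))  # True: "yeah um" totals 2 words

# With per-utterance clearing, each filler is judged on its own:
print(any(would_interrupt(u) for u in ["yeah", "um"]))  # False
```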

This filter listens on the user_input_transcribed event and wipes the buffer whenever the agent is speaking and the user input falls below the configured min_words threshold. Each short utterance is evaluated independently rather than against the accumulated total.

This filter requires interruption.min_words to be set to 2 or higher. Without it, the word-count gate is disabled and the filter has no effect.

from __future__ import annotations

import logging

from livekit.agents import AgentSession


log = logging.getLogger("short_utterance_buffer_filter")


def install_short_utterance_filter(session: AgentSession) -> None:
    """Clear transcript buffers when a short utterance arrives during agent speech."""

    @session.on("user_input_transcribed")
    def _on_user_input_transcribed(ev) -> None:
        word_count = len(ev.transcript.split())
        min_words = session.options.interruption["min_words"]

        if session.agent_state != "speaking":
            return

        if word_count >= min_words:
            return

        activity = getattr(session, "_activity", None)
        recognition = getattr(activity, "_audio_recognition", None) if activity else None
        if recognition is None:
            return

        # Wipe all three transcript buffers so short utterances
        # don't accumulate past the interrupt threshold.
        recognition._audio_transcript = ""
        recognition._audio_interim_transcript = ""
        recognition._audio_preflight_transcript = ""

        # Best-effort: abort any in-flight preemptive LLM call
        # triggered by this short utterance.
        cancel = getattr(activity, "_cancel_preemptive_generation", None)
        if callable(cancel):
            try:
                cancel()
            except Exception:
                log.debug("_cancel_preemptive_generation failed", exc_info=True)

        log.info(
            "buffer_cleared transcript=%r words=%d is_final=%s",
            ev.transcript, word_count, ev.is_final,
        )

This filter accesses LiveKit private APIs (_audio_transcript, _audio_interim_transcript, _audio_preflight_transcript, _cancel_preemptive_generation). Pin your livekit-agents version to avoid breakage on minor updates. Tested with livekit-agents>=1.5.

Wiring both filters into your agent

The two filters have non-overlapping failure modes: the backchannel filter stops known fillers before they enter the pipeline, while the buffer-clearing filter catches any short utterance (including unknown fillers or stutters) that slips past. Running both provides the strongest coverage.

Three changes are needed:

1. Mix in the backchannel filter on your agent class. Place it before Agent in the class bases so its stt_node runs first:

from livekit.agents import Agent

from filters.backchannel_stt import BackchannelSTTFilterMixin


class MyAgent(BackchannelSTTFilterMixin, Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

2. Configure the session with interruption.min_words >= 2. This enables the word-count gate that the buffer-clearing filter depends on:

from livekit.agents import AgentSession
from livekit.plugins import assemblyai

session = AgentSession(
    stt=assemblyai.STT(
        model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    # llm=your_llm_plugin(),
    # tts=your_tts_plugin(),
    vad=None,  # Recommended: disable VAD so only STT drives interruption
    turn_handling={
        "turn_detection": "stt",
        "endpointing": {"min_delay": 1.0, "max_delay": 4.0},
        "interruption": {
            "enabled": True,
            "resume_false_interruption": True,
            "false_interruption_timeout": 1.5,
            "min_words": 2,  # Required for the buffer-clearing filter
        },
    },
)

3. Install the buffer-clearing filter on the session, then start it with your agent:

from filters.short_utterance_buffer import install_short_utterance_filter

install_short_utterance_filter(session)

await session.start(room=ctx.room, agent=MyAgent())

Setting vad=None with turn_detection="stt" is the recommended setup for both filters. This ensures only STT-based signals drive interruption, avoiding timing races from a competing VAD interrupt path. If you need VAD for faster barge-in, both filters still work — set interruption.min_words to 2 and ensure Silero’s activation_threshold matches vad_threshold.

Expected behavior

The table below shows how utterances are handled when both filters are active with min_words=2:

| User utterance (during agent speech) | Backchannel filter | Buffer-clearing filter | Result |
| --- | --- | --- | --- |
| "mhm" | Dropped | n/a | Agent continues |
| "um" | Dropped | n/a | Agent continues |
| "yeah yeah" | Dropped | n/a | Agent continues |
| Unknown short word | Passes through | Buffer cleared | Agent continues |
| "yeah I'd like the suite" | Passes through | Passes through | Agent interrupts |
| "suite please" | Passes through | Passes through | Agent interrupts |
| "mhm" (1+ second after agent stops) | Passes through | n/a | Agent responds |
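A toy simulation of this combined behavior (assuming an abbreviated BACKCHANNELS set and min_words=2; this models the decision logic, not LiveKit's pipeline):

```python
BACKCHANNELS = {"mhm", "um", "yeah", "okay"}
MIN_WORDS = 2

def handle_during_agent_speech(utterance: str) -> str:
    """Return the outcome for an utterance arriving while the agent is speaking."""
    tokens = utterance.lower().replace("'", "").split()
    # Filter 1: known fillers are dropped before entering the pipeline
    if tokens and all(t.strip(".,!?") in BACKCHANNELS for t in tokens):
        return "agent continues (dropped by backchannel filter)"
    # Filter 2: short unknown utterances clear the buffer instead of interrupting
    if len(tokens) < MIN_WORDS:
        return "agent continues (buffer cleared)"
    return "agent interrupts"

print(handle_during_agent_speech("mhm"))                      # continues
print(handle_during_agent_speech("hrm"))                      # continues (unknown short word)
print(handle_during_agent_speech("yeah I'd like the suite"))  # interrupts
```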

Prompt engineering

Beta feature

Prompting is considered a beta feature for Universal-3 Pro Streaming.

While it can be a powerful tool for customizing transcription output or improving accuracy in certain use cases, we recommend starting without a prompt to first establish baseline performance.

Once the default prompt has been tested, you can experiment with custom prompts to further optimize for your use case, for example by specifying the language mix to expect (e.g., English and Hindi) or the domain (e.g., medical, legal).

Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for native (i.e. STT-based) turn detection is used automatically.

stt=assemblyai.STT(
    model="u3-rt-pro",
    prompt="Your custom transcription instructions.",
)

Tips:

  • Start with no prompt: the default prompt delivers strong accuracy out of the box; only add a custom prompt if you need to alter this behavior.
  • Specify the audio context: accent, domain, expected utterance length, etc.
  • Define punctuation rules: these can improve downstream LLM processing.
  • Preserve speech patterns: instruct the model to keep disfluencies and filler words for more natural agent interactions.

Key terms boosting

Instead of prompt, use keyterms_prompt to boost recognition of specific names, brands, or domain terms:

stt=assemblyai.STT(
    model="u3-rt-pro",
    keyterms_prompt=["AssemblyAI", "LiveKit", "Universal-3 Pro"],
)

Updating configuration mid-stream

You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using update_options.

This is useful for dynamically adjusting turn detection behavior, like increasing max_turn_silence when expecting entity dictation, then lowering it again afterward. You can also update keyterms_prompt and prompt mid-stream after a database is loaded or crucial conversational information has been detected. For more information, see update configuration mid-stream.

# Update one or more options mid-stream
stt.update_options(
    max_turn_silence=3000,  # Increase for entity dictation
)

# Later, reset to default
stt.update_options(
    max_turn_silence=1000,
)

Build and run your agent

Installation

Install the plugin and necessary packages (silero, codecs, dotenv) from PyPI:

$pip install "livekit-agents[assemblyai,silero,codecs]~=1.5" \
> python-dotenv

Make sure to install the latest version of livekit-agents from PyPI (support for Universal-3 Pro Streaming was added in livekit-agents@1.4.4). Older versions of the plugin will not recognize the u3-rt-pro model, resulting in a validation error.

If you plan to use LiveKit turn detection with MultilingualModel(), you also need to install the turn detector plugin:

$pip install "livekit-plugins-turn-detector~=1.0"

Noise cancellation can introduce audio artifacts that negatively impact transcription quality. In most cases, the artifacts introduced by noise cancellation cause more harm than the background noise itself, so we recommend not adding any audio pre-processing before it reaches Universal-3 Pro Streaming.

For a complete voice agent, you will also need to install LLM and TTS plugins for your chosen providers. See the LiveKit plugins documentation for available options.

Authentication

Set your API keys in a .env file:

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_key
# Add API keys for your chosen LLM and TTS providers

You can obtain an AssemblyAI API key by signing up here and navigating to the API Keys tab of the dashboard.

The following example uses turn_detection="stt" (recommended).

Pay close attention to the comments for using with MultilingualModel().

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent, TurnHandlingOptions
from livekit.plugins import (
    assemblyai,
    silero,
)
# For MultilingualModel, uncomment the following:
# from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,  # When turn_detection="stt", override plugin default of 100.
            # If using MultilingualModel(), the plugin defaults (min: 100, max: 100) work well;
            # omit min_turn_silence and max_turn_silence above if preferred.
            vad_threshold=0.3,  # Match Silero's activation_threshold
        ),
        # llm=your_llm_plugin(),  # Add your LLM provider here
        # tts=your_tts_plugin(),  # Add your TTS provider here
        vad=silero.VAD.load(
            activation_threshold=0.3,  # Match AssemblyAI's internal VAD threshold
        ),
        turn_handling=TurnHandlingOptions(
            turn_detection="stt",
            # To use LiveKit's turn detection instead, replace the line above with:
            # turn_detection=MultilingualModel(),
            endpointing={"min_delay": 0},  # Avoid additive delay in STT mode
            # If using MultilingualModel(), set these instead:
            # endpointing={"min_delay": 0.5, "max_delay": 3.0},
        ),
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Running your agent

Start in development mode

$python your_agent_file.py dev

Test in the LiveKit Playground

  1. Go to agents-playground.livekit.io
  2. Connect to your LiveKit Cloud project (same credentials as your .env)
  3. Click Connect — a room will be created, your agent will join, and you can start talking

Parameters reference

Universal-3 Pro Streaming parameters

These are the key parameters to tune for LiveKit when using Universal-3 Pro Streaming:

speech_model
string

Set to "u3-rt-pro" for Universal-3 Pro Streaming.

keyterms_prompt
list of strings

List of terms to boost recognition for. Appended to the default prompt automatically.

prompt
string

Custom transcription instructions for the model. When not provided, a default prompt optimized for native turn detection is automatically applied.

Prompting is a beta feature for Universal-3 Pro Streaming. Start with no prompt to establish baseline performance before experimenting with custom prompts.

min_turn_silence
integer · Defaults to 100

Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation to decide whether the turn has ended.

max_turn_silence
integer · Defaults to 100

Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to 100. Set to 1000 when using turn_detection="stt".

vad_threshold
float · Defaults to 0.3

AssemblyAI’s internal Silero VAD threshold. Universal-3 Pro Streaming defaults to 0.3, unlike Universal-Streaming’s 0.4. Align with LiveKit’s Silero activation_threshold for consistent behavior.

language_detection
booleanDefaults to true

Universal-3 Pro Streaming code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages. Defaults to true in the LiveKit plugin, but false when using the API directly.

General STT parameters

These parameters apply to all AssemblyAI streaming models and can remain the same between models:

sample_rate
int · Defaults to 16000

The sample rate of the audio stream.

encoding
str · Defaults to pcm_s16le

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

Legacy parameters

These parameters apply to the universal-streaming-english and universal-streaming-multilingual AssemblyAI streaming models, but do not affect Universal-3 Pro Streaming:

end_of_turn_confidence_threshold
float · Defaults to 0.4

Confidence threshold for end-of-turn detection. Universal-3 Pro Streaming uses punctuation-based turn detection instead.

format_turns
boolean · Defaults to false

Whether to return formatted final transcripts. Universal-3 Pro Streaming always returns formatted transcripts, so this parameter no longer applies.

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Extra latency with turn_detection="stt" | LiveKit’s endpointing min_delay is additive in STT mode | Set endpointing={"min_delay": 0} inside TurnHandlingOptions on AgentSession |
| No interruption handling | Missing VAD | Ensure vad=silero.VAD is set, with activation_threshold equal to vad_threshold (default 0.3) |
| Turn over-segmentation | min_turn_silence too low | Increase from 100 to 200–500 |
| Entities split across turns | max_turn_silence too low | Increase max_turn_silence (e.g., 1500–3500) |
| Latency on non-terminal utterances | max_turn_silence too high | Lower max_turn_silence |

Migration from standard AssemblyAI STT

If you are migrating from the standard AssemblyAI streaming model:

| Change | From | To |
| --- | --- | --- |
| Model | assemblyai.STT() | assemblyai.STT(model="u3-rt-pro") |
| Turn detection | turn_detection="stt" or EnglishModel() | turn_detection="stt" or MultilingualModel() |
| VAD | Optional | Set vad=silero.VAD.load() to match vad_threshold |
| min_turn_silence | 400 (old default) | 100 (new default) |
| max_turn_silence | 1280 (old default) | 1000 (API default) or 100 (with 3rd-party turn detector) |
| end_of_turn_confidence_threshold | Configurable | Not applicable; Universal-3 Pro Streaming uses punctuation-based turn detection |
| endpointing.min_delay (formerly min_endpointing_delay) | Default 0.5 | Set to 0 inside TurnHandlingOptions when using turn_detection="stt" |