Optimizing Accuracy and Latency

Mode

Universal-3.5 Pro Streaming workloads sit on a spectrum between two competing goals: returning transcripts as fast as possible, and returning the most accurate transcripts possible. To make this tradeoff explicit, Universal-3.5 Pro supports a mode connection parameter you can set when opening a streaming session.

Mode	Value	When to use
Min latency	`min_latency`	Lowest possible time-to-text. Best when responsiveness matters more than catching every word.
Balanced (default)	`balanced`	A middle ground between latency and accuracy. Best for voice agents and other interactive applications.
Max accuracy	`max_accuracy`	Highest transcription accuracy. Best for note-taking, scribes, and post-call analysis where a small added delay is acceptable.

Set the mode connection parameter when you open the WebSocket.

Python

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "mode": "balanced",  # min_latency | max_accuracy
}

Python SDK

client.connect(
    StreamingParameters(
        sample_rate=16000,
        speech_model="universal-3-5-pro",
        mode="balanced",  # min_latency | max_accuracy
    )
)

JavaScript

const CONNECTION_PARAMS = {
  sample_rate: 16000,
  speech_model: "universal-3-5-pro",
  mode: "balanced", // min_latency | max_accuracy
};

JavaScript SDK

const transcriber = client.streaming.transcriber({
  sampleRate: 16_000,
  speechModel: "universal-3-5-pro",
  mode: "balanced", // min_latency | max_accuracy
});
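If you build the connection yourself rather than through an SDK, it can help to validate the mode value before opening the socket. The following is a minimal sketch, assuming the connection parameters are passed as a URL query string. `BASE_URL` is a hypothetical placeholder and `build_connection_url` is an illustrative helper, not part of any SDK.

```python
from urllib.parse import urlencode

# The three mode values listed in the table above.
VALID_MODES = {"min_latency", "balanced", "max_accuracy"}

# Hypothetical placeholder; substitute the real streaming endpoint.
BASE_URL = "wss://streaming.example.invalid/ws"


def build_connection_url(mode: str = "balanced", sample_rate: int = 16000) -> str:
    """Return a WebSocket URL carrying the connection parameters as a query string."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    params = {
        "sample_rate": sample_rate,
        "speech_model": "universal-3-5-pro",
        "mode": mode,
    }
    # Assumption: parameters are URL-encoded onto the endpoint.
    return f"{BASE_URL}?{urlencode(params)}"
```

Rejecting an unknown mode locally gives a clearer error than a failed handshake.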

Language selection

By default, Universal-3.5 Pro Streaming is multilingual and code-switches natively across all supported languages with no configuration. When you know which languages a session will use, pass the language_codes connection parameter to steer the model toward them and improve language accuracy.

For a known subset, pass the codes you expect (for example, ["en", "es"]). The model still code-switches, but is heavily biased toward the languages you list.
For a monolingual session, pass a single-element list (for example, ["es"]).
For full multilingual, omit language_codes to keep native code-switching.
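The three cases above can be sketched as one helper that either sets language_codes or leaves it out. `with_language_codes` is a hypothetical name used only for illustration. How the list is serialized on the wire is not covered here, so the sketch stops at building the parameter dictionary.

```python
from typing import Optional, Sequence


def with_language_codes(params: dict, languages: Optional[Sequence[str]] = None) -> dict:
    """Return a copy of params with language_codes set, or omitted for full multilingual."""
    # Start from a copy without language_codes so the input is never mutated.
    result = {key: value for key, value in params.items() if key != "language_codes"}
    if languages:
        # Known subset (several codes) or monolingual session (one code).
        result["language_codes"] = list(languages)
    # No languages given: leave the key out to keep native code-switching.
    return result


base = {"sample_rate": 16000, "speech_model": "universal-3-5-pro"}
bilingual = with_language_codes(base, ["en", "es"])
monolingual = with_language_codes(base, ["es"])
multilingual = with_language_codes(base)
```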

See Multilingual transcription for the full feature guide, including supported languages, language detection, and updating language_codes mid-stream.

Turn detection

Universal-3.5 Pro Streaming detects the end of a turn using acoustic and contextual cues rather than silence alone. The mode preset sets the defaults, and every turn detection parameter can be overridden on the connection or updated mid-stream to tune endpointing for your use case.

Raise min_turn_silence when brief pauses end turns too early, for example while a caller dictates a phone number. Raise max_turn_silence when you expect longer pauses within a turn.
Lower vad_threshold when quiet speech is missed. When background noise causes false interruptions, raise vad_threshold, increase interruption_delay, or enable Voice Focus.
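As a rough illustration of the tuning advice above, the sketch below maps an observed symptom to a parameter change. The symptom names, the step sizes, and the starting values in `base` are all assumptions chosen for the example, not documented defaults. interruption_delay and Voice Focus are left out because their value formats are not given here.

```python
def adjust_turn_detection(params: dict, symptom: str) -> dict:
    """Return a copy of params nudged in the direction the guidance suggests."""
    tuned = dict(params)
    if symptom == "turns_end_too_early":
        # Brief pauses end turns too early: raise min_turn_silence.
        tuned["min_turn_silence"] = params["min_turn_silence"] + 200
    elif symptom == "long_pauses_within_turn":
        # Longer pauses are expected inside a turn: raise max_turn_silence.
        tuned["max_turn_silence"] = params["max_turn_silence"] + 500
    elif symptom == "quiet_speech_missed":
        # Quiet speech is missed: lower vad_threshold, never below 0.
        tuned["vad_threshold"] = max(0.0, round(params["vad_threshold"] - 0.1, 2))
    elif symptom == "noise_false_interruptions":
        # Background noise triggers interruptions: raise vad_threshold, never above 1.
        tuned["vad_threshold"] = min(1.0, round(params["vad_threshold"] + 0.1, 2))
    else:
        raise ValueError(f"unknown symptom {symptom!r}")
    return tuned


# Placeholder starting point (assumed values, silence in milliseconds).
base = {"min_turn_silence": 400, "max_turn_silence": 1280, "vad_threshold": 0.4}
```

Change one parameter at a time and retest against recorded audio so the effect of each adjustment stays visible.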

See Turn detection for the full feature guide, including the turn lifecycle, entity capture, and bringing your own turn detection.

Universal Streaming

Universal Streaming uses confidence-based turn detection. The model predicts when speech naturally ends; if confidence exceeds end_of_turn_confidence_threshold and min_turn_silence has passed, the turn ends. Acoustic (silence-based) detection kicks in as a fallback after max_turn_silence.

Parameter	Default	Description
`end_of_turn_confidence_threshold`	`0.4`	Confidence threshold for semantic end-of-turn. Higher values require more confidence before ending a turn; lower values end turns faster.
`min_turn_silence`	`400` ms	Silence required before a semantic end-of-turn fires.
`max_turn_silence`	`1280` ms	Maximum silence before forcing a turn to end via acoustic detection.
`vad_threshold`	`0.4`	Confidence threshold (0 to 1) for classifying audio frames as speech. Increase for noisy environments to reduce false speech detection.
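The decision rule described above can be sketched as a small predicate. This is an illustration of the documented logic, not the service's implementation. `should_end_turn` and its argument names are hypothetical, and the defaults are taken from the table.

```python
def should_end_turn(
    eot_confidence: float,
    silence_ms: int,
    end_of_turn_confidence_threshold: float = 0.4,
    min_turn_silence: int = 400,
    max_turn_silence: int = 1280,
) -> bool:
    """Return True when either the semantic or the acoustic condition is met."""
    # Semantic path: confidence exceeds the threshold and enough silence has passed.
    semantic = (
        eot_confidence > end_of_turn_confidence_threshold
        and silence_ms >= min_turn_silence
    )
    # Acoustic fallback: silence alone ends the turn after max_turn_silence.
    acoustic = silence_ms >= max_turn_silence
    return semantic or acoustic
```

For example, a confident prediction after 500 ms of silence ends the turn, while a low-confidence prediction only ends it once silence reaches 1280 ms.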

Quick-start configurations

Aggressive, for short, rapid back-and-forth (IVR replacements, order confirmations).

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 160,
  max_turn_silence: 400,
};

Balanced, for most conversational voice agents (customer support).

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 400,
  max_turn_silence: 1280,
};

Conservative, for reflective or complex speech (healthcare, sales, legal).

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.7,
  min_turn_silence: 800,
  max_turn_silence: 3600,
};
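For Python clients, the same three presets might be kept in one lookup table with optional per-session overrides. This is a sketch: `PRESETS` and `streaming_config` are illustrative names, and the check that min_turn_silence does not exceed max_turn_silence is a sanity check assumed here, not a documented server rule.

```python
# Values copied from the three quick-start configurations above (silence in ms).
PRESETS = {
    "aggressive": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_turn_silence": 160,
        "max_turn_silence": 400,
    },
    "balanced": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_turn_silence": 400,
        "max_turn_silence": 1280,
    },
    "conservative": {
        "end_of_turn_confidence_threshold": 0.7,
        "min_turn_silence": 800,
        "max_turn_silence": 3600,
    },
}


def streaming_config(preset: str, **overrides) -> dict:
    """Return a copy of the named preset with any overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    config = dict(PRESETS[preset])
    config.update(overrides)
    # Assumed sanity check: the fallback window should not be shorter than the minimum.
    if config["min_turn_silence"] > config["max_turn_silence"]:
        raise ValueError("min_turn_silence must not exceed max_turn_silence")
    return config
```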

Disabling turn detection

If you’re using your own VAD or turn detection model, send a ForceEndpoint event to force a turn boundary:

ws.send(json.dumps({"type": "ForceEndpoint"}))

Or set end_of_turn_confidence_threshold to 1 (acoustic-only fallback) or 0 (silence-only). Setting it to 0 is not recommended unless you have a custom turn detection model running on top, because it forces a turn at every min_turn_silence-length pause and fragments mid-sentence thinking pauses.
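A minimal sketch of wiring the ForceEndpoint event to an external turn detector follows. `force_endpoint` is a hypothetical helper that takes any callable able to send a text frame, such as the `send` method of an open WebSocket, so it can be tried without a live connection.

```python
import json
from typing import Callable


def force_endpoint(send: Callable[[str], None]) -> str:
    """Send a ForceEndpoint event through the given send callable and return the payload."""
    message = json.dumps({"type": "ForceEndpoint"})
    send(message)
    return message


# Stand-in for ws.send: collect outgoing frames in a list.
sent = []
force_endpoint(sent.append)
```

Call it from the point where your own VAD or turn detection model decides the speaker has finished.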
