Mode

Universal-3.5 Pro Streaming workloads sit on a spectrum between two competing goals: returning transcripts as fast as possible, and returning the most accurate transcripts possible. To make this tradeoff explicit, Universal-3.5 Pro supports a mode connection parameter you can set when opening a streaming session.
| Mode | Value | When to use |
| --- | --- | --- |
| Min latency | min_latency | Lowest possible time-to-text. Best when responsiveness matters more than catching every word. |
| Balanced (default) | balanced | A middle ground between latency and accuracy. Best for voice agents and other interactive applications. |
| Max accuracy | max_accuracy | Highest transcription accuracy. Best for note-taking, scribes, and post-call analysis where a small added delay is acceptable. |
Set the mode connection parameter when you open the WebSocket.
CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "mode": "balanced",  # min_latency | max_accuracy
}
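
As a rough sketch of how these parameters might reach the server, the snippet below validates the mode and encodes the parameters as a URL query string. The base URL is a hypothetical placeholder, and passing connection parameters in the query string is an assumption of this sketch; use the endpoint and transport documented for your account.

```python
from urllib.parse import urlencode

# Hypothetical placeholder endpoint; replace with the real streaming URL.
BASE_URL = "wss://streaming.example.com/ws"

# The three mode values listed in the table above.
VALID_MODES = {"min_latency", "balanced", "max_accuracy"}


def build_streaming_url(params: dict, base_url: str = BASE_URL) -> str:
    """Return base_url with the connection parameters appended as a query string."""
    mode = params.get("mode", "balanced")  # balanced is the documented default
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    # Lowercase booleans so they appear as true/false rather than True/False.
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{base_url}?{urlencode(encoded)}"


CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "mode": "max_accuracy",
}

url = build_streaming_url(CONNECTION_PARAMS)
print(url)
```

Validating the mode on the client surfaces a typo before the connection is opened instead of as a server-side rejection.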

Language Selection

By default, Universal-3.5 Pro Streaming runs in multilingual mode. Pass a language_code connection parameter to bias the model toward a single language. This is useful when you know the session is monolingual and want better accuracy in that language.
| Model | Languages |
| --- | --- |
| Universal-3 Pro Streaming | en, es, fr, de, it, pt |
| Universal-3.5 Pro Streaming | en, es, fr, de, it, pt, tr, nl, sv, no, da, fi, hi, vi, ar, he, ja, zh |
Set the language_code connection parameter when you open the WebSocket. Omit language_code to keep multilingual code-switching behavior.
CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "language_code": "es",
}
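
One way to guard against unsupported codes is a small client-side check before connecting. The sketch below is illustrative only: the helper name with_language and the use of the models' display names as dictionary keys are assumptions of this example, not part of the API.

```python
from typing import Optional

# Language codes per model, copied from the table above.
SUPPORTED_LANGUAGES = {
    "Universal-3 Pro Streaming": {"en", "es", "fr", "de", "it", "pt"},
    "Universal-3.5 Pro Streaming": {
        "en", "es", "fr", "de", "it", "pt", "tr", "nl", "sv",
        "no", "da", "fi", "hi", "vi", "ar", "he", "ja", "zh",
    },
}


def with_language(params: dict, model_name: str, language_code: Optional[str] = None) -> dict:
    """Return a copy of params with language_code set, or removed for multilingual mode."""
    result = dict(params)
    if language_code is None:
        # Omitting language_code keeps multilingual code-switching behavior.
        result.pop("language_code", None)
        return result
    if language_code not in SUPPORTED_LANGUAGES[model_name]:
        raise ValueError(f"{language_code!r} is not listed for {model_name}")
    result["language_code"] = language_code
    return result


base = {"sample_rate": 16000, "speech_model": "universal-3-5-pro"}
spanish = with_language(base, "Universal-3.5 Pro Streaming", "es")
multilingual = with_language(spanish, "Universal-3.5 Pro Streaming")
```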

Advanced: Tuning turn detection parameters

Beyond the mode parameter, you can tune individual turn detection parameters to fine-tune partial cadence and turn endpointing for your use case. The parameters differ by model.
Universal-3.5 Pro Streaming uses punctuation-based turn detection. Turns end when terminal punctuation (. ? !) is detected; if no punctuation is detected within max_turn_silence, the turn ends anyway. Each mode ships with its own set of defaults for these parameters. Override any of them on the connection to fine-tune further.
| Parameter | Default | Description |
| --- | --- | --- |
| min_turn_silence | min_latency: 96; balanced: 224; max_accuracy: 800 | Silence (ms) before a speculative end-of-turn check fires. Lower = faster turn endings; higher = fewer entity splits on numbers and proper nouns. |
| max_turn_silence | min_latency: 416; balanced: 1536; max_accuracy: 1536 | Maximum silence (ms) before forcing a turn to end, regardless of punctuation. Raise it when you expect a longer pause (caller reading a credit card, address). |
| interruption_delay | min_latency: 0; balanced: 500; max_accuracy: 500 | Time to first partial (ms). Lower = faster TTFT for barge-in detection; higher = more confident first partials. The server adds ~300 ms minimum on top. |
| continuous_partials | min_latency: true; balanced: true; max_accuracy: true | When true, emit a partial every ~3 s during continuous speech. Useful for long utterances where silence-based partials don’t fire often enough. |
| vad_threshold | min_latency: 0.3; balanced: 0.2; max_accuracy: 0.2 | Confidence threshold (0–1) for classifying audio frames as speech. Increase for noisy environments to reduce false speech detection. |
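
To make the interaction between mode defaults, overrides, and the two silence thresholds concrete, here is a minimal sketch. The default values are copied from the table above; effective_params and turn_should_end are hypothetical helpers, and turn_should_end is a simplified client-side model of the rule described in this section, not the server's actual implementation.

```python
# Per-mode defaults, copied from the parameter table above.
MODE_DEFAULTS = {
    "min_latency": {"min_turn_silence": 96, "max_turn_silence": 416,
                    "interruption_delay": 0, "continuous_partials": True,
                    "vad_threshold": 0.3},
    "balanced": {"min_turn_silence": 224, "max_turn_silence": 1536,
                 "interruption_delay": 500, "continuous_partials": True,
                 "vad_threshold": 0.2},
    "max_accuracy": {"min_turn_silence": 800, "max_turn_silence": 1536,
                     "interruption_delay": 500, "continuous_partials": True,
                     "vad_threshold": 0.2},
}

TERMINAL_PUNCTUATION = (".", "?", "!")


def effective_params(mode: str = "balanced", **overrides) -> dict:
    """Start from the mode's defaults and apply explicit overrides on top."""
    defaults = MODE_DEFAULTS[mode]
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


def turn_should_end(transcript: str, silence_ms: int, params: dict) -> bool:
    """Simplified model: punctuation check after min_turn_silence, forced end at max."""
    if silence_ms >= params["max_turn_silence"]:
        return True  # forced end, regardless of punctuation
    if silence_ms >= params["min_turn_silence"]:
        return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)
    return False  # not enough silence yet for the speculative check


params = effective_params("balanced", max_turn_silence=3000)
print(turn_should_end("My card number is 4242", 1600, params))
print(turn_should_end("That's all.", 300, params))
```

In this model, raising max_turn_silence to 3000 keeps the unpunctuated card-number turn open through a 1.6 s pause, while a punctuated sentence still ends shortly after min_turn_silence.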
Tuning recipe: long utterance prep
When your voice agent prompts the user for a long utterance (credit card, phone number, address), raise min_turn_silence mid-stream so brief pauses don’t fragment the turn:
{ "type": "UpdateConfiguration", "min_turn_silence": 1000 }
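After the response, restore your previous value. The example below restores 100; if you rely on a mode's default, use the matching value from the table above (96, 224, or 800):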
{ "type": "UpdateConfiguration", "min_turn_silence": 100 }
See Updating configuration mid-stream for the full list of mid-stream parameters.
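
The raise-then-restore pattern can be wrapped so the restore message is always sent, even if the prompt handling raises. This is a minimal sketch: long_utterance is a hypothetical helper, and send stands for whatever callable writes a text frame to your open WebSocket (for example, ws.send).

```python
import json
from contextlib import contextmanager


@contextmanager
def long_utterance(send, raised_ms: int = 1000, restore_ms: int = 100):
    """Raise min_turn_silence for the duration of the block, then restore it."""
    send(json.dumps({"type": "UpdateConfiguration", "min_turn_silence": raised_ms}))
    try:
        yield
    finally:
        # Runs even if the body raises, so the session is not left in the raised state.
        send(json.dumps({"type": "UpdateConfiguration", "min_turn_silence": restore_ms}))


# Usage with a stand-in sender that records messages instead of a real socket.
sent = []
with long_utterance(sent.append):
    pass  # prompt the user and wait for the long utterance here

print(sent)
```

Pass restore_ms explicitly if your session uses a different baseline than the 100 ms shown in the recipe.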
Universal Streaming uses confidence-based turn detection. The model predicts when speech naturally ends; if confidence exceeds end_of_turn_confidence_threshold and min_turn_silence has passed, the turn ends. Acoustic (silence-based) detection kicks in as a fallback after max_turn_silence.
| Parameter | Default | Description |
| --- | --- | --- |
| end_of_turn_confidence_threshold | 0.4 | Confidence threshold for semantic end-of-turn. Higher = more confident before ending; lower = ends faster. |
| min_turn_silence | 400 ms | Silence required before a semantic end-of-turn fires. |
| max_turn_silence | 1280 ms | Maximum silence before forcing a turn to end via acoustic detection. |
| vad_threshold | | Confidence threshold (0–1) for classifying audio frames as speech. Increase for noisy environments to reduce false speech detection. |
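
The decision rule described for Universal Streaming can be sketched as a small function. This is a simplified client-side model for reasoning about parameter choices, not the server's implementation; end_of_turn_reason is a hypothetical name, and the defaults are taken from the table above.

```python
from typing import Optional


def end_of_turn_reason(
    confidence: float,
    silence_ms: int,
    end_of_turn_confidence_threshold: float = 0.4,
    min_turn_silence: int = 400,
    max_turn_silence: int = 1280,
) -> Optional[str]:
    """Return why a turn would end in this simplified model, or None if it continues."""
    # Semantic path: confident prediction plus the minimum required silence.
    if confidence > end_of_turn_confidence_threshold and silence_ms >= min_turn_silence:
        return "semantic"
    # Acoustic fallback: enough silence ends the turn regardless of confidence.
    if silence_ms >= max_turn_silence:
        return "acoustic"
    return None


print(end_of_turn_reason(confidence=0.8, silence_ms=450))
print(end_of_turn_reason(confidence=0.1, silence_ms=1300))
print(end_of_turn_reason(confidence=0.8, silence_ms=200))
```

The three calls illustrate a semantic end, an acoustic fallback, and a turn that continues because min_turn_silence has not yet passed.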
Quick-start configurations
Aggressive: short, rapid back-and-forth (e.g., IVR replacements, order confirmations):
const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 160,
  max_turn_silence: 400,
};
Balanced: most conversational voice agents (e.g., customer support):
const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 400,
  max_turn_silence: 1280,
};
Conservative: reflective or complex speech (e.g., healthcare, sales, legal):
const streamingConfig = {
  end_of_turn_confidence_threshold: 0.7,
  min_turn_silence: 800,
  max_turn_silence: 3600,
};
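
For Python clients, the same three presets could be kept in one lookup table and merged into the connection parameters. This is a sketch under stated assumptions: the preset names and the connection_params helper are illustrative, and the check that min_turn_silence does not exceed max_turn_silence is a sanity check added for this example, not a documented server rule.

```python
# Values copied from the three quick-start configurations above.
PRESETS = {
    "aggressive": {"end_of_turn_confidence_threshold": 0.4,
                   "min_turn_silence": 160, "max_turn_silence": 400},
    "balanced": {"end_of_turn_confidence_threshold": 0.4,
                 "min_turn_silence": 400, "max_turn_silence": 1280},
    "conservative": {"end_of_turn_confidence_threshold": 0.7,
                     "min_turn_silence": 800, "max_turn_silence": 3600},
}


def connection_params(preset: str, sample_rate: int = 16000, **overrides) -> dict:
    """Merge a named preset with explicit overrides into one parameter dict."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    params = {"sample_rate": sample_rate, **PRESETS[preset], **overrides}
    # Assumed sanity check: the minimum silence should not exceed the maximum.
    if params["min_turn_silence"] > params["max_turn_silence"]:
        raise ValueError("min_turn_silence must not exceed max_turn_silence")
    return params


support_agent = connection_params("balanced")
legal_intake = connection_params("conservative", max_turn_silence=5000)
```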
Disabling turn detection
If you’re using your own VAD or turn detection model, send a ForceEndpoint event to force a turn boundary:
ws.send(json.dumps({"type": "ForceEndpoint"}))
Alternatively, set end_of_turn_confidence_threshold to 1 so that turns end only through the acoustic fallback, or to 0 so that turns end on silence alone. Setting it to 0 is not recommended unless you have a custom turn detection model running on top: it forces a turn at every pause of min_turn_silence length and fragments mid-sentence thinking pauses.
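
If you drive turn boundaries from your own VAD, the glue code might look like the sketch below. SilenceEndpointer is a hypothetical helper: it assumes your VAD yields one speech/non-speech decision per fixed-length audio frame, and send stands for the callable that writes a text frame to your open WebSocket. Only the ForceEndpoint message itself comes from the documentation above.

```python
import json


class SilenceEndpointer:
    """Send ForceEndpoint once trailing silence after speech reaches silence_ms."""

    def __init__(self, send, silence_ms: int = 600, frame_ms: int = 20):
        self.send = send
        self.silence_ms = silence_ms
        self.frame_ms = frame_ms
        self._in_turn = False   # True once speech has been heard in this turn
        self._silence = 0       # trailing silence accumulated so far, in ms

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True if a ForceEndpoint was sent."""
        if is_speech:
            self._in_turn = True
            self._silence = 0
            return False
        if not self._in_turn:
            return False  # silence before any speech: nothing to end
        self._silence += self.frame_ms
        if self._silence >= self.silence_ms:
            self.send(json.dumps({"type": "ForceEndpoint"}))
            self._in_turn = False
            self._silence = 0
            return True
        return False


# Usage with a stand-in sender that records messages instead of a real socket.
sent = []
endpointer = SilenceEndpointer(sent.append, silence_ms=60, frame_ms=20)
for decision in [False, True, True, False, False, False, False]:
    endpointer.on_frame(decision)
```

Tracking whether speech has started avoids sending ForceEndpoint during leading silence or repeatedly during one long pause.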