English, Spanish, French, German, Italian, and Portuguese
Multilingual streaming allows you to transcribe audio streams in multiple languages.
Universal-Streaming Multilingual is billed on the total duration that your WebSocket connection stays open, not on the amount of audio you send. Always send a Terminate message when you’re done with a stream. Sessions that aren’t closed auto-close after 3 hours and are billed for the full duration. See Billing and pricing for details.
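The close-out pattern can be sketched roughly as follows. This is a minimal illustration, not the official client: it assumes the Terminate message is a JSON object of the form {"type": "Terminate"} (the payload shape is not spelled out in this section, so confirm it against the API reference), and stream_audio and FakeSocket are hypothetical names standing in for your own streaming loop and a real WebSocket connection.

```python
import asyncio
import json

# Assumed payload shape; this section does not spell it out.
TERMINATE_MESSAGE = json.dumps({"type": "Terminate"})


async def stream_audio(ws, chunks):
    """Send audio chunks, then always send Terminate, even if sending fails."""
    try:
        for chunk in chunks:
            await ws.send(chunk)
    finally:
        # Runs on success and on error, so the billed session ends promptly.
        await ws.send(TERMINATE_MESSAGE)


class FakeSocket:
    """Stand-in for a real WebSocket connection, for illustration only."""

    def __init__(self):
        self.sent = []

    async def send(self, message):
        self.sent.append(message)


fake = FakeSocket()
asyncio.run(stream_audio(fake, [b"\x00\x01", b"\x02\x03"]))
print(fake.sent[-1])
```

Putting the Terminate call in a finally block means it is attempted whether the audio loop finishes normally or raises, which helps avoid sessions that stay open until the 3-hour auto-close.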
If you need support beyond the 6 languages listed here, consider using the
Whisper Streaming model (speech_model: "whisper-rt"), which supports
99 languages with automatic language detection. See the Whisper
Streaming section below for details.
To use multilingual streaming, include speech_model=universal-streaming-multilingual as a query parameter in the WebSocket URL.
Multilingual currently supports: English, Spanish, French, German, Italian, and Portuguese.
The Python example uses the websockets library. If you’re using websockets version 13.0 or later, use the additional_headers parameter. For older versions (< 13.0), use extra_headers instead.
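One way to handle both cases is to pick the keyword name from the installed version. The sketch below follows the version rule stated above. header_kwarg_name and connect_kwargs are hypothetical helpers, and sending the API key in an Authorization header is an assumption that this section does not confirm.

```python
def header_kwarg_name(websockets_version):
    """Return the header keyword argument name for a websockets version string."""
    major = int(websockets_version.split(".")[0])
    return "additional_headers" if major >= 13 else "extra_headers"


def connect_kwargs(websockets_version, api_key):
    """Build keyword arguments for the connect call (header name is an assumption)."""
    return {header_kwarg_name(websockets_version): {"Authorization": api_key}}


print(connect_kwargs("12.0", "YOUR_API_KEY"))
print(connect_kwargs("13.1", "YOUR_API_KEY"))

# With the library installed, this might be used as:
#   import websockets
#   kwargs = connect_kwargs(websockets.__version__, "YOUR_API_KEY")
#   async with websockets.connect(url, **kwargs) as ws:
#       ...
```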
The multilingual streaming model supports automatic language detection, allowing you to identify which language is being spoken in real time. When enabled, the model returns the detected language code and confidence score with each complete utterance and final turn.
To enable language detection, include language_detection=true as a query parameter in the WebSocket URL:
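As a rough sketch, the URL could be assembled with the standard library as shown here. The endpoint in BASE_URL is an assumption (this section does not state it), and build_streaming_url is a hypothetical helper. Only the speech_model and language_detection parameters come from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the API reference before use.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(speech_model, **params):
    """Join the base URL with URL-encoded query parameters."""
    query = {"speech_model": speech_model, **params}
    return f"{BASE_URL}?{urlencode(query)}"


url = build_streaming_url(
    "universal-streaming-multilingual",
    language_detection="true",  # passed as a string so it encodes as "true"
)
print(url)
```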
When language detection is enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:
- language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
- language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

- The utterance field is non-empty and contains a complete utterance
- The end_of_turn field is true

Here’s an example Turn message with language detection enabled, showing Spanish being detected:
In this example, the model detected Spanish ("es") with a confidence of 0.999997.
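Reading these fields on the client side might look like the sketch below. The message is illustrative only: the language_code and language_confidence values match the example described above, while the "type": "Turn" discriminator and the transcript text are assumptions made for this sketch. detected_language is a hypothetical helper.

```python
import json

# Illustrative message; everything except language_code and
# language_confidence is made up for this sketch.
raw_message = json.dumps({
    "type": "Turn",
    "end_of_turn": True,
    "transcript": "Buenos días",
    "utterance": "Buenos días",
    "language_code": "es",
    "language_confidence": 0.999997,
})


def detected_language(message):
    """Return (language_code, language_confidence), or None if not present."""
    turn = json.loads(message)
    if turn.get("type") != "Turn":
        return None
    # The fields are only present on complete utterances and final turns.
    if "language_code" not in turn:
        return None
    return turn["language_code"], turn.get("language_confidence")


print(detected_language(raw_message))
```

Checking for the key rather than assuming it exists keeps the handler safe on partial turns, where the language fields are omitted.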
The multilingual model produces transcripts with punctuation and capitalization already built into the model outputs. This means you’ll receive properly formatted text without requiring any additional post-processing.
While the API still returns the turn_is_formatted parameter to maintain
interface consistency with other streaming models, the multilingual model
doesn’t perform additional formatting operations. All transcripts from the
multilingual model are already formatted as they’re generated.
Whisper streaming allows you to transcribe audio streams in 99 languages using the WhisperLiveKit model. To use Whisper streaming, set speech_model to "whisper-rt" in the WebSocket URL.
The whisper-rt model does not support the language parameter. The model
automatically detects the language being spoken. Do not include a language
parameter when using this model.
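A small guard can catch an accidental language parameter before the connection is opened. This is a sketch under assumptions: the endpoint in BASE_URL is not given in this section, and build_whisper_url is a hypothetical helper rather than part of any SDK.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the API reference before use.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_whisper_url(**params):
    """Build a whisper-rt URL, rejecting the unsupported language parameter."""
    if "language" in params:
        raise ValueError(
            "whisper-rt detects the language automatically; omit 'language'"
        )
    query = {"speech_model": "whisper-rt", **params}
    return f"{BASE_URL}?{urlencode(query)}"


print(build_whisper_url())

try:
    build_whisper_url(language="es")
except ValueError as error:
    print(f"rejected: {error}")
```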
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
The Whisper streaming model supports automatic language detection, allowing you to identify which language is being spoken in real time. To enable it, include language_detection=true as a query parameter in the WebSocket URL:
When enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:
- language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
- language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

- The utterance field is non-empty and contains a complete utterance
- The end_of_turn field is true

The Whisper streaming model can detect and transcribe non-speech audio events. These are returned as bracketed tags in the utterance field. Common non-speech tags include:

- [Silence]: Periods of silence or no speech
- [Música] / [Music]: Background music detected

Non-speech tags appear in the utterance field with brackets. The transcript field contains the raw text without formatting. You can filter out non-speech turns by checking if the utterance contains bracketed tags like [Silence] or [Music].
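The filtering described above could be sketched with a regular expression. This version treats an utterance as non-speech only when it consists entirely of bracketed tags, which is a design choice for this sketch and not documented behavior. is_non_speech is a hypothetical helper and the sample utterances are invented.

```python
import re

# Matches an utterance made up only of one or more bracketed tags.
NON_SPEECH_ONLY = re.compile(r"^\s*(\[[^\[\]]+\]\s*)+$")


def is_non_speech(utterance):
    """Return True when the utterance contains nothing but bracketed tags."""
    return bool(NON_SPEECH_ONLY.match(utterance))


# Invented sample utterances for illustration.
turns = ["[Silence]", "Hola, ¿cómo estás?", "[Música]", "[Music] [Silence]"]
speech_only = [u for u in turns if not is_non_speech(u)]
print(speech_only)
```

Matching on the bracket pattern instead of a fixed tag list means localized tags such as [Música] are filtered without needing to enumerate every variant.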
By default, the Whisper streaming model returns unformatted transcripts. To receive formatted transcripts with proper punctuation and capitalization, set format_turns=true as a query parameter.
For voice agent pipelines, formatting is not required since LLMs process
unformatted text directly. For notetaking and closed captioning applications,
enable format_turns to make output human-readable.