Multilingual streaming
Supported languages
English, Spanish, French, German, Italian, and Portuguese
Multilingual streaming allows you to transcribe audio streams in multiple languages.
Need more than 6 languages?
If you need support beyond the 6 languages listed here, consider using the Whisper Streaming model (speech_model: "whisper-rt"), which supports 99 languages with automatic language detection. See the Whisper Streaming section below for details.
Configuration
To use multilingual streaming, include speech_model=universal-streaming-multilingual as a query parameter in the WebSocket URL.
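As a minimal sketch, the parameter can be appended when constructing the URL. The host shown here is a placeholder, and sample_rate is an assumed illustrative parameter; substitute the real streaming endpoint and parameters from your setup.

```python
from urllib.parse import urlencode

def build_streaming_url(base_url: str, **params: str) -> str:
    """Append query parameters (e.g. speech_model) to a WebSocket URL."""
    return f"{base_url}?{urlencode(params)}"

# Placeholder host; substitute the real streaming endpoint.
url = build_streaming_url(
    "wss://example-streaming-endpoint/ws",
    sample_rate="16000",  # assumed parameter, for illustration only
    speech_model="universal-streaming-multilingual",
)
print(url)
```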
Quickstart
First, install the required dependencies.
Language detection
The multilingual streaming model supports automatic language detection, allowing you to identify which language is being spoken in real time. When enabled, the model returns the detected language code and a confidence score with each complete utterance and final turn.
Configuration
To enable language detection, include language_detection=true as a query parameter in the WebSocket URL.
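A sketch of the resulting URL (placeholder host; substitute the real streaming endpoint):

```python
from urllib.parse import urlencode

params = {
    "speech_model": "universal-streaming-multilingual",
    "language_detection": "true",
}
# Placeholder host; substitute the real streaming endpoint.
url = "wss://example-streaming-endpoint/ws?" + urlencode(params)
print(url)
```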
Output format
When language detection is enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:
- language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
- language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection
The language_code and language_confidence fields only appear when either:
- The utterance field is non-empty and contains a complete utterance
- The end_of_turn field is true
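The two conditions above can be checked with a small helper. This is a sketch: msg stands for a parsed Turn message, and the non-empty check is used as a stand-in for "contains a complete utterance".

```python
def expects_language_fields(msg: dict) -> bool:
    """Return True when a Turn message should carry language_code /
    language_confidence: a non-empty utterance or a final turn."""
    return bool(msg.get("utterance")) or msg.get("end_of_turn") is True

print(expects_language_fields({"utterance": "Hola.", "end_of_turn": False}))
print(expects_language_fields({"utterance": "", "end_of_turn": False}))
```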
Example response
Here’s an example Turn message with language detection enabled, showing Spanish being detected:
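A sketch of such a message (the utterance text and any fields beyond those documented above are illustrative):

```json
{
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "Hola, ¿cómo estás?",
  "utterance": "Hola, ¿cómo estás?",
  "language_code": "es",
  "language_confidence": 0.999997
}
```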
In this example, the model detected Spanish ("es") with a confidence of 0.999997.
Understanding formatting
The multilingual model produces transcripts with punctuation and capitalization already built into the model outputs. This means you’ll receive properly formatted text without requiring any additional post-processing.
While the API still returns the turn_is_formatted parameter to maintain
interface consistency with other streaming models, the multilingual model
doesn’t perform additional formatting operations. All transcripts from the
multilingual model are already formatted as they’re generated.
In the future, this built-in formatting capability will be extended to our English-only streaming model as well.
Whisper Streaming
Whisper streaming allows you to transcribe audio streams in 99 languages using the WhisperLiveKit model. To use Whisper streaming, include speech_model=whisper-rt as a query parameter in the WebSocket URL.
The whisper-rt model does not support the language parameter. The
model automatically detects the language being spoken. Do not include a
language parameter when using this model.
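A sketch (placeholder host) that selects whisper-rt and guards against accidentally passing a language parameter, which should not be sent to this model:

```python
from urllib.parse import urlencode

def build_whisper_url(base_url: str, **params: str) -> str:
    """Build a whisper-rt streaming URL; strip any language parameter,
    since the model auto-detects the spoken language."""
    params.pop("language", None)  # whisper-rt does not accept `language`
    params["speech_model"] = "whisper-rt"
    return f"{base_url}?{urlencode(params)}"

# Placeholder host; the language argument is dropped on purpose.
url = build_whisper_url("wss://example-streaming-endpoint/ws", language="es")
print(url)
```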
Supported languages (99)
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
Language detection
The Whisper streaming model supports automatic language detection, allowing you to identify which language is being spoken in real time. To enable it, include language_detection=true as a query parameter in the WebSocket URL.
When enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:
- language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
- language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection
The language_code and language_confidence fields only appear when either:
- The utterance field is non-empty and contains a complete utterance
- The end_of_turn field is true
Example response
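A sketch of such a message (all values are illustrative; the field set mirrors the language-detection fields documented above):

```json
{
  "end_of_turn": true,
  "transcript": "bonjour tout le monde",
  "utterance": "Bonjour tout le monde.",
  "language_code": "fr",
  "language_confidence": 0.98
}
```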
Non-speech tags
The Whisper streaming model can detect and transcribe non-speech audio events. These are returned as bracketed tags in the utterance field. Common non-speech tags include:
- [Silence] - Periods of silence or no speech
- [Música] / [Music] - Background music detected
- Other audio events may appear in a similar bracketed format
Non-speech tags appear in the utterance field with brackets. The
transcript field contains the raw text without formatting. You can filter
out non-speech turns by checking if the utterance contains bracketed tags
like [Silence] or [Music].
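One way to filter such turns (a sketch; this treats an utterance consisting solely of one bracketed tag as non-speech, which is stricter than a plain substring check):

```python
import re

# Matches utterances that consist solely of one bracketed tag, e.g. "[Silence]".
NON_SPEECH = re.compile(r"^\[[^\]]+\]$")

def is_non_speech(utterance: str) -> bool:
    """Return True for utterances that are only a non-speech tag."""
    return bool(NON_SPEECH.match(utterance.strip()))

print(is_non_speech("[Silence]"))    # non-speech tag
print(is_non_speech("[Música]"))     # non-speech tag
print(is_non_speech("Hello there"))  # real speech
```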
Understanding formatting
By default, the Whisper streaming model returns unformatted transcripts. To receive formatted transcripts with proper punctuation and capitalization, set format_turns=true as a query parameter.
Enabling format_turns adds additional latency to the transcription. We
recommend keeping it off for voice agents where low latency is critical,
and on for notetaking applications where formatted output is more
important than speed.
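Following the recommendation above, a sketch (placeholder host) that enables format_turns only when latency is not critical:

```python
from urllib.parse import urlencode

def whisper_url(base_url: str, low_latency: bool) -> str:
    """Enable format_turns only when latency is not critical
    (e.g. notetaking); keep it off for voice agents."""
    params = {"speech_model": "whisper-rt"}
    if not low_latency:
        params["format_turns"] = "true"
    return f"{base_url}?{urlencode(params)}"

# Notetaking app: formatted output matters more than speed.
print(whisper_url("wss://example-streaming-endpoint/ws", low_latency=False))
```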