November 12, 2025

Introducing Multilingual Universal-Streaming: Go global with ultra-fast, ultra-accurate real-time speech-to-text

Universal-Streaming now supports six languages—English, Spanish, French, German, Italian, and Portuguese—in a single, unified model, while maintaining the superior accuracy that makes it the industry-leading streaming speech-to-text solution for voice agents.

AI voice agents

Streaming Speech-to-Text

Madison Bernstein

Product Marketing

Real-time speech AI is transforming how businesses operate globally, from voice agents handling customer calls to AI assistants capturing meeting notes in multiple languages. But expanding beyond English presents technical and business challenges that force companies to choose between market reach and product quality.

Now, you can eliminate those trade-offs.

Universal-Streaming delivers six languages in one unified model (English, Spanish, French, German, Italian, and Portuguese) with industry-leading multilingual performance for production real-time applications.

Real-time Voice AI needs to speak everyone's language

The demand for multilingual real-time speech AI spans every industry and use case. Voice agents serving customers in Madrid. Real-time agent assist tools helping support teams in São Paulo. Meeting assistants capturing board discussions across European offices. Medical scribes documenting patient consultations in multiple languages. But going global with real-time speech isn't always straightforward.

The hidden costs of going global

Most providers treat non-English languages as premium add-ons, charging up to 2x for multilingual support, with specialized languages requiring separate minimum commitments. However, the real expense arises from inconsistencies in accuracy across languages. A 3% higher error rate translates to 5-10% more in human QA costs, which becomes an expensive oversight at scale.

In regulated environments like healthcare or finance, where real-time transcription must be precise, these variable accuracy levels create unpredictable operational expenses throughout your workflows.

Unified architecture built for multilingual performance

Universal-Streaming eliminates these barriers with a single, unified architecture trained on all languages simultaneously. Rather than routing requests through language detection gateways, the model processes audio directly with a shared architecture optimized for multilingual conversations.

Key benefits include:

Instant processing: The model handles all six languages natively in a single forward pass, eliminating detection latency and routing complexity from your stack.

Natural code-switching: Because all six languages share one audio embedding space, the model can follow language switches within a single utterance without any special configuration, for example:

"Hola, can you help me find el restaurante on Main Street?"
"Je voudrais un coffee, s'il vous plaît, large with extra milk"

You get consistent quality across all languages with transparent, scalable pricing and a single WebSocket connection.

Consistent improvements: When we improve the model, every language benefits from the same architectural enhancements, decoder optimizations, and attention mechanism improvements.

Real-world performance benchmarks

Word Error Rate (WER) and P50 latency were measured on diverse real-world audio, including call center recordings with background noise, medical consultations with domain terminology, and business meetings with multiple speakers. Any model can perform well on clean audio, but phone calls and meeting rooms are far from perfect and present real challenges.

Why these metrics matter:

WER (Word Error Rate) directly impacts user experience: Lower WER means fewer frustrating misunderstandings in your voice agents and more accurate transcripts for downstream processing.
P50 latency affects conversation flow: Consistent sub-400ms latency ensures your voice agents respond naturally without awkward pauses that break the conversation.
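
For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the model output (substitutions + deletions + insertions), divided by the number of reference words. The sketch below implements that standard definition. The function name and the simple lowercase, whitespace-based tokenization are illustrative assumptions, not the text normalization used for the benchmarks in this post.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("je voudrais un coffee", "je voudrais un café")` returns 0.25: one substituted word out of four reference words.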

Provider	Average WER	P50 Latency (ms)
AssemblyAI	11.77%	303
Deepgram (Nova-3)	12.76%	449

Testing methodology: Evaluated across datasets of multilingual audio with samples from all six supported languages. Latency was measured as the time from when each word ended in the audio stream to when that same word first appeared in the transcript. Only words transcribed correctly were included, and the P50 (median) of those word-level latencies was calculated.
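
The latency procedure described above can be sketched in a few lines. The `WordTiming` record and its field names are hypothetical: they stand in for whatever per-word timing data your own evaluation harness collects and are not part of the AssemblyAI API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class WordTiming:
    """Hypothetical per-word record produced by an evaluation harness."""
    audio_end_ms: float      # when the word ended in the audio stream
    first_emitted_ms: float  # when the word first appeared in the transcript
    correct: bool            # whether the word was transcribed correctly


def p50_latency_ms(words: list[WordTiming]) -> float:
    """Median word-level latency, counting correctly transcribed words only."""
    latencies = [
        w.first_emitted_ms - w.audio_end_ms
        for w in words
        if w.correct
    ]
    if not latencies:
        raise ValueError("no correctly transcribed words to measure")
    return median(latencies)
```

Excluding incorrect words keeps the latency figure from being skewed by words that have no reliable counterpart in the reference audio.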

Test with your own audio: These benchmarks are a starting point, but your specific use case matters most. Use our Playground to speak to the model and evaluate Universal-Streaming's accuracy and latency with your actual data before integrating.

One price for all languages

Going global shouldn't limit your ability to scale. All six languages, including non-English, are priced equally at $0.15/hr, so you can expand your real-time applications confidently across markets.

Provider	Multilingual cost per hour
AssemblyAI	$0.15/hr
Deepgram Nova-3	$0.55/hr
Google STT	$0.96/hr
AWS Transcribe	$1.44/hr
Gladia	$0.76/hr
Speechmatics (Enhanced)	$0.56/hr
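
To see what these rates mean at volume, the sketch below multiplies each listed hourly price by a monthly usage figure. The prices are copied from the table above. The 10,000 hours per month volume is an arbitrary assumption chosen for illustration, not a figure from this post.

```python
# Hourly multilingual streaming prices in USD, copied from the table above.
PRICE_PER_HOUR = {
    "AssemblyAI": 0.15,
    "Deepgram Nova-3": 0.55,
    "Speechmatics (Enhanced)": 0.56,
    "Gladia": 0.76,
    "Google STT": 0.96,
    "AWS Transcribe": 1.44,
}


def monthly_cost(hours: float) -> dict[str, float]:
    """Return the estimated cost per provider for the given hours of audio."""
    return {name: round(rate * hours, 2) for name, rate in PRICE_PER_HOUR.items()}


if __name__ == "__main__":
    # 10,000 hours/month is an assumed volume for illustration only.
    for name, cost in monthly_cost(10_000).items():
        print(f"{name:<26}${cost:>10,.2f}")
```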

Production-ready from day one

Every transcript arrives perfectly formatted with the capabilities you need to deliver exceptional voice-agent experiences:

Punctuated text: Proper punctuation and sentence boundaries are included automatically, so transcripts are immediately readable for end users and ready for downstream LLM processing without additional formatting steps.

Proper capitalization: Names, places, and sentences are formatted appropriately and consistently. You get professional-quality output without building custom post-processing pipelines to handle proper nouns or sentence structure.

Intelligent endpointing: Built-in end-of-turn detection for natural conversations. The model automatically identifies when a speaker has finished their turn, enabling your voice agent to respond at the right moment without awkward interruptions or delays that break conversation flow.
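
As a rough sketch of how an application might consume these features, the handler below returns text only once a turn is marked complete and formatted. The message shape is an assumption made for illustration: a JSON object with `type`, `transcript`, `end_of_turn`, and `turn_is_formatted` fields. Check the streaming API reference for the actual schema before relying on these names.

```python
import json
from typing import Optional


def completed_turn_text(raw_message: str) -> Optional[str]:
    """Return the formatted transcript once a speaker's turn has ended, else None.

    The field names used here are assumed for illustration; verify them
    against the streaming API documentation.
    """
    message = json.loads(raw_message)
    if message.get("type") != "Turn":
        return None  # ignore session or control messages
    if not message.get("end_of_turn"):
        return None  # the speaker is still talking; keep listening
    if not message.get("turn_is_formatted"):
        return None  # wait for the punctuated, capitalized version
    return message.get("transcript", "")
```

A voice agent could call this on every incoming message and pass any non-`None` result to its LLM, so it responds only after the speaker has finished.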

Integration in minutes, not months

Universal-Streaming is compatible with your existing stack. You can start building immediately:

LiveKit: Native integration as the default provider
Vapi: Configure multilingual agents with one parameter
Pipecat/Daily: Drop-in replacement for any STT provider
Via the API: Same WebSocket, new languages

Get started with a few lines of code

Just getting started? Set the "speech_model" connection parameter to "universal-streaming-multilingual". For a quick start guide, check out our full docs.

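RATE = 16000  # sample rate of your audio in Hz (16 kHz assumed here)

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
CONNECTION_PARAMS = {
    "sample_rate": RATE,
    "format_turns": True,  # request formatted turns
    "speech_model": "universal-streaming-multilingual",  # select the multilingual model
}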
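
One way to turn these parameters into a connection URL is to append them as a query string. The sketch below uses only the standard library and stops short of opening the socket. The 16 kHz default sample rate, the `build_streaming_url` helper name, and the choice to pass the parameters as a query string are assumptions to check against the docs.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(sample_rate: int = 16000) -> str:
    """Return the WebSocket URL with the multilingual model selected."""
    params = {
        "sample_rate": sample_rate,  # should match the audio you send
        "format_turns": True,
        "speech_model": "universal-streaming-multilingual",
    }
    # urlencode turns the dict into "key=value&key=value" with safe escaping.
    return f"{BASE_URL}?{urlencode(params)}"
```

Pass the resulting URL to the WebSocket client of your choice and authenticate with your API key.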

Build global voice agents with Universal-Streaming

Three easy ways to get started:

Implement immediately: Multilingual is available now through our API. Open a WebSocket connection to our wss://streaming.assemblyai.com/v3/ws endpoint using your current API key.
Try it in the Playground: Use our no-code Playground to see how Universal-Streaming performs on your own audio and use cases.
Explore the documentation: Review our comprehensive Getting Started Guide and technical documentation for detailed implementation information.
