Streaming Speech-to-Text
Power real-time voice experiences with ultra-fast and ultra-accurate speech-to-text, unlimited concurrency, and pricing that scales with you.
See the difference in real time
Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.
Built with the capabilities for every real-time use case
With Universal-3 Pro and Universal-Streaming, every use case is covered. Build industry-leading voice agents, or power your real-time note-taking use case with every capability built in.
Features | AssemblyAI Universal-3 Pro Streaming | AssemblyAI Universal-Streaming | Deepgram Nova-3 | OpenAI GPT-4o Transcribe | Microsoft Azure | ElevenLabs Scribe V2 |
|---|---|---|---|---|---|---|
(Lower is better) | 16.7% | 22.9% | 25.2% | 23.3% | 25.1% | 22.1% |
Industry Leading | Unreliable | Unreliable | Unreliable | |||
Static only | ||||||
Commitments and overages | Contracts at scale | |||||
Partial |
Real-time accuracy where Voice AI actually operates
Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.
Missed Entity Rate: Universal-3 Pro Streaming vs. Universal-Streaming
Lower is better · % of entities not correctly transcribed
Entity Recognition on actual customer data
Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.
Missed Entity Rate by Category — All Providers
Lower is better · Universal-3 Pro Streaming highlighted
Word Error Rate (%)
Lower is better · English, all domains
Built for every streaming use case
Every feature engineered for the demands of real voice agent infrastructure.
Industry-leading entity accuracy
Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.
Unlimited concurrency, no rate limits
Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.
Real-time speaker diarization
Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.
Dynamic key term prompting
Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.
One-line integrations
Native support for LiveKit, Pipecat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.
Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.
Sub-200ms end-to-end latency
Open community models
We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.
Global language coverage
Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.
Ready to plug into your voice‑agent stack
Pre-built integrations with step‑by‑step docs enabling quick implementation without disrupting existing workflows.
More on our models
Universal-Streaming
Create voice experiences that feel more intuitive and responsive while maintaining the flexibility to optimize for your unique requirements.
Frequently Asked Questions
Streaming speech-to-text transcribes live audio as it’s spoken. You send audio over a secure WebSocket to the API, which returns transcripts within a few hundred milliseconds (~300 ms P50). Built for low latency, these models use limited context and apply intelligent endpointing to detect end‑of‑turns.
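To make the "~300 ms P50" figure concrete: P50 is the median, so half of the measured transcripts arrive faster than that value. The sketch below assumes latency is measured from the moment an audio chunk is sent to the moment the matching transcript is received; the function name and the timestamp lists are illustrative, not part of the API.

```python
import statistics
from typing import Sequence


def p50_latency_ms(send_times: Sequence[float], receive_times: Sequence[float]) -> float:
    """Return the median (P50) latency in milliseconds.

    Both sequences hold timestamps in seconds, paired by position.
    Assumption: each transcript can be matched to the chunk that produced it.
    """
    if len(send_times) != len(receive_times) or not send_times:
        raise ValueError("need two non-empty sequences of equal length")
    latencies = [(r - s) * 1000.0 for s, r in zip(send_times, receive_times)]
    return statistics.median(latencies)
```

In practice you would record the timestamps with `time.monotonic()` on the client, so clock adjustments do not distort the measurement.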
Yes. Universal-Streaming supports unlimited concurrent streams with automatic scaling and no hard caps. Accounts start with a per-minute limit on new streams (e.g., 100/min on pay‑as‑you‑go), which increases by 10% every 60 seconds while utilization is at or above 70%. If you briefly exceed your current limit, new connections may return error 1008 until the limit scales up; baselines can be raised on request.
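The scaling rule above can be sketched as simple compounding arithmetic. This sketch assumes the 10% increase compounds on the current limit at every 60-second step and that utilization stays at or above 70% throughout. The text does not say how rounding is handled, and the function names are illustrative.

```python
def projected_limit(base_limit: float, intervals: int, growth: float = 0.10) -> float:
    """New-stream limit after `intervals` consecutive 60-second scale-ups.

    Assumption: each step multiplies the current limit by (1 + growth).
    """
    if intervals < 0:
        raise ValueError("intervals must be non-negative")
    return base_limit * (1.0 + growth) ** intervals


def intervals_to_reach(base_limit: float, target: float, growth: float = 0.10) -> int:
    """Number of 60-second steps until the limit first meets `target`."""
    limit = base_limit
    steps = 0
    while limit < target:
        limit *= 1.0 + growth
        steps += 1
    return steps
```

Under these assumptions, a baseline of 100 new streams per minute would take about eight minutes of sustained load to double, which is why a higher baseline is worth requesting ahead of a known traffic spike.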
Create a free account and get an API key, then connect to wss://streaming.assemblyai.com/v3/ws via SDK or WebSocket. Set sample_rate (e.g., 16000), start a microphone stream, send 50–1000 ms audio chunks, and handle Begin/Turn events. You’ll see transcripts within a few hundred milliseconds. Close the session when done.
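The client-side steps above can be sketched without opening a connection. This is a minimal sketch under several assumptions: `sample_rate` is passed as a query parameter, audio is 16-bit mono PCM, and server messages are JSON objects with a `type` field and, for Turn events, a `transcript` field. Only the endpoint URL, the 50–1000 ms chunk range, and the Begin/Turn event names come from the text; the helper names are illustrative.

```python
import json
from typing import Iterator, Optional
from urllib.parse import urlencode

STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws"  # endpoint named in the text


def build_url(sample_rate: int = 16000) -> str:
    # Assumption: sample_rate is sent as a query parameter.
    return f"{STREAMING_URL}?{urlencode({'sample_rate': sample_rate})}"


def chunk_size_bytes(sample_rate: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes in one chunk of mono PCM (assumed 16-bit, so 2 bytes per sample)."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms must be between 50 and 1000")
    return sample_rate * chunk_ms // 1000 * bytes_per_sample


def iter_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split a PCM buffer into fixed-duration chunks; the last one may be shorter."""
    size = chunk_size_bytes(sample_rate, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def handle_message(raw: str) -> Optional[str]:
    """Return the transcript for a Turn event, or None for anything else.

    Assumption: the "type" and "transcript" field names are not given in the text.
    """
    message = json.loads(raw)
    if message.get("type") == "Turn":
        return message.get("transcript", "")
    return None  # e.g. the Begin event that opens a session
```

A real client would send each chunk from `iter_chunks` over the WebSocket and pass every incoming text frame to `handle_message`. Authentication and session teardown are left out here because the text does not describe them.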
Universal-Streaming is $0.15 per hour. Billing is based on total session duration (the time your connection stays open). The optional Keyterms Prompting add-on is $0.04/hr. The free tier includes up to 333 hours of streaming. Volume discounts and custom pricing are available.
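Because billing follows session duration rather than audio duration, a rough estimate needs only the time a connection stays open. The sketch below uses the Universal-Streaming rates quoted above and assumes billing is prorated to the second with no minimum charge; the actual rounding rules are not stated here, and the function name is illustrative.

```python
from decimal import Decimal

BASE_RATE_PER_HOUR = Decimal("0.15")      # Universal-Streaming, from the text
KEYTERMS_RATE_PER_HOUR = Decimal("0.04")  # optional Keyterms Prompting add-on
SECONDS_PER_HOUR = Decimal(3600)


def session_cost(duration_seconds: int, keyterms: bool = False) -> Decimal:
    """Estimated cost in USD for one session that stays open `duration_seconds`.

    Assumption: per-second proration with no minimum charge.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    rate = BASE_RATE_PER_HOUR + (KEYTERMS_RATE_PER_HOUR if keyterms else Decimal(0))
    hours = Decimal(duration_seconds) / SECONDS_PER_HOUR
    return (rate * hours).quantize(Decimal("0.0001"))
```

For example, under these assumptions a 30-minute session with Keyterms Prompting enabled comes to $0.095. Idle time on an open connection counts toward the total, so closing sessions promptly keeps the estimate close to actual talk time.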
Universal-Streaming delivers immutable, low-latency transcripts; intelligent, configurable endpointing using semantic plus acoustic cues; word-level timestamps and confidence; Keyterms Prompting (English) to boost critical vocabulary; and unlimited concurrent streams.
Universal-Streaming transcribes English by default. For multilingual streaming, use the universal-streaming-multilingual model, which supports English, Spanish, French, German, Italian, and Portuguese (beta). Additional languages are planned for late 2025/early 2026.
Unlock the value of voice data
Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.