Streaming Speech-to-Text
Power real-time voice experiences with ultra-fast and ultra-accurate speech-to-text, unlimited concurrency, and pricing that scales with you.
Universal-Streaming
Ultra-fast, ultra-accurate streaming speech-to-text
Intelligent turn detection
Create voice experiences that feel more intuitive and responsive while maintaining the flexibility to optimize for your unique requirements.

See it in action
Hello! Try our newest Universal-Streaming speech-to-text model. Experience how fast and accurate it is in our Playground.
Ultra-fast transcription understands users as they speak
300 ms (P50) latency on immutable finals gives downstream services a head-start without mid-stream revisions.
- Delivers reliable, unchanging transcripts from the beginning.
- Adjustable speed↔post‑processing dial to fit every use case.
- Almost 2x faster on P99 latencies compared to Deepgram Nova-3.
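For readers unfamiliar with the P50 and P99 figures quoted above, here is a minimal sketch of how such percentiles can be computed from measured latencies. The sample values are invented for illustration and are not benchmark data, and the nearest-rank method is just one common convention, not necessarily the one used for the published numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds (illustrative only, not measured data).
latencies_ms = [280, 290, 295, 300, 300, 305, 310, 320, 450, 900]

p50 = percentile(latencies_ms, 50)  # typical (median) latency
p99 = percentile(latencies_ms, 99)  # tail latency, dominated by the slowest responses
print(f"P50={p50} ms, P99={p99} ms")
```

P50 describes the typical response, while P99 shows how slow the worst responses get, which is why both are worth tracking for a voice agent.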
Intelligent endpointing for smoother turn detection
Conversations flow naturally: your agent replies with precise timing, reducing awkward pauses and interruptions.
- Maintain full control with configurable silence thresholds and confidence parameters to fine-tune the experience for your specific use case.
- Decreases end‑of‑turn delay versus traditional silence detection.
- Handle natural pauses without premature interruptions.
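The interplay between silence thresholds and confidence parameters described above can be pictured with a small decision function. This is only a sketch under stated assumptions: the field names, default values, and the rule itself are hypothetical and are not taken from the Universal-Streaming API or model.

```python
from dataclasses import dataclass


@dataclass
class EndpointingConfig:
    # All names and defaults here are hypothetical, chosen for illustration.
    confidence_threshold: float = 0.7  # minimum end-of-turn confidence
    min_silence_ms: int = 160          # silence required when confidence is high
    max_silence_ms: int = 2400         # silence that always ends the turn


def is_end_of_turn(silence_ms, end_of_turn_confidence, cfg):
    """Return True if the speaker's turn should be treated as finished."""
    # Fallback: a long enough silence ends the turn regardless of confidence.
    if silence_ms >= cfg.max_silence_ms:
        return True
    # Fast path: a confident prediction plus a short silence ends the turn early.
    return (
        end_of_turn_confidence >= cfg.confidence_threshold
        and silence_ms >= cfg.min_silence_ms
    )


cfg = EndpointingConfig()
print(is_end_of_turn(200, 0.9, cfg))   # confident and past the short silence
print(is_end_of_turn(200, 0.3, cfg))   # mid-sentence pause, keep listening
print(is_end_of_turn(2500, 0.1, cfg))  # long silence, fallback applies
```

Raising the confidence threshold or the minimum silence makes the agent more patient; lowering them makes it respond sooner at the risk of cutting the speaker off.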
Superior accuracy where it matters
Accurately capture names, numbers, and business terms so LLM logic stays on track.
- 13% improvement in overall recognition, for higher accuracy across the board.
- 21% fewer alphanumeric errors on email addresses, confirmation codes, phone numbers, and ID numbers.
- 5% improvement in proper noun recognition for names of people, products, and businesses.
Pricing starts at $0.15/hr with unlimited streams
Premium performance comes at a fraction of the cost without capacity planning or surprise fees.
- Transparent pricing across six languages starting at just $0.15/hr.
- Unlimited concurrent streams with no hard caps or over-stream surcharges.
- Consistent performance from 5 to 50,000+ streams, with no degradation and no usage commitments.
Designed for voice experiences that feel more intuitive and responsive
Intelligent Endpointing
Combines acoustic and semantic features with traditional silence detection for faster, more accurate end-of-turn detection.
Automatic Concurrency Scaling
Handle thousands of concurrent connections without manual intervention, eliminating the need for complex connection management.
Developer Toggles
Fine-tune the balance between speed and accuracy with configurable API options for timestamps, formatting, and punctuation.

Enhanced Visibility
Monitor streaming performance metrics in real-time with comprehensive analytics and usage insights.

Auto Punctuation and Casing
Automatically add punctuation and casing, including capitalization of proper nouns, to the transcription text.
Fewer correction loops and smoother conversations
| Model | Overall | Alphanumerics | Proper Nouns |
|---|---|---|---|
| AssemblyAI Universal-Streaming | 91.1% | 94.6% | 91.8% |
| Deepgram Nova-3 | 89.9% | 93.3% | 91.4% |
Ready to plug into your voice‑agent stack
Pre-built integrations with step‑by‑step docs enabling quick implementation without disrupting existing workflows.
The speed difference is immediately noticeable - our users see their conversations transcribed almost instantaneously. It feels so much more responsive than what we were using before.

Frequently Asked Questions
What is streaming speech-to-text?
Streaming speech-to-text transcribes live audio as it's spoken. You send audio over a secure WebSocket to the API, which returns transcripts within a few hundred milliseconds (~300 ms P50). Built for low latency, these models use limited context and apply intelligent endpointing to detect the end of each turn.
Does Universal-Streaming support unlimited concurrent streams?
Yes. Universal-Streaming supports unlimited concurrent streams with automatic scaling and no hard caps. Accounts start with a per-minute limit on new streams (e.g., 100/min on pay-as-you-go) that increases by 10% every 60 seconds while utilization is at or above 70%. If you briefly exceed your current limit, new connections may be rejected with code 1008 until the limit scales up; baseline limits can be raised on request.
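To make the scaling rule concrete, here is a rough worked example of how a 100/min new-stream limit would grow under sustained load. It assumes the 10% increase compounds on the current limit and that the limit is not rounded to a whole number; neither detail is stated above, and the function name is hypothetical.

```python
def next_limit(current_limit, new_streams_last_minute, growth=0.10, trigger=0.70):
    """Return the new-stream limit for the next 60-second window."""
    utilization = new_streams_last_minute / current_limit
    if utilization >= trigger:
        # Assumption: the increase compounds on the current limit.
        return current_limit * (1 + growth)
    return current_limit


limit = 100.0  # example pay-as-you-go starting limit from the text
history = []
for _ in range(5):
    history.append(round(limit, 2))
    # Simulate a fully utilized window (new streams equal to the limit).
    limit = next_limit(limit, new_streams_last_minute=limit)

print(history)
```

Under these assumptions the limit grows by roughly 46% over four minutes of sustained demand, so short bursts above the current limit should clear quickly.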
How do I get started with Universal-Streaming?
Create a free account and get an API key, then connect to wss://streaming.assemblyai.com/v3/ws via an SDK or a WebSocket client. Set sample_rate (e.g., 16000), start a microphone stream, send audio in 50–1000 ms chunks, and handle Begin and Turn events. You'll see transcripts within a few hundred milliseconds. Close the session when done.
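The client-side pieces of those steps can be sketched without any network code. The example below builds the connection URL, splits raw audio into chunks within the 50–1000 ms range, and dispatches on Begin and Turn messages. Several details are assumptions rather than documented behavior: that sample_rate is passed as a query parameter, that audio is 16-bit mono PCM, and that messages are JSON objects with "type" and "transcript" fields. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws"  # endpoint quoted in the text


def build_url(sample_rate=16000):
    # Assumption: sample_rate is supplied as a query parameter.
    return f"{STREAMING_URL}?{urlencode({'sample_rate': sample_rate})}"


def chunk_pcm16(audio, sample_rate=16000, chunk_ms=100):
    """Yield chunks of 16-bit mono PCM audio, each covering chunk_ms of sound."""
    if not 50 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms must be between 50 and 1000")
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(audio), bytes_per_chunk):
        # The final chunk may be shorter than chunk_ms.
        yield audio[start:start + bytes_per_chunk]


def handle_message(raw):
    """Dispatch on the message type (assumed JSON shape, see lead-in)."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "Begin":
        return "session started"
    if kind == "Turn":
        return f"transcript: {msg.get('transcript', '')}"
    return f"ignored: {kind}"


one_second_of_silence = bytes(16000 * 2)
chunks = list(chunk_pcm16(one_second_of_silence))
print(build_url(), len(chunks), len(chunks[0]))
print(handle_message('{"type": "Turn", "transcript": "hello"}'))
```

In a real client, each chunk would be sent over the WebSocket as it is captured from the microphone, and handle_message would be called for every message received.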
How much does Universal-Streaming cost?
Universal-Streaming is $0.15 per hour. Billing is based on total session duration (the time your connection stays open). The optional Keyterms Prompting add-on costs an additional $0.04/hr. The free tier includes up to 333 hours of streaming. Volume discounts and custom pricing are available.
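As a rough guide, session cost can be estimated from the rates above. This sketch assumes billing is linear in connection time with no rounding or minimum charge, which the text does not specify; the function name is hypothetical.

```python
BASE_RATE_PER_HOUR = 0.15      # Universal-Streaming rate from the text
KEYTERMS_RATE_PER_HOUR = 0.04  # optional Keyterms Prompting add-on


def estimate_cost(session_seconds, keyterms=False):
    """Estimate the cost in USD of keeping a connection open for session_seconds."""
    rate = BASE_RATE_PER_HOUR + (KEYTERMS_RATE_PER_HOUR if keyterms else 0.0)
    return session_seconds / 3600 * rate


# A 30-minute call with Keyterms Prompting enabled.
print(f"${estimate_cost(30 * 60, keyterms=True):.3f}")
```

Because billing follows connection time rather than audio sent, closing idle sessions promptly keeps costs in line with actual usage.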
What features does Universal-Streaming include?
Universal-Streaming delivers immutable, low-latency transcripts; intelligent, configurable endpointing using semantic plus acoustic cues; word-level timestamps and confidence; Keyterms Prompting (English) to boost critical vocabulary; and unlimited concurrent streams.
Which languages does Universal-Streaming support?
Universal-Streaming transcribes English by default. For multilingual streaming, use the universal-streaming-multilingual model, which supports English, Spanish, French, German, Italian, and Portuguese (beta). Additional languages are planned for late 2025/early 2026.
Turn voice data into unparalleled product experiences
Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.
