Streaming Speech-to-Text

Power real-time voice experiences with ultra-fast and ultra-accurate speech-to-text, unlimited concurrency, and pricing that scales with you.

See the difference in real-time

Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.

Try saying a company name, like "Granola"...

Clinical history evaluation:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?

Spanish and English audio:
"language_detection": True
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").
Without code-switching prompting

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching prompting

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.

Built with the capabilities for every
real-time use case

With Universal-3 Pro and Universal-Streaming, every use case is covered. Build industry-leading voice agents, or power real-time note-taking with every capability built in.

Features

Compared: AssemblyAI Universal-3 Pro Streaming · AssemblyAI Universal-Streaming · Deepgram Nova-3 · OpenAI GPT-4o Transcribe · Microsoft Azure · ElevenLabs Scribe V2

Average accuracy across entities (lower is better)

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

Universal-3 Pro Streaming: 16.7% · Universal-Streaming: 22.9% · Nova-3: 25.2% · GPT-4o Transcribe: 23.3% · Azure: 25.1% · Scribe V2: 22.1%

Speaker diarization performance

Industry Leading on Universal-3 Pro Streaming; rated Unreliable across competing models.

Unlimited concurrency, no rate limits

Supported on AssemblyAI models.

Dynamic keyterms prompting (turn-by-turn)

Supported on AssemblyAI models; the nearest alternative offers static keyterms only.

Real-time prompting

Supported on Universal-3 Pro Streaming.

Usage-based pricing, no contracts

Supported on AssemblyAI models; alternatives involve commitments and overages, or contracts at scale.

LiveKit / Pipecat / Twilio native support

Supported on AssemblyAI models; partial support from one competitor.

Real-time accuracy where Voice AI actually operates

Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.

Missed Entity Rate: Universal-3 Pro Streaming vs. Universal-Streaming

Lower is better  ·  % of entities not correctly transcribed

Entity category — Universal-3 Pro Streaming / Universal-Streaming (improvement):

Occupation: 8.74% / 10.13% (+1.39)
Temporal: 8.30% / 9.91% (+1.61)
Location: 9.22% / 12.99% (+3.77)
Medical: 14.78% / 19.61% (+4.83)
Organization: 17.06% / 21.41% (+4.35)
Phone: 34.79% / 37.11% (+2.32)
URL: 49.03% / 72.33% (+23.30)
Email: 59.64% / 89.09% (+29.45)

Entity Recognition on actual customer data

Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.

Missed Entity Rate by Category — All Providers

Lower is better  ·  Universal-3-Pro Streaming highlighted

Four entity categories (category labels not captured):

Category 1: AssemblyAI Universal-3-Pro 34.3% · AssemblyAI Universal-2 56.4% · Amazon Transcribe 71.3% · Deepgram Nova-3 62.7% · ElevenLabs Scribe-2 62.1% · Microsoft Azure 63.7% · OpenAI GPT-4o Transcribe 72.1%

Category 2: AssemblyAI Universal-3-Pro 12.0% · AssemblyAI Universal-2 14.7% · Amazon Transcribe 15.9% · Deepgram Nova-3 15.1% · ElevenLabs Scribe-2 15.28% · Microsoft Azure 18.4% · OpenAI GPT-4o Transcribe 13.0%

Category 3: AssemblyAI Universal-3-Pro 19.6% · AssemblyAI Universal-2 23.2% · Amazon Transcribe 22.4% · Deepgram Nova-3 30.0% · ElevenLabs Scribe-2 21.5% · Microsoft Azure 24.2% · OpenAI GPT-4o Transcribe 20.1%

Category 4: AssemblyAI Universal-3-Pro 13.1% · AssemblyAI Universal-2 14.6% · Amazon Transcribe 16.7% · Deepgram Nova-3 16.5% · ElevenLabs Scribe-2 15.3% · Microsoft Azure 17.5% · OpenAI GPT-4o Transcribe 19.4%

Word Error Rate (%) 

Lower is better  ·  English, all domains

AssemblyAI Universal-3-Pro: 8.14%
AssemblyAI Universal-2: 9.02%
ElevenLabs Scribe-2: 9.11%
Microsoft Azure: 9.11%
OpenAI GPT-4o Transcribe: 9.90%
Deepgram Nova-3: 11.06%
Amazon Transcribe: 15.20%

Built for every streaming use case

Every feature engineered for the demands of real voice agent infrastructure.

Industry-leading entity accuracy

Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.

Unlimited concurrency, no rate limits

Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.

Real-time speaker diarization

Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.

Dynamic key term prompting

Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.
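As a minimal sketch, a turn-by-turn key term update could be built as a JSON message reusing the `keyterms_prompt` field shown in the demo above; the exact message shape sent on the open WebSocket is an assumption here, not the documented wire format:

```python
import json

def keyterms_update(terms):
    # Enforce the documented cap of 1,000 boosted terms per update.
    if len(terms) > 1000:
        raise ValueError("at most 1,000 key terms per update")
    # Hypothetical update message reusing the keyterms_prompt field.
    return json.dumps({"keyterms_prompt": list(terms)})
```

The returned string would be sent as a text frame on the open connection before the next speaker turn.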

One-line integrations

Native support for LiveKit, PipeCat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.

Real-time Prompting
Beta

Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.

Sub-200ms end-to-end latency

Transcripts arrive in under 200 ms end-to-end, keeping voice agents responsive enough for natural turn-taking.

Open community models

We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.

Global language coverage

Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.

Ready to plug into your voice‑agent stack

Pre-built integrations with step‑by‑step docs enabling quick implementation without disrupting existing workflows.

More on our models

Universal-Streaming

Create voice experiences that feel more intuitive and responsive while maintaining the flexibility to optimize for your unique requirements.

Learn more

Universal-3 Pro Streaming

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale.

Built for voice agents

Start Building

Explore our comprehensive prompt engineering guide with use case templates, best practices, and an AI-powered prompt generator to optimize accuracy for your application.

Read the docs

Frequently Asked Questions

What is streaming speech-to-text and how does it work?

Streaming speech-to-text transcribes live audio as it’s spoken. You send audio over a secure WebSocket to the API, which returns transcripts within a few hundred milliseconds (~300 ms P50). Built for low latency, these models use limited context and apply intelligent endpointing to detect end‑of‑turns.

Can AssemblyAI handle unlimited concurrent audio streams?

Yes. Universal-Streaming supports unlimited concurrent streams with automatic scaling and no hard caps. Accounts start with per-minute new-stream limits (e.g., 100/min pay‑as‑you‑go) that increase 10% every 60s when ≥70% utilized. If you briefly exceed your current limit, new connections may return 1008 until it scales; baselines can be raised on request.
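The scaling rule above can be illustrated with a toy model (the server-side algorithm may differ; this only mirrors the numbers stated in the answer):

```python
def scaled_stream_limit(base_per_min, elapsed_minutes, utilization):
    # Toy model: the per-minute new-stream limit grows 10% every 60 s
    # while utilization stays at or above 70%.
    limit = float(base_per_min)
    for _ in range(elapsed_minutes):
        if utilization >= 0.70:
            limit *= 1.10
    return int(limit)
```

For example, a pay-as-you-go account starting at 100 new streams/min that stays busy reaches roughly 110/min after one minute and keeps compounding from there.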

How do I get started with AssemblyAI's Streaming API?

Create a free account and get an API key, then connect to wss://streaming.assemblyai.com/v3/ws via SDK or WebSocket. Set sample_rate (e.g., 16000), start a microphone stream, send 50–1000 ms audio chunks, and handle Begin/Turn events. You’ll see transcripts within a few hundred milliseconds. Close the session when done.
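The steps above can be sketched in Python. The endpoint, `sample_rate` parameter, and Begin/Turn event names come from this answer; the JSON field names (`type`, `transcript`) and helper functions are assumptions for illustration:

```python
import json
from urllib.parse import urlencode

ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"

def build_url(sample_rate=16000):
    # Session parameters travel in the query string; sample_rate must
    # match the PCM audio you will send.
    return f"{ENDPOINT}?{urlencode({'sample_rate': sample_rate})}"

def handle_message(raw):
    # Dispatch on the event types named in the FAQ: Begin (session
    # opened) and Turn (a transcript for the current speaker turn).
    msg = json.loads(raw)
    if msg.get("type") == "Begin":
        return "session started"
    if msg.get("type") == "Turn":
        return msg.get("transcript", "")
    return None
```

With any WebSocket client, connect to `build_url()` with an `Authorization` header carrying your API key, stream binary audio chunks of 50–1000 ms, and feed each incoming text frame to `handle_message`.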

How much does AssemblyAI's streaming speech-to-text cost?

Universal-Streaming is $0.15 per hour. Billing is based on total session duration (time your connection stays open). Optional Keyterms Prompting add-on is $0.04/hr. The free tier includes up to 333 hours of streaming. Volume discounts and custom pricing are available.
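Since billing is by session duration, cost is straightforward to estimate from the rates quoted above:

```python
def streaming_cost(session_hours, keyterms=False):
    # $0.15/hr for total session time, plus an optional $0.04/hr
    # Keyterms Prompting add-on.
    rate = 0.15 + (0.04 if keyterms else 0.0)
    return round(session_hours * rate, 2)
```

For example, 100 hours of open sessions costs $15.00, or $19.00 with the Keyterms add-on enabled.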

What streaming features does AssemblyAI support?

Universal-Streaming delivers immutable, low-latency transcripts; intelligent, configurable endpointing using semantic plus acoustic cues; word-level timestamps and confidence; Keyterms Prompting (English) to boost critical vocabulary; and unlimited concurrent streams.

What languages does the streaming speech-to-text API support?

Universal-Streaming transcribes English by default. For multilingual streaming, use the universal-streaming-multilingual model, which supports English, Spanish, French, German, Italian, and Portuguese (beta). Additional languages are planned for late 2025/early 2026.

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.