Streaming Speech-to-Text

Power real-time voice experiences with ultra-fast and ultra-accurate speech-to-text, unlimited concurrency, and pricing that scales with you.

See the difference in real-time

Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.

Try saying a company name, like "Granola"...

Clinical history evaluation:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?

Spanish and English audio:
"language_detection": True
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").
Without code-switching prompting

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching prompting

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.

Built with the capabilities for every
real-time use case

With Universal-3 Pro and Universal-Streaming, every use case is covered. Build industry-leading voice agents, or power real-time note-taking with every capability built in.

Features

Compared: AssemblyAI Universal-3 Pro Streaming · AssemblyAI Universal-Streaming · Deepgram Nova-3 · OpenAI GPT-4o Transcribe · Microsoft Azure · ElevenLabs Scribe V2

Average accuracy across entities (lower is better)

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

Universal-3 Pro Streaming: 16.7% · Universal-Streaming: 22.9% · Nova-3: 25.2% · GPT-4o Transcribe: 23.3% · Azure: 25.1% · Scribe V2: 22.1%

Speaker diarization performance

Industry Leading on Universal-3 Pro Streaming; rated Unreliable across competing models.

Unlimited concurrency, no rate limits

Supported on AssemblyAI models.

Dynamic keyterms prompting (turn-by-turn)

Supported on AssemblyAI models; the nearest alternative offers static keyterms only.

Real-time prompting

Supported on Universal-3 Pro Streaming.

Usage-based pricing, no contracts

Supported on AssemblyAI models; alternatives involve commitments and overages, or contracts at scale.

LiveKit / Pipecat / Twilio native support

Supported on AssemblyAI models; partial support from one competitor.

Real-time accuracy where Voice AI actually operates

Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.

Missed Entity Rate: Universal-3 Pro Streaming vs. Universal-Streaming

Lower is better  ·  % of entities not correctly transcribed

Entity category — Universal-3 Pro Streaming / Universal-Streaming (improvement):

Occupation: 8.74% / 10.13% (+1.39)
Temporal: 8.30% / 9.91% (+1.61)
Location: 9.22% / 12.99% (+3.77)
Medical: 14.78% / 19.61% (+4.83)
Organization: 17.06% / 21.41% (+4.35)
Phone: 34.79% / 37.11% (+2.32)
URL: 49.03% / 72.33% (+23.30)
Email: 59.64% / 89.09% (+29.45)

Entity Recognition on actual customer data

Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.

Missed Entity Rate by Category — All Providers

Lower is better  ·  Universal-3-Pro Streaming highlighted

Four entity categories (category labels not captured):

Category 1: AssemblyAI Universal-3-Pro 34.3% · AssemblyAI Universal-2 56.4% · Amazon Transcribe 71.3% · Deepgram Nova-3 62.7% · ElevenLabs Scribe-2 62.1% · Microsoft Azure 63.7% · OpenAI GPT-4o Transcribe 72.1%

Category 2: AssemblyAI Universal-3-Pro 12.0% · AssemblyAI Universal-2 14.7% · Amazon Transcribe 15.9% · Deepgram Nova-3 15.1% · ElevenLabs Scribe-2 15.28% · Microsoft Azure 18.4% · OpenAI GPT-4o Transcribe 13.0%

Category 3: AssemblyAI Universal-3-Pro 19.6% · AssemblyAI Universal-2 23.2% · Amazon Transcribe 22.4% · Deepgram Nova-3 30.0% · ElevenLabs Scribe-2 21.5% · Microsoft Azure 24.2% · OpenAI GPT-4o Transcribe 20.1%

Category 4: AssemblyAI Universal-3-Pro 13.1% · AssemblyAI Universal-2 14.6% · Amazon Transcribe 16.7% · Deepgram Nova-3 16.5% · ElevenLabs Scribe-2 15.3% · Microsoft Azure 17.5% · OpenAI GPT-4o Transcribe 19.4%

Word Error Rate (%) 

Lower is better  ·  English, all domains

AssemblyAI Universal-3-Pro: 8.14%
AssemblyAI Universal-2: 9.02%
ElevenLabs Scribe-2: 9.11%
Microsoft Azure: 9.11%
OpenAI GPT-4o Transcribe: 9.90%
Deepgram Nova-3: 11.06%
Amazon Transcribe: 15.20%

Built for every streaming use case

Every feature engineered for the demands of real voice agent infrastructure.

Industry-leading entity accuracy

Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.

Unlimited concurrency, no rate limits

Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.

Real-time speaker diarization

Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.

Dynamic key term prompting

Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.
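As a minimal sketch, a turn-by-turn key term update could be built as a JSON message reusing the `keyterms_prompt` field shown in the demo above; the exact message shape sent on the open WebSocket is an assumption here, not the documented wire format:

```python
import json

def keyterms_update(terms):
    # Enforce the documented cap of 1,000 boosted terms per update.
    if len(terms) > 1000:
        raise ValueError("at most 1,000 key terms per update")
    # Hypothetical update message reusing the keyterms_prompt field.
    return json.dumps({"keyterms_prompt": list(terms)})
```

The returned string would be sent as a text frame on the open connection before the next speaker turn.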

One-line integrations

Native support for LiveKit, PipeCat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.

Real-time Prompting
Beta

Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.

Sub-200ms end-to-end latency

Transcripts arrive in under 200 ms end-to-end, keeping voice agents responsive enough for natural turn-taking.

Open community models

We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.

Global language coverage

Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.

Ready to plug into your voice‑agent stack

Pre-built integrations with step‑by‑step docs enabling quick implementation without disrupting existing workflows.

More on our models

Universal-Streaming

Create voice experiences that feel more intuitive and responsive while maintaining the flexibility to optimize for your unique requirements.

Learn more

Universal-3 Pro Streaming

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale.

Built for voice agents

Start Building

Explore our comprehensive prompt engineering guide with use case templates, best practices, and an AI-powered prompt generator to optimize accuracy for your application.

Read the docs

Frequently Asked Questions

What is streaming speech-to-text and how does it work?

Streaming speech-to-text transcribes live audio as it’s spoken. You send audio over a secure WebSocket to the API, which returns transcripts within a few hundred milliseconds (~300 ms P50). Built for low latency, these models use limited context and apply intelligent endpointing to detect end‑of‑turns.

Can AssemblyAI handle unlimited concurrent audio streams?

Yes. Universal-Streaming supports unlimited concurrent streams with automatic scaling and no hard caps. Accounts start with per-minute new-stream limits (e.g., 100/min pay‑as‑you‑go) that increase 10% every 60s when ≥70% utilized. If you briefly exceed your current limit, new connections may return 1008 until it scales; baselines can be raised on request.
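The scaling rule above can be illustrated with a toy model (the server-side algorithm may differ; this only mirrors the numbers stated in the answer):

```python
def scaled_stream_limit(base_per_min, elapsed_minutes, utilization):
    # Toy model: the per-minute new-stream limit grows 10% every 60 s
    # while utilization stays at or above 70%.
    limit = float(base_per_min)
    for _ in range(elapsed_minutes):
        if utilization >= 0.70:
            limit *= 1.10
    return int(limit)
```

For example, a pay-as-you-go account starting at 100 new streams/min that stays busy reaches roughly 110/min after one minute and keeps compounding from there.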

How do I get started with AssemblyAI's Streaming API?

Create a free account and get an API key, then connect to wss://streaming.assemblyai.com/v3/ws via SDK or WebSocket. Set sample_rate (e.g., 16000), start a microphone stream, send 50–1000 ms audio chunks, and handle Begin/Turn events. You’ll see transcripts within a few hundred milliseconds. Close the session when done.
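The steps above can be sketched in Python. The endpoint, `sample_rate` parameter, and Begin/Turn event names come from this answer; the JSON field names (`type`, `transcript`) and helper functions are assumptions for illustration:

```python
import json
from urllib.parse import urlencode

ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"

def build_url(sample_rate=16000):
    # Session parameters travel in the query string; sample_rate must
    # match the PCM audio you will send.
    return f"{ENDPOINT}?{urlencode({'sample_rate': sample_rate})}"

def handle_message(raw):
    # Dispatch on the event types named in the FAQ: Begin (session
    # opened) and Turn (a transcript for the current speaker turn).
    msg = json.loads(raw)
    if msg.get("type") == "Begin":
        return "session started"
    if msg.get("type") == "Turn":
        return msg.get("transcript", "")
    return None
```

With any WebSocket client, connect to `build_url()` with an `Authorization` header carrying your API key, stream binary audio chunks of 50–1000 ms, and feed each incoming text frame to `handle_message`.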

How much does AssemblyAI's streaming speech-to-text cost?

Universal-Streaming is $0.15 per hour. Billing is based on total session duration (time your connection stays open). Optional Keyterms Prompting add-on is $0.04/hr. The free tier includes up to 333 hours of streaming. Volume discounts and custom pricing are available.
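Since billing is by session duration, cost is straightforward to estimate from the rates quoted above:

```python
def streaming_cost(session_hours, keyterms=False):
    # $0.15/hr for total session time, plus an optional $0.04/hr
    # Keyterms Prompting add-on.
    rate = 0.15 + (0.04 if keyterms else 0.0)
    return round(session_hours * rate, 2)
```

For example, 100 hours of open sessions costs $15.00, or $19.00 with the Keyterms add-on enabled.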

What streaming features does AssemblyAI support?

Universal-Streaming delivers immutable, low-latency transcripts; intelligent, configurable endpointing using semantic plus acoustic cues; word-level timestamps and confidence; Keyterms Prompting (English) to boost critical vocabulary; and unlimited concurrent streams.

What languages does the streaming speech-to-text API support?

Universal-Streaming transcribes English by default. For multilingual streaming, use the universal-streaming-multilingual model, which supports English, Spanish, French, German, Italian, and Portuguese (beta). Additional languages are planned for late 2025/early 2026.

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.