Benchmarks

Industry-leading accuracy for streaming speech-to-text.

Benchmarks are an important first step before running your own evaluation. Below are the current benchmarks for our streaming models so you can assess performance across accuracy, latency, and error rates.

Public benchmarks can be misleading due to overfitting and benchmark gaming. We strongly recommend running your own evaluation on your audio data to identify the best model for your use case.

For the full interactive benchmark experience with competitive comparisons, visit assemblyai.com/benchmarks.

Word error rate (WER)

Word Error Rate (WER) is the classical metric for speech-to-text accuracy. It is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference (ground truth) transcript, divided by the total number of words in the reference.
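The definition above can be sketched as a word-level edit distance. This is a minimal illustration, not the scoring code behind the numbers on this page: the function name `word_error_rate` is hypothetical, and it assumes both transcripts have already been normalized and can be split on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(f"{word_error_rate('call me at five five five', 'call me at five nine five'):.3f}")  # 0.167
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.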

WER weights every word equally, so a misrecognized filler word counts the same as a misrecognized email or phone number. For voice agents, we recommend pairing WER with Missed Entity Rate (MER), which measures accuracy on the high-stakes entities (names, emails, phone numbers, and medical terms) that break agent conversion in production.

English benchmarks

Most recent update: March 2026.

| Dataset | Universal-3 Pro WER (%) | Universal Streaming WER (%) | Relative gain vs Universal Streaming |
| --- | --- | --- | --- |
| Overall Performance | Mean: 6.3%; Median: 6.1% | Mean: 8.6%; Median: 7.8% | Mean: 26.7%; Median: 21.8% |
| commonvoice | 6.11% | 11.81% | 48.3% |
| earnings21 | 9.25% | 12.37% | 25.2% |
| librispeech_test_clean | 1.78% | 2.71% | 34.3% |
| librispeech_test_other | 3.11% | 5.82% | 46.6% |
| meanwhile | 5.74% | 6.73% | 14.7% |
| tedlium | 7.50% | 7.81% | 4.0% |
| rev16 | 10.86% | 12.99% | 16.4% |
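The page does not spell out how the relative gain column is calculated, but the published figures are consistent with the relative reduction in WER against Universal Streaming. A small sketch under that assumption (the function name `relative_gain` is hypothetical):

```python
def relative_gain(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction versus the baseline, as a percentage."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100


# commonvoice row: Universal Streaming 11.81%, Universal-3 Pro 6.11%
print(round(relative_gain(11.81, 6.11), 1))  # 48.3
```

A relative gain is not a difference in percentage points: the commonvoice WER drops by 5.70 points, which is 48.3% of the baseline error.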

Multilingual benchmarks

Most recent update: March 2026.

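| Language Code | Language | Universal-3 Pro WER (%) | Universal Streaming WER (%) | Relative gain vs Universal Streaming |
| --- | --- | --- | --- | --- |
| Average | All | 8.49% | 11.74% | 27.7% |
| de | German | 11.79% | 13.99% | 15.7% |
| en | English | 8.43% | 12.94% | 34.9% |
| es | Spanish | 7.63% | 9.81% | 22.2% |
| fr | French | 9.59% | 16.53% | 42.0% |
| it | Italian | 5.60% | 7.36% | 23.9% |
| pt | Portuguese | 7.88% | 9.83% | 19.8% |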

Missed entity rate

For voice agents, the words that break conversion are entities — emails, phone numbers, organizations, names, and medical terms. The Missed Entity Rate (MER) measures how often a model fails to correctly transcribe these high-stakes terms. See Missed Entity Rate for the full definition.

Universal-3 Pro Streaming delivers double-digit relative gains over Universal Streaming across most categories, with the biggest improvements on emails, locations, and medical terms.

| Entity type | Universal-3 Pro MER (%) | Universal Streaming MER (%) | Relative gain vs Universal Streaming |
| --- | --- | --- | --- |
| Job titles | 8.74% | 10.13% | 13.7% |
| Dates and times | 8.30% | 9.91% | 16.2% |
| Locations | 9.22% | 12.99% | 29.0% |
| Medical terms | 14.78% | 19.61% | 24.6% |
| Organization names | 17.06% | 21.41% | 20.3% |
| Phone numbers | 34.79% | 37.11% | 6.3% |
| Email addresses | 59.64% | 89.09% | 33.1% |

Source: AssemblyAI Streaming product page.
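To make the idea of a missed entity concrete, here is a deliberately simplified sketch. It is an assumption for illustration only, not the definition used for the table above (see the Missed Entity Rate page for that): it treats an entity as missed when its exact text does not appear in the transcript, and the name `missed_entity_rate` is hypothetical.

```python
def missed_entity_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Share of reference entities that do not appear verbatim in the hypothesis.

    Assumption: case-insensitive substring matching. A real evaluation would
    likely normalize both sides and align entities more carefully.
    """
    if not reference_entities:
        raise ValueError("at least one reference entity is required")
    hyp = hypothesis.lower()
    missed = sum(1 for entity in reference_entities if entity.lower() not in hyp)
    return missed / len(reference_entities)


entities = ["jane@example.com", "Berlin"]
transcript = "email jane at example dot com from berlin"
print(missed_entity_rate(entities, transcript))  # 0.5
```

In this example the email is counted as missed even though every word was heard, which illustrates why formatted entities such as emails tend to be the hardest category.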

Latency

Most recent update: May 2026.

These figures are measured using in-domain voice agent conversation data from our internal evaluation set. In production, actual latency will vary depending on audio characteristics, silence duration, and turn detection configuration. Latency claims referenced elsewhere in our documentation may reflect different evaluation conditions or measurement approaches.

If you have questions about benchmarking methodology or need help interpreting these numbers for your use case, reach out to our team at support@assemblyai.com.

Emission latency

Time from when a word is spoken to when that word is returned by the API. Best metric for use cases that consume partial transcripts in real time.

| Model | P50 (ms) | P90 (ms) |
| --- | --- | --- |
| Universal Streaming English | 317 | 597 |
| Universal Streaming Multilingual | 303 | 535 |
| Universal-3 Pro Streaming | N/A | N/A |

Universal-3 Pro Streaming does not emit partial transcripts, so emission latency is not applicable.

Time to Complete Transcript (TTCT)

Time from end of speaker turn to receipt of the finalized transcript. Best metric for voice agent responsiveness.

| Model | P50 (ms) | P90 (ms) |
| --- | --- | --- |
| Universal Streaming English | 649 | 990 |
| Universal-3 Pro Streaming | 568 | 829 |
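If you log your own timings, P50 and P90 figures like those above can be derived with a few lines of code. The sketch below is an assumption about one reasonable way to do it, not the measurement pipeline behind this page: it uses the nearest-rank percentile method, and the timestamps are made-up sample data.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical (end_of_turn_ms, final_transcript_received_ms) pairs.
turns = [
    (1000, 1520), (4000, 4480), (7000, 7610), (9000, 9700), (12000, 12550),
    (15000, 15590), (18000, 18830), (21000, 21640), (24000, 24500), (27000, 27990),
]

# TTCT per turn: receipt of the finalized transcript minus end of speaker turn.
ttct = [received - turn_end for turn_end, received in turns]

print(percentile(ttct, 50))  # 590
print(percentile(ttct, 90))  # 830
```

Other percentile methods interpolate between samples and can give slightly different values on small sample sets, so use the same method when comparing providers.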

Time to First Token (TTFT) is a useful metric for LLMs but is not well suited to measuring streaming STT performance. For streaming STT, use emission latency (for partial transcript speed) and TTCT (for voice agent responsiveness).

Latency gaming

In streaming, speed is critical. To achieve lower TTFT (time to first token) metrics, some providers emit tokens before any audio is actually spoken. These early tokens are hallucinations designed to game the benchmark, making TTFT a misleading measure of actual latency.

External benchmarks

For third-party streaming benchmarks, we recommend the Coval Speech-to-Text Playground.

Methodology

Our benchmarks are evaluated across 250+ hours of audio data, 80,000+ audio files, and 26 datasets. We apply standard text normalization before calculating metrics. For full details on our methodology, visit assemblyai.com/benchmarks.
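The exact normalization rules are not listed on this page. As a rough idea of what text normalization typically involves before scoring, the following sketch lowercases text, strips ASCII punctuation, and collapses whitespace. Treat it as an assumed, minimal stand-in (the function name `normalize` is hypothetical), not the normalizer used for these benchmarks.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


print(normalize("  Hello,   World! "))  # hello world
```

Apply the same normalization to both the reference and the model output, so that casing and punctuation differences are not counted as word errors. Note that stripping punctuation this bluntly also alters emails and contractions, which is one reason production normalizers are more elaborate.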

Run your own benchmark

We’d be happy to help. AssemblyAI has a benchmarking tool to help you run a custom evaluation against your real audio files. Contact us for more information.

You can also run your own benchmarks with the Hugging Face framework, which provides a GitHub repo with full instructions.