Benchmarks

Industry-leading accuracy across pre-recorded and streaming speech-to-text.

Benchmarks are an important first step before running your own evaluation. Below are the current benchmarks for our models so you can assess performance across accuracy, latency, and error rates.

Public benchmarks can be misleading due to overfitting and benchmark gaming. We strongly recommend running your own evaluation on your audio data to identify the best model for your use case.

For the full interactive benchmark experience with competitive comparisons, visit assemblyai.com/benchmarks.

Pre-recorded speech-to-text

Word accuracy and error rate

AssemblyAI Universal-3 Pro achieves a mean WER of 6.2% (median 6.5%) on English benchmarks, with a hallucination rate of 0.58%.
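For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation code behind the numbers on this page. The function name `word_error_rate` is hypothetical, and the inputs are assumed to be already normalized and whitespace-tokenized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a transcript that drops one word from a six-word reference scores 1/6, or about 16.7% WER.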

English benchmarks

Most recent update: October 2025.

Dataset                   WER (%)                      Hallucination Rate (%)
Overall Performance       Mean: 6.2% | Median: 6.5%    0.58%
commonvoice               6.51%                        -
earnings21                9.44%                        -
librispeech_test_clean    1.88%                        -
librispeech_test_other    3.10%                        -
meanwhile                 4.48%                        -
tedlium                   7.28%                        -
rev16                     10.42%                       -

Multilingual benchmarks

Most recent update: June 2025. Dataset: FLEURS.

Language Code    Language         WER (%)
Average          All Languages    6.76%
de               German           4.99%
en               English          4.38%
es               Spanish          2.95%
fi               Finnish          10.10%
fr               French           7.71%
hi               Hindi            7.38%
it               Italian          3.29%
ja               Japanese         7.79%
ko               Korean           14.54%
nl               Dutch            7.79%
pl               Polish           6.63%
pt               Portuguese       4.80%
ru               Russian          5.80%
tr               Turkish          8.12%
uk               Ukrainian        7.42%
vi               Vietnamese       9.75%

Hallucinations and consecutive errors

Hallucinations are a critical concern in production STT systems. AssemblyAI reduces hallucinations by 30% compared to Whisper, across three error categories:

  • Fabrications — words inserted that were never spoken
  • Omissions — spoken words that are missing from the transcript
  • Hallucinations — extended sequences of fabricated content
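One rough way to count these categories on your own data is to align each transcript against its reference and inspect the inserted and deleted spans. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment. It is an illustration under stated assumptions, not the method used to produce the figures on this page: the function name `categorize_errors`, the `min_run` threshold of five consecutive inserted words for a hallucination, and the choice of aligner are all hypothetical.

```python
from difflib import SequenceMatcher


def categorize_errors(reference: str, hypothesis: str, min_run: int = 5) -> dict:
    """Count inserted words, omitted words, and long inserted runs.

    min_run is an assumed threshold: an inserted span with at least this
    many consecutive words is counted as one hallucinated run.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    counts = {"fabricated_words": 0, "omitted_words": 0, "hallucinated_runs": 0}

    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would otherwise skew long transcripts.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Words present in the transcript but absent from the reference.
            run_length = j2 - j1
            counts["fabricated_words"] += run_length
            if run_length >= min_run:
                counts["hallucinated_runs"] += 1
        elif tag == "delete":
            # Reference words with no counterpart in the transcript.
            counts["omitted_words"] += i2 - i1
        # "replace" spans are substitutions and are not counted here.
    return counts
```

`SequenceMatcher` finds long matching blocks rather than a minimum edit distance alignment, so counts can differ slightly from those produced by a WER-style aligner.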

Benchmark challenges

Models are often trained on publicly available datasets, sometimes the very same datasets used for evaluation. When this happens, the model overfits to the evaluation set and shows artificially strong performance on standard WER tests. WER on those datasets can therefore be misleading, because real-world performance on unseen audio may be significantly worse.

External benchmarks

For third-party benchmarks, we recommend the Hugging Face ASR Leaderboard. Note that many models listed require self-hosting and lack production features like speaker diarization and automatic language detection.

Streaming speech-to-text

English benchmarks

Most recent update: October 2025.

Dataset                   WER (%)                      Emission Latency (ms)
Overall Performance       Mean: 8.5% | Median: 7.8%    Median: 256.41ms | P90: 579ms
commonvoice               11.81%                       -
earnings21                12.37%                       -
librispeech_test_clean    2.71%                        -
librispeech_test_other    5.82%                        -
meanwhile                 6.73%                        -
tedlium                   7.81%                        -
rev16                     12.99%                       -
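Emission latency in the table is summarized as a median and a 90th percentile (P90). If you collect per-utterance latencies in your own evaluation, a summary of the same shape can be computed as sketched below. The sample values are made up for illustration, the helper name `percentile_nearest_rank` is hypothetical, and the nearest-rank percentile definition is an assumption, since this page does not state which definition was used.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank definition."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-utterance emission latencies in milliseconds.
latencies_ms = [120, 150, 180, 200, 220, 260, 310, 400, 520, 700]

median_ms = statistics.median(latencies_ms)
p90_ms = percentile_nearest_rank(latencies_ms, 90)
print(f"Median: {median_ms}ms | P90: {p90_ms}ms")
```

Reporting P90 alongside the median exposes slow outliers that a median alone would hide.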

Multilingual benchmarks

Most recent update: November 2025.

Language Code    Language         WER (%)    Emission Latency (ms)
Average          All Languages    11.58%     Median: 265ms | P90: 499ms
en               English          12.94%     -
es               Spanish          9.81%      -
de               German           13.99%     -
fr               French           16.53%     -
it               Italian          7.36%      -
pt               Portuguese       9.83%      -

Latency gaming

In streaming, speed is critical. To report a lower time to first token (TTFT), some providers emit tokens before any audio is actually spoken. These early tokens are hallucinations designed to game the benchmark, which makes TTFT a misleading measure of actual latency.
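One way to guard against this in your own tests is to measure first-token latency relative to the moment speech actually begins, and to count any tokens that arrive before that moment. The sketch below assumes you have a ground-truth speech onset time for each test clip and token emission timestamps on the same clock. The `StreamTrace` type and both helper functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamTrace:
    speech_onset_ms: float      # when speech starts in the clip (ground truth)
    token_emit_ms: List[float]  # when each token was emitted (same clock)


def first_token_latency(trace: StreamTrace) -> Optional[float]:
    """Latency of the earliest token relative to speech onset.

    A negative value means a token was emitted before anyone spoke.
    """
    if not trace.token_emit_ms:
        return None
    return min(trace.token_emit_ms) - trace.speech_onset_ms


def premature_tokens(trace: StreamTrace) -> int:
    """Count tokens emitted before speech onset."""
    return sum(1 for t in trace.token_emit_ms if t < trace.speech_onset_ms)


# Hypothetical trace: speech starts at 1000 ms, but one token arrives at 400 ms.
suspicious = StreamTrace(speech_onset_ms=1000.0, token_emit_ms=[400.0, 1250.0, 1400.0])
print(first_token_latency(suspicious), premature_tokens(suspicious))
```

A negative first-token latency or a nonzero premature count signals that a low TTFT figure does not reflect real responsiveness.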

External benchmarks

For third-party streaming benchmarks, we recommend the Coval Speech-to-Text Playground.

Methodology

Our benchmarks are evaluated across 250+ hours of audio data, 80,000+ audio files, and 26 datasets. We apply standard text normalization before calculating metrics. For full details on our methodology, visit assemblyai.com/benchmarks.
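Text normalization matters because differences in casing or punctuation would otherwise count as word errors. The exact normalizer used for these benchmarks is not described here, so the sketch below is only a generic illustration of the idea. The function name `normalize_text` and the specific rules (lowercasing, punctuation removal, whitespace collapsing) are assumptions.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    # Fold compatibility characters (e.g. full-width letters) and lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Treat the typographic apostrophe like the ASCII one so contractions survive.
    text = text.replace("\u2019", "'")
    # Replace anything that is not a word character, whitespace,
    # or an apostrophe with a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

Apply the same normalization to both the reference and the transcript before scoring, otherwise the comparison is not like for like.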

Run your own benchmark

AssemblyAI has a benchmarking tool for running a custom evaluation against your real audio files, and we'd be happy to help you use it. Contact us for more information.

You can also run your own benchmarks by following the Hugging Face framework, which provides a GitHub repo with full instructions.