Benchmarks

Industry-leading accuracy for pre-recorded speech-to-text.

Benchmarks are an important first step before running your own evaluation. Below are the current benchmarks for our pre-recorded models so you can assess performance across accuracy, latency, and error rates.

Public benchmarks can be misleading due to overfitting and benchmark gaming. We strongly recommend running your own evaluation on your audio data to identify the best model for your use case.

For the full interactive benchmark experience with competitive comparisons, visit assemblyai.com/benchmarks.

Word error rate (WER)

Word Error Rate (WER) is the classical metric for speech-to-text accuracy. It counts substitutions, deletions, and insertions against a reference transcript, divided by the total word count in the ground truth. AssemblyAI Universal-3 Pro achieves a mean WER of 5.6% (median 4.9%) on English benchmarks.

WER weights every word equally, so a misrecognized filler word counts the same as a misrecognized email address or medication name. For production voice workflows, we recommend pairing WER with Missed entity rate, which measures accuracy on the high-stakes entities — names, emails, phone numbers, and medical terms — that actually drive end-user outcomes.
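
The WER definition above can be sketched in a few lines of Python. This is a minimal illustration using a standard word-level edit distance, not AssemblyAI's evaluation code; the function name `word_error_rate` is hypothetical, and the sketch assumes both transcripts have already been normalized and can be split on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (one fewer than in the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("please call doctor smith", "please call doctor smyth"))  # 0.25
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis contains many extra words.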

English benchmarks

Most recent update: January 2026.

| Dataset | Universal-3 Pro WER (%) | Universal-2 WER (%) | Relative gain vs Universal-2 |
| --- | --- | --- | --- |
| Overall performance | Mean: 5.6%, Median: 4.9% | Mean: 6.1%, Median: 6.5% | Mean: 8.2%, Median: 24.6% |
| commonvoice | 4.87% | 6.48% | 24.8% |
| earnings21 | 8.80% | 9.37% | 6.1% |
| librispeech_test_clean | 1.52% | 1.68% | 9.5% |
| librispeech_test_other | 2.69% | 3.00% | 10.3% |
| meanwhile | 4.22% | 4.41% | 4.3% |
| tedlium | 6.77% | 7.30% | 7.3% |
| rev16 | 10.29% | 10.32% | 0.3% |
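
The relative gain column is not defined on this page. The sketch below assumes it is the relative error reduction, (Universal-2 WER − Universal-3 Pro WER) / Universal-2 WER, which reproduces the published per-dataset figures when rounded to one decimal place. The helper name `relative_gain` is illustrative.

```python
def relative_gain(baseline_wer: float, new_wer: float) -> float:
    """Relative error reduction of new_wer against baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100


# (Universal-2 WER %, Universal-3 Pro WER %) copied from the table above.
english = {
    "commonvoice": (6.48, 4.87),
    "earnings21": (9.37, 8.80),
    "librispeech_test_clean": (1.68, 1.52),
    "librispeech_test_other": (3.00, 2.69),
    "meanwhile": (4.41, 4.22),
    "tedlium": (7.30, 6.77),
    "rev16": (10.32, 10.29),
}

for dataset, (universal_2, universal_3_pro) in english.items():
    print(f"{dataset}: {relative_gain(universal_2, universal_3_pro):.1f}%")
```

A relative gain is sensitive to the size of the baseline: the 0.16-point improvement on librispeech_test_clean is a larger relative gain than the 0.57-point improvement on earnings21, so it is worth reading the absolute WER columns alongside it.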

Multilingual benchmarks

Most recent update: January 2026. Dataset: FLEURS.

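| Language code | Language | Universal-3 Pro WER (%) | Universal-2 WER (%) | Relative gain vs Universal-2 |
| --- | --- | --- | --- | --- |
| Average | All | 4.58% | 7.42% | 38.3% |
| de | German | 4.88% | 6.22% | 21.5% |
| en | English | - | 4.38% | - |
| es | Spanish | 3.98% | 4.56% | 12.7% |
| fi | Finnish | - | 10.10% | - |
| fr | French | 4.98% | 7.56% | 34.1% |
| hi | Hindi | - | 7.38% | - |
| it | Italian | 3.69% | 4.75% | 22.3% |
| ja | Japanese | - | 7.79% | - |
| ko | Korean | - | 14.54% | - |
| nl | Dutch | - | 7.79% | - |
| pl | Polish | - | 6.63% | - |
| pt | Portuguese | 5.39% | 5.98% | 9.9% |
| ru | Russian | - | 5.80% | - |
| tr | Turkish | - | 8.12% | - |
| uk | Ukrainian | - | 7.42% | - |
| vi | Vietnamese | - | 9.75% | - |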
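
Universal-3 Pro results are listed for five of the sixteen languages in the table, so an average taken over each column's available rows covers different language sets for the two models. A minimal sketch of a like-for-like comparison, restricted to languages where both models have a score, might look like the following. The function name `like_for_like` and the data layout are illustrative assumptions, and `None` stands for the "-" entries.

```python
from statistics import mean

# FLEURS WER (%) copied from the table above as (Universal-3 Pro, Universal-2).
# None marks a language with no published Universal-3 Pro figure.
fleurs = {
    "de": (4.88, 6.22), "en": (None, 4.38), "es": (3.98, 4.56), "fi": (None, 10.10),
    "fr": (4.98, 7.56), "hi": (None, 7.38), "it": (3.69, 4.75), "ja": (None, 7.79),
    "ko": (None, 14.54), "nl": (None, 7.79), "pl": (None, 6.63), "pt": (5.39, 5.98),
    "ru": (None, 5.80), "tr": (None, 8.12), "uk": (None, 7.42), "vi": (None, 9.75),
}


def like_for_like(results):
    """Average both models over the languages where both have a score."""
    shared = [(u3, u2) for u3, u2 in results.values() if u3 is not None and u2 is not None]
    u3_avg = mean(u3 for u3, _ in shared)
    u2_avg = mean(u2 for _, u2 in shared)
    gain = (u2_avg - u3_avg) / u2_avg * 100  # relative error reduction, in percent
    return len(shared), u3_avg, u2_avg, gain


count, u3_avg, u2_avg, gain = like_for_like(fleurs)
print(f"{count} shared languages: {u3_avg:.2f}% vs {u2_avg:.2f}% ({gain:.1f}% relative gain)")
```

On the values above, this restricts the comparison to German, Spanish, French, Italian, and Portuguese, giving averages of about 4.58% and 5.81% and a relative gain of about 21.2% on that shared subset.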

Missed entity rate

For production voice workflows, the actual words that matter most are entities — names, organizations, emails, phone numbers, and medical terms. The Missed Entity Rate (MER) measures how often a model fails to correctly transcribe these high-stakes terms. See Missed Entity Rate for the full definition.

Universal-3 Pro delivers relative gains over Universal-2 across every entity category we track for voice workflows, with the largest improvements on emails, locations, and medical terms.

Entity typeUniversal-3 Pro MER (%)Universal-2 MER (%)Relative gain vs Universal-2
Medical terms13.15%18.43%28.6%
Locations8.61%12.40%30.6%
Job titles9.03%9.86%8.4%
Organization names17.02%20.96%18.8%
Email addresses33.76%53.81%37.3%
Phone numbers13.14%14.69%10.6%
Credit card numbers21.83%25.07%12.9%
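
To make the idea concrete, here is a deliberately simplified sketch of an entity-level miss rate. It is not AssemblyAI's MER definition, which is documented separately. It assumes you already have a list of reference entities for each file and counts an entity as missed when its normalized form does not appear as a whole-word sequence in the normalized hypothesis. The names `missed_entity_rate` and `_normalize` are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    # Lowercase, replace everything except letters, digits, and "@" with spaces,
    # then collapse runs of whitespace.
    return " ".join(re.sub(r"[^a-z0-9@]+", " ", text.lower()).split())


def missed_entity_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities that do not appear in the hypothesis."""
    if not reference_entities:
        raise ValueError("at least one reference entity is required")
    # Pad with spaces so that entities only match on whole-word boundaries.
    padded_hypothesis = f" {_normalize(hypothesis)} "
    missed = sum(
        1 for entity in reference_entities
        if f" {_normalize(entity)} " not in padded_hypothesis
    )
    return missed / len(reference_entities)


entities = ["Dr. Patel", "jane.doe@example.com", "metformin"]
transcript = "Dr. Patel said to email jane.do@example.com about the metformin dose."
print(missed_entity_rate(entities, transcript))  # one of three entities is missed
```

In this example the transcript has only one wrong character, so its WER impact is a single word, yet the email address is unusable. That is the gap an entity-level metric is meant to expose.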

Hallucinations and consecutive errors

Hallucinations are a critical concern in production STT systems. AssemblyAI reduces hallucinations by 30% compared to Whisper, across three error categories:

  • Fabrications — words inserted that were never spoken
  • Omissions — spoken words that are missing from the transcript
  • Hallucinations — extended sequences of fabricated content
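
The three categories above can be approximated from an alignment between a reference and a hypothesis transcript. The sketch below is a rough heuristic built on Python's `difflib.SequenceMatcher`, not the method behind the published figures. The function name `classify_errors` and the five-word threshold for an "extended" fabricated sequence are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def classify_errors(reference: str, hypothesis: str, hallucination_min_words: int = 5) -> dict:
    """Count inserted words, omitted words, and long inserted runs."""
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"fabricated_words": 0, "omitted_words": 0, "hallucinated_runs": 0}
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Hypothesis words with no counterpart in the reference.
            run_length = j2 - j1
            counts["fabricated_words"] += run_length
            if run_length >= hallucination_min_words:
                counts["hallucinated_runs"] += 1  # assumed threshold, see lead-in
        elif tag == "delete":
            # Reference words with no counterpart in the hypothesis.
            counts["omitted_words"] += i2 - i1
        # "replace" spans are substitutions and are not counted here.
    return counts


reference = "thanks for calling how can I help"
hypothesis = "thanks for calling thank you for watching please subscribe how can I help"
print(classify_errors(reference, hypothesis))
```

`SequenceMatcher` looks for long matching blocks rather than a minimum edit distance, so its alignment can differ from the one used for WER. Treat the counts as a screening signal for files worth reviewing by hand.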

Benchmark challenges

Models are often trained on publicly available datasets, sometimes the very same datasets used for evaluation. When this happens, the model overfits to the evaluation set and shows artificially strong performance on standard WER tests. Published WER figures can therefore be misleading, because real-world performance on unseen audio may be noticeably worse.

External benchmarks

For third-party benchmarks, we recommend the Hugging Face ASR Leaderboard. Note that many models listed require self-hosting and lack production features like speaker diarization and automatic language detection.

Methodology

Our benchmarks are evaluated across 250+ hours of audio data, 80,000+ audio files, and 26 datasets. We apply standard text normalization before calculating metrics. For full details on our methodology, visit assemblyai.com/benchmarks.
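
The exact normalization rules are not listed on this page. As an illustration of what such a step typically involves, the sketch below lowercases text, removes punctuation other than apostrophes, and collapses whitespace. The function name `normalize_text` and these specific rules are assumptions, not AssemblyAI's normalizer.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply a simple, assumed normalization before scoring transcripts."""
    # Fold compatibility characters (for example full-width forms) and lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation with spaces, keeping word characters, whitespace,
    # and straight apostrophes so contractions stay intact.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


print(normalize_text("Hello, World!  It's 5 PM."))  # hello world it's 5 pm
```

Whatever rules you choose, apply the same normalizer to the reference and to every model's output. Otherwise differences in casing or punctuation are counted as recognition errors.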

Run your own benchmark

We’d be happy to help. AssemblyAI has a benchmarking tool to help you run a custom evaluation against your real audio files. Contact us for more information.

You can also run your own benchmarks by following the Hugging Face framework, which provides a GitHub repo with full instructions.