Benchmarks
Industry-leading accuracy for streaming speech-to-text.
Benchmarks are an important first step before running your own evaluation. Below are the current benchmarks for our streaming models so you can assess performance across word-level accuracy, entity-level accuracy, and latency.
Public benchmarks can be misleading due to overfitting and benchmark gaming. We strongly recommend running your own evaluation on your audio data to identify the best model for your use case.
For the full interactive benchmark experience with competitive comparisons, visit assemblyai.com/benchmarks.
Word error rate (WER)
Word Error Rate (WER) is the classical metric for speech-to-text accuracy. It is computed as the number of substitutions, deletions, and insertions against a reference transcript, divided by the total word count in the ground truth: WER = (S + D + I) / N.
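As a concrete illustration, here is a minimal, self-contained WER implementation using word-level Levenshtein alignment. This is a sketch for intuition, not the exact scoring pipeline behind the numbers below:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("send the report to anna", "send the report to ana"))  # 0.2
```

In practice, both strings are normalized first (see Methodology below) so that casing and punctuation differences are not counted as errors.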
WER weights every word equally, so a misrecognized filler word counts the same as a misrecognized email or phone number. For voice agents, we recommend pairing WER with Missed entity rate, which measures accuracy on the high-stakes entities — names, emails, phone numbers, and medical terms — that break agent conversion in production.
English benchmarks
Most recent update: March 2026.
Multilingual benchmarks
Most recent update: March 2026.
Missed entity rate
For voice agents, the words that break conversion are entities — emails, phone numbers, organizations, names, and medical terms. The Missed Entity Rate (MER) measures how often a model fails to correctly transcribe these high-stakes terms. See Missed Entity Rate for the full definition.
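As an illustration only, one simple formulation counts an entity as missed when its normalized form does not appear in the normalized hypothesis. The `missed_entity_rate` helper and its substring-matching normalization below are our own simplification; the linked Missed Entity Rate definition is authoritative:

```python
import re

def _norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace (illustrative normalization only)."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9@.+ ]", " ", text.lower())).strip()

def missed_entity_rate(hypothesis: str, entities: list[str]) -> float:
    """Fraction of annotated reference entities absent from the hypothesis."""
    hyp = _norm(hypothesis)
    missed = sum(1 for e in entities if _norm(e) not in hyp)
    return missed / len(entities) if entities else 0.0

# An agent turn with one correctly transcribed name and one mangled email:
hyp = "sure i will email jane doe at jane dot d o e at example dot com"
entities = ["jane doe", "jane.doe@example.com"]
print(missed_entity_rate(hyp, entities))  # 0.5
```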
Universal-3 Pro delivers large improvements over Universal Streaming on the entity types that matter most for voice agents, with the biggest gains on emails and medical terms.
Latency gaming
In streaming, speed is critical. To achieve lower TTFT (time to first token) numbers, some providers emit tokens before any speech has actually occurred. These early tokens are hallucinations designed to game the benchmark, making headline TTFT a misleading measure of actual latency.
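One way to sanity-check TTFT numbers is to compare each token's emission time against when speech actually began in the audio: real tokens cannot precede the speech they transcribe. A minimal sketch, assuming a hypothetical log of token timestamps plus a speech-onset time from a voice activity detector (the `TokenEvent` and `audit_ttft` names are ours, not any provider's API):

```python
from dataclasses import dataclass

@dataclass
class TokenEvent:
    text: str
    emitted_at: float  # seconds since the first audio chunk was sent

def audit_ttft(tokens: list[TokenEvent], speech_onset: float) -> None:
    """Flag tokens emitted before any speech occurred (hypothetical log format)."""
    if not tokens:
        return
    print(f"raw TTFT: {tokens[0].emitted_at:.3f}s")
    suspicious = [t for t in tokens if t.emitted_at < speech_onset]
    honest = [t for t in tokens if t.emitted_at >= speech_onset]
    if suspicious:
        print(f"{len(suspicious)} token(s) emitted before speech onset "
              f"({speech_onset:.3f}s), likely hallucinated to game TTFT")
    if honest:
        print(f"TTFT excluding pre-speech tokens: {honest[0].emitted_at:.3f}s")

audit_ttft(
    [TokenEvent("the", 0.05), TokenEvent("hello", 0.62), TokenEvent("there", 0.80)],
    speech_onset=0.40,
)
```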
External benchmarks
For third-party streaming benchmarks, we recommend the Coval Speech-to-Text Playground.
Methodology
Our benchmarks are evaluated across 250+ hours of audio data, 80,000+ audio files, and 26 datasets. We apply standard text normalization before calculating metrics. For full details on our methodology, visit assemblyai.com/benchmarks.
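"Standard text normalization" typically means removing casing, punctuation, and formatting differences that should not count as recognition errors. A rough sketch of the kind of steps involved (illustrative, not our exact pipeline):

```python
import re

def normalize_transcript(text: str) -> str:
    """Apply typical pre-WER normalization (illustrative, not our exact pipeline)."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)     # drop punctuation, keep apostrophes
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize_transcript("Okay -- let's meet at 3 PM."))
# "okay let's meet at 3 pm"
```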
Run your own benchmark
We’d be happy to help: AssemblyAI provides a benchmarking tool for running a custom evaluation against your real audio files. Contact us for more information.
You can also run your own benchmarks using the Hugging Face evaluation framework, which provides a GitHub repo with full instructions.
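For example, the Hugging Face `evaluate` library ships a WER metric you can run on your own transcripts in a few lines (shown here with toy data):

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")

references = ["my email is jane doe at example dot com"]
predictions = ["my email is jane dough at example dot com"]

# Returns corpus-level WER as a float (total edits / total reference words).
score = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {score:.3f}")
```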