Industry’s most accurate Speech AI models
Examine the performance of our Speech AI models across key metrics including accuracy, word error rate, and more.
Highest Word Accuracy Rate
AssemblyAI’s Universal model leads in accuracy, and is up to 40% more accurate than other speech-to-text models.
Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
---|---|---|---|---|---|---|
English | 93.4% | 89.7% | 90.8% | 91.2% | 91.0% | 92.1% |
Spanish | 94.7% | 93.8% | 91.0% | 92.9% | 93.3% | 94.5% |
German | 92.7% | 90.9% | 86.2% | 91.8% | 89.4% | 92.2% |
Lowest Word Error Rate
Fewer transcription errors are critical to building successful AI applications around voice data, including summaries, customer insights, metadata tagging, action items, and more.
Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
---|---|---|---|---|---|---|
English | 6.6% | 10.3% | 9.2% | 8.8% | 9.0% | 7.9% |
Spanish | 5.3% | 6.2% | 9.0% | 7.1% | 6.7% | 5.5% |
German | 7.3% | 9.1% | 13.8% | 8.2% | 10.6% | 7.8% |
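The two tables above are complementary: each accuracy figure equals 100% minus the corresponding word error rate (for example, 93.4% and 6.6% for English). As a minimal sketch of how WER is conventionally computed, assuming the standard word-level edit distance divided by the number of reference words, the following may help. The function names are illustrative and are not taken from the benchmark's own tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as the complement of WER, matching the tables above."""
    return 1.0 - word_error_rate(reference, hypothesis)


wer = word_error_rate("her jewelry shimmered", "her jewelry shimmering")
print(f"WER: {wer:.1%}, accuracy: {1 - wer:.1%}")  # WER: 33.3%, accuracy: 66.7%
```

Because insertions count as errors, WER can exceed 100% when a hypothesis is much longer than the reference, in which case this definition of accuracy becomes negative.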
Consecutive Error Types per Audio Hour
Universal shows a 30% reduction in hallucination rates compared to Whisper Large-v3. We define hallucinations as five or more consecutive insertions, substitutions, or deletions.
Metrics | AssemblyAI Universal | OpenAI Whisper |
---|---|---|
English | 6.6% | 7.9% |
Omissions | 5.3% | 5.5% |
Hallucinations | 7.3% | 7.8% |
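The definition above (five or more consecutive insertions, substitutions, or deletions) can be sketched in code. This is a hedged illustration, not the benchmark's actual implementation: it assumes a standard word-level edit-distance alignment, and the helper names `align_ops` and `count_error_runs` are hypothetical. Optimal alignments are not unique, so a different tie-breaking rule could produce slightly different run boundaries.

```python
def align_ops(ref, hyp):
    """Align two word lists; return ops from {'ok', 'sub', 'del', 'ins'}."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("ok" if cost == 0 else "sub")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("del")
            i -= 1
        else:
            ops.append("ins")
            j -= 1
    ops.reverse()
    return ops


def count_error_runs(ops, min_len=5):
    """Count maximal runs of at least min_len consecutive non-'ok' ops."""
    runs = 0
    current = 0
    for op in ops:
        if op == "ok":
            if current >= min_len:
                runs += 1
            current = 0
        else:
            current += 1
    if current >= min_len:
        runs += 1
    return runs


reference = "the englishman said nothing".split()
hypothesis = "does that mean we should not have interessant n".split()
print(count_error_runs(align_ops(reference, hypothesis)))  # 1
```

In this example no hypothesis word matches the reference, so all nine alignment operations are errors and form a single run that meets the five-error threshold.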
Example transcripts illustrating these errors, shown alongside the ground truth:
Ground-truth | AssemblyAI Universal | OpenAI Whisper |
---|---|---|
her jewelry shimmered | her jewelry shimmering | hadja luis sima addjilu sime subtitles by the amara org community |
the taebaek mountain chain is often considered the backbone of the korean peninsula | the tabet mountain chain is often considered the backbone of the korean venezuela | the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루 |
the englishman said nothing | there's an englishman said nothing | does that mean we should not have interessant n |
not in a month of sundays | marine a month of sundays | this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant |
English Word Error Rate per dataset
Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
---|---|---|---|---|---|---|
CommonVoice v5.1 | 6.67% | 8.98% | 17.59% | 7.81% | 12.43% | 8.83% |
Meanwhile | 4.77% | 7.27% | 11.67% | 6.73% | 5.56% | 9.75% |
Noisy | 10.28% | 27.62% | 26.63% | 16.13% | 16.03% | 11.86% |
Podcast | 8.36% | 10.55% | 14.22% | 9.74% | 9.12% | 8.93% |
Telephony (internal) | 11.02% | 16.08% | 23.83% | 16.04% | 13.22% | 12.55% |
LibriSpeech Clean | 1.72% | 2.87% | 6.46% | 2.74% | 3.13% | 2.22% |
LibriSpeech Test-Other | 3.08% | 6.64% | 13.20% | 6.15% | 7.36% | 4.09% |
Broadcast (internal) | 4.24% | 5.92% | 8.61% | 6.01% | 5.53% | 4.55% |
Earnings 2021 | 9.43% | 8.33% | 14.56% | 7.58% | 11.05% | 9.55% |
CORAAL | 16.20% | 19.62% | 33.71% | 17.97% | 16.86% | 19.08% |
TEDLIUM | 7.21% | 9.12% | 11.69% | 9.27% | 8.98% | 7.30% |
Average | 7.54% | 11.18% | 16.56% | 9.65% | 9.93% | 8.97% |
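For the AssemblyAI Universal column, the Average row matches the unweighted mean of the eleven per-dataset figures. The short check below reproduces that arithmetic using only the values from the table; the variable names are illustrative.

```python
# Per-dataset English WER (%) for AssemblyAI Universal, copied from the table.
universal_wer = {
    "CommonVoice v5.1": 6.67,
    "Meanwhile": 4.77,
    "Noisy": 10.28,
    "Podcast": 8.36,
    "Telephony (internal)": 11.02,
    "LibriSpeech Clean": 1.72,
    "LibriSpeech Test-Other": 3.08,
    "Broadcast (internal)": 4.24,
    "Earnings 2021": 9.43,
    "CORAAL": 16.20,
    "TEDLIUM": 7.21,
}

# Unweighted mean: every dataset counts equally, regardless of its size.
mean_wer = sum(universal_wer.values()) / len(universal_wer)
print(f"{mean_wer:.2f}%")  # 7.54%
```

An unweighted mean treats a small dataset the same as a large one, so it can differ from a WER computed over all files pooled together.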
Spanish Word Error Rate per dataset
Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
---|---|---|---|---|---|---|
CommonVoice v9 | 4.0% | 4.3% | 7.1% | 6.4% | 7.3% | 4.7% |
Private | 6.6% | 6.9% | 12.3% | 9.5% | 7.4% | 7.3% |
Multilingual LS | 3.9% | 4.1% | 9.9% | 6.1% | 4.3% | 4.3% |
Voxpopuli | 8.2% | 8.7% | 10.7% | 8.9% | 8.8% | 8.6% |
Fleurs | 3.6% | 7.1% | 5.0% | 4.8% | 5.8% | 2.8% |
Average | 5.3% | 6.2% | 9.0% | 7.1% | 6.7% | 5.5% |
German Word Error Rate per dataset
Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
---|---|---|---|---|---|---|
CommonVoice v9 | 4.2% | 5.7% | 10.3% | 6.0% | 9.2% | 5.9% |
Private | 8.2% | 10.6% | 15.2% | 9.2% | 10.3% | 8.5% |
Voxpopuli | 12.6% | 14.7% | 17.4% | 12.5% | 14.8% | 11.2% |
Fleurs | 7.9% | 11.3% | 12.1% | 8.8% | 11.5% | 7.3% |
Multilingual LS | 3.8% | 3.2% | 14.1% | 4.5% | 7.2% | 6.2% |
Average | 7.3% | 9.1% | 13.8% | 8.2% | 10.6% | 7.8% |
Benchmark Report Methodology
This benchmark was performed using 3 open-source datasets (LibriSpeech, Rev16, and Meanwhile) and 4 in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio data covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse set of English audio that spans phone calls, broadcasts, accented speech, and heavy jargon.
We measured the performance of each provider on 7 datasets. For each vendor, we made API calls to their most accurate model for each file; for Whisper, we generated outputs using a self-hosted instance of Whisper-Large-V3. After receiving results for each file from each vendor, we normalized both the model's prediction and the ground-truth transcript using the open-source Whisper Normalizer. We then computed the metrics for each file and averaged them across datasets to measure performance.
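The scoring steps described above (normalize both transcripts, score each file, then average) might look roughly like the sketch below. Several parts are assumptions: `simple_normalize` is a simplified, hypothetical stand-in for the Whisper Normalizer (it only lowercases and strips punctuation), the averaging is taken as the mean of per-file WER within each dataset, and the `"demo"` dataset name and sample sentences are purely illustrative.

```python
import re
from collections import defaultdict
from statistics import mean


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in for the Whisper Normalizer (much simpler)."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())          # collapse repeated whitespace


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def average_wer_by_dataset(samples):
    """samples: iterable of (dataset, ground_truth, prediction) tuples."""
    per_dataset = defaultdict(list)
    for dataset, truth, prediction in samples:
        score = wer(simple_normalize(truth), simple_normalize(prediction))
        per_dataset[dataset].append(score)
    # Assumption: each file contributes equally to its dataset's average.
    return {name: mean(scores) for name, scores in per_dataset.items()}


# Illustrative inputs only; not taken from the benchmark's data files.
samples = [
    ("demo", "Her jewelry shimmered.", "her jewelry shimmering"),
    ("demo", "Not in a month of Sundays!", "not in a month of sundays"),
]
result = average_wer_by_dataset(samples)
print(result)
```

Normalizing both sides before scoring keeps differences in casing and punctuation from being counted as word errors, which is why the second sample scores a WER of zero here.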
