Industry’s most accurate Speech AI models

Examine the performance of our Speech AI models across key metrics including accuracy, word error rate, and more.

Highest Word Accuracy Rate

AssemblyAI’s Universal model leads in accuracy, and is up to 40% more accurate than other speech-to-text models.

| Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| English | 93.4% | 89.7% | 90.8% | 91.2% | 91.0% | 92.1% |
| Spanish | 94.7% | 93.8% | 91.0% | 92.9% | 93.3% | 94.5% |
| German | 92.7% | 90.9% | 86.2% | 91.8% | 89.4% | 92.2% |
Average across all datasets

Lowest Word Error Rate

Fewer errors are critical to building successful AI applications around voice data—including summaries, customer insights, metadata tagging, action items, and more.

| Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| English | 6.6% | 10.3% | 9.2% | 8.8% | 9.0% | 7.9% |
| Spanish | 5.3% | 6.2% | 9.0% | 7.1% | 6.7% | 5.5% |
| German | 7.3% | 9.1% | 13.8% | 8.2% | 10.6% | 7.8% |
Average across all datasets
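
The accuracy and error-rate figures above are consistent with the standard definition of word error rate, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference transcript. The page does not state the formula, so treat this as the conventional definition rather than a confirmed one:

```latex
\mathrm{WER} = \frac{S + D + I}{N}, \qquad \text{Word Accuracy} = 1 - \mathrm{WER}
```

For example, Universal's English WER of 6.6% corresponds to a word accuracy of 100% - 6.6% = 93.4%.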

Consecutive Error Types per Audio Hour

Universal shows a 30% reduction in hallucination rates compared to Whisper Large-v3. We define hallucinations as five or more consecutive insertions, substitutions, or deletions.

| Metrics | AssemblyAI Universal | OpenAI Whisper |
| --- | --- | --- |
| Fabrications | 6.6% | 7.9% |
| Omissions | 5.3% | 5.5% |
| Hallucinations | 7.3% | 7.8% |
Average across all datasets
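
The definition above (five or more consecutive insertions, substitutions, or deletions) can be sketched as a run count over a word-level alignment. This is a minimal illustration, not AssemblyAI's implementation: the function names `align_ops` and `count_error_runs` are hypothetical, the alignment is a plain Levenshtein backtrace, and the page does not say how alignment ties are broken or how counts are scaled per audio hour.

```python
from typing import List


def align_ops(reference: List[str], hypothesis: List[str]) -> List[str]:
    """Return one minimum-edit alignment as a list of operations.

    'ok'  = words match, 'sub' = substitution,
    'del' = reference word missing from hypothesis,
    'ins' = extra word in hypothesis.
    """
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("ok" if cost == 0 else "sub")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("del")
            i -= 1
        else:
            ops.append("ins")
            j -= 1
    ops.reverse()
    return ops


def count_error_runs(ops: List[str], min_run: int = 5) -> int:
    """Count maximal runs of at least `min_run` consecutive non-'ok' operations."""
    runs = 0
    current = 0
    for op in ops:
        if op == "ok":
            if current >= min_run:
                runs += 1
            current = 0
        else:
            current += 1
    if current >= min_run:  # a run that reaches the end of the transcript
        runs += 1
    return runs


ground_truth = "her jewelry shimmered".split()
close_output = "her jewelry shimmering".split()
drifting_output = (
    "hadja luis sima addjilu sime subtitles by the amara org community".split()
)

print(count_error_runs(align_ops(ground_truth, close_output)))     # 0
print(count_error_runs(align_ops(ground_truth, drifting_output)))  # 1
```

A single wrong word gives a run of length one, which stays below the threshold, while an output that shares no words with the reference forms one long run and is counted once.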

| Ground-truth | AssemblyAI Universal | OpenAI Whisper |
| --- | --- | --- |
| her jewelry shimmered | her jewelry shimmering | hadja luis sima addjilu sime subtitles by the amara org community |
| the taebaek mountain chain is often considered the backbone of the korean peninsula | the tabet mountain chain is often considered the backbone of the korean venezuela | the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루 |
| the englishman said nothing | there's an englishman said nothing | does that mean we should not have interessant n |
| not in a month of sundays | marine a month of sundays | this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant |

English Word Error Rate per dataset

| Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice v5.1 | 6.67% | 8.98% | 17.59% | 7.81% | 12.43% | 8.83% |
| Meanwhile | 4.77% | 7.27% | 11.67% | 6.73% | 5.56% | 9.75% |
| Noisy | 10.28% | 27.62% | 26.63% | 16.13% | 16.03% | 11.86% |
| Podcast | 8.36% | 10.55% | 14.22% | 9.74% | 9.12% | 8.93% |
| Telephony (internal) | 11.02% | 16.08% | 23.83% | 16.04% | 13.22% | 12.55% |
| LibriSpeech Clean | 1.72% | 2.87% | 6.46% | 2.74% | 3.13% | 2.22% |
| LibriSpeech Test-Other | 3.08% | 6.64% | 13.20% | 6.15% | 7.36% | 4.09% |
| Broadcast (internal) | 4.24% | 5.92% | 8.61% | 6.01% | 5.53% | 4.55% |
| Earnings 2021 | 9.43% | 8.33% | 14.56% | 7.58% | 11.05% | 9.55% |
| CORAAL | 16.20% | 19.62% | 33.71% | 17.97% | 16.86% | 19.08% |
| TEDLIUM | 7.21% | 9.12% | 11.69% | 9.27% | 8.98% | 7.30% |
| Average | 7.54% | 11.18% | 16.56% | 9.65% | 9.93% | 8.97% |

Spanish Word Error Rate per dataset

| Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice v9 | 4.0% | 4.3% | 7.1% | 6.4% | 7.3% | 4.7% |
| Private | 6.6% | 6.9% | 12.3% | 9.5% | 7.4% | 7.3% |
| Multilingual LS | 3.9% | 4.1% | 9.9% | 6.1% | 4.3% | 4.3% |
| Voxpopuli | 8.2% | 8.7% | 10.7% | 8.9% | 8.8% | 8.6% |
| Fleurs | 3.6% | 7.1% | 5.0% | 4.8% | 5.8% | 2.8% |
| Average | 5.3% | 6.2% | 9.0% | 7.1% | 6.7% | 5.5% |

German Word Error Rate per dataset

| Dataset | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice v9 | 4.2% | 5.7% | 10.3% | 6.0% | 9.2% | 5.9% |
| Private | 8.2% | 10.6% | 15.2% | 9.2% | 10.3% | 8.5% |
| Voxpopuli | 12.6% | 14.7% | 17.4% | 12.5% | 14.8% | 11.2% |
| Fleurs | 7.9% | 11.3% | 12.1% | 8.8% | 11.5% | 7.3% |
| Multilingual LS | 3.8% | 3.2% | 14.1% | 4.5% | 7.2% | 6.2% |
| Average | 7.3% | 9.1% | 13.8% | 8.2% | 10.6% | 7.8% |

Benchmark Report Methodology

250+ hours of audio data
80,000+ audio files
26 datasets

Datasets

This benchmark was performed using 3 open-source datasets (LibriSpeech, Rev16, and Meanwhile) and 4 in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio data covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse set of English audio that spans phone calls, broadcasts, accented speech, and heavy jargon.

Methodology

We measured the performance of each provider on 7 datasets. For each vendor, we made API calls to their most accurate model for each file, and for Whisper we generated outputs using a self-hosted instance of Whisper-Large-V3. After receiving results for each file from each vendor, we normalized both the prediction from the model and the ground truth transcript using the open-source Whisper Normalizer. From there, we calculated the average metrics for each file across datasets to measure performance.
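
The steps described above (normalize both transcripts, score each file, then average) can be sketched as follows. This is an illustration under stated assumptions, not the benchmark code: `simple_normalize` is a hypothetical stand-in that only lowercases and strips punctuation (it is not the Whisper Normalizer), the dataset names are made up, and averaging per-file WER within each dataset is one reading of "average metrics for each file across datasets".

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in normalizer: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep letters, digits, apostrophes
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
            ))
        prev = curr
    return prev[-1] / len(ref)


def average_wer_by_dataset(
    samples: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """samples: (dataset name, ground-truth transcript, model output) per file."""
    per_dataset = defaultdict(list)
    for dataset, reference, hypothesis in samples:
        score = word_error_rate(
            simple_normalize(reference), simple_normalize(hypothesis)
        )
        per_dataset[dataset].append(score)
    return {name: mean(scores) for name, scores in per_dataset.items()}


# Made-up dataset names; transcripts borrowed from the examples on this page.
samples = [
    ("demo_a", "The Englishman said nothing.", "There's an Englishman said nothing"),
    ("demo_a", "Not in a month of Sundays", "not in a month of sundays"),
    ("demo_b", "Her jewelry shimmered", "her jewelry shimmering"),
]
print(average_wer_by_dataset(samples))
```

Averaging per file gives every file equal weight regardless of its length. Pooling all errors and all reference words before dividing would weight long files more heavily, and the page does not say which variant was used.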

Turn voice data into unparalleled product experiences

Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.