Industry’s most accurate Speech AI models

Examine the performance of our Speech AI models across key metrics including accuracy, word error rate, and more.

Highest accuracy

AssemblyAI’s Conformer-2 model leads in accuracy and is up to 40% more accurate than other models.


[Chart: Word Accuracy Rate and Word Information Preservation Rate by provider (AssemblyAI, Whisper, Azure, Deepgram, Rev AI, AWS, Google)]

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI | AWS Transcribe | Google Latest |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Accuracy Rate | 93.06% | 91.87% | 91.59% | 90.96% | 90.72% | 89.98% | 88.36% |
| Word Information Preservation Rate | 90.62% | 90.40% | 88.91% | 88.35% | 87.95% | 87.18% | 84.02% |

Average across all datasets

Lowest errors

Fewer errors are critical to building successful AI applications around voice data, including summaries, metadata tagging, action items, and more.

[Chart: Word Error Rate, Hallucinations, and Substitutions by provider (AssemblyAI, Whisper, Azure, Deepgram, Rev AI, AWS, Google)]

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI | AWS Transcribe | Google Latest |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 6.94% | 7.29% | 8.41% | 9.04% | 9.28% | 10.02% | 11.64% |
| Hallucination Rate | 2.10% | 2.43% | 3.58% | 3.74% | 4.12% | 3.01% | 3.40% |
| Substitution Rate | 2.94% | 2.84% | 3.54% | 3.14% | 3.92% | 3.61% | 5.56% |

Average across all datasets
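
One way to relate these averages to the "up to 40%" headline figure is relative error reduction. The sketch below is an assumption about how such a figure could be derived, not a statement of how AssemblyAI computed it. The helper name `relative_error_reduction` is hypothetical, and the inputs are the average Word Error Rates listed above.

```python
def relative_error_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Percentage of the baseline's errors that the candidate avoids."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer * 100.0


# Average Word Error Rate (%) across all datasets, copied from the table above.
average_wer = {
    "AssemblyAI Conformer-2": 6.94,
    "Whisper Large V3": 7.29,
    "Azure Cognitive Services STT": 8.41,
    "Deepgram Nova 2": 9.04,
    "Rev AI": 9.28,
    "AWS Transcribe": 10.02,
    "Google Latest": 11.64,
}

candidate = average_wer["AssemblyAI Conformer-2"]

# Compare Conformer-2 against every other model in the table.
reductions = {
    name: relative_error_reduction(wer, candidate)
    for name, wer in average_wer.items()
    if name != "AssemblyAI Conformer-2"
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% fewer errors with Conformer-2")
```

With these inputs the largest reduction is against Google Latest, at roughly 40%, which would be consistent with the headline figure under this reading.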

Latency

Lower latency results in faster turnaround times. Learn about AssemblyAI's recent latency updates and commitment to continuous improvement.

[Chart: turnaround time in seconds (0s to 1500s) for 15, 30, 60, and 120 min files, by provider (Deepgram, AssemblyAI, Large V3, Rev AI, AWS, Google, Azure)]

| File Audio Duration | Deepgram Nova 2 | AssemblyAI Conformer-2 | Whisper Large V3 | Rev AI | AWS Transcribe | Google Latest | Azure Cognitive Services STT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 minutes | 3 | 23 | 17 | 83 | 46 | 193 | 193 |
| 30 minutes | 7 | 37 | 58 | 78 | 115 | 436 | 444 |
| 60 minutes | 9 | 45 | 124 | 122 | 137 | 834 | 962 |
| 120 minutes | 22 | 87 | 303 | 183 | 266 | 1658 | 2089 |

Average turnaround time in seconds
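
Turnaround times can also be read as a speed-up over real time, that is, audio duration divided by processing time. The sketch below does that arithmetic for the AssemblyAI Conformer-2 column. `realtime_speedup` is a hypothetical helper, and the numbers are copied from the table above.

```python
def realtime_speedup(audio_minutes: float, turnaround_seconds: float) -> float:
    """How many times faster than real time a file was processed."""
    if turnaround_seconds <= 0:
        raise ValueError("turnaround_seconds must be positive")
    return audio_minutes * 60.0 / turnaround_seconds


# File duration in minutes -> average turnaround in seconds (Conformer-2 column).
conformer2_turnaround = {15: 23, 30: 37, 60: 45, 120: 87}

speedups = {
    minutes: realtime_speedup(minutes, seconds)
    for minutes, seconds in conformer2_turnaround.items()
}

for minutes, factor in speedups.items():
    print(f"{minutes:>3} min file: {factor:.1f}x faster than real time")
```

Under this reading, the 60 minute file is processed about 80 times faster than real time.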

Benchmark overview

Explore additional benchmarks to see how AssemblyAI compares to other models.

Broadcast: Public broadcast data

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 3.97% | 4.53% | 8.15% | 6.13% | 6.04% | 6.25% | 5.74% |
| Word Accuracy Rate | 96.03% | 95.47% | 91.85% | 93.87% | 93.96% | 93.75% | 94.26% |
| Match Error Rate | 3.93% | 4.44% | 7.92% | 5.81% | 5.73% | 6.04% | 5.52% |
| Word Information Loss | 5.41% | 5.99% | 11.14% | 7.66% | 7.65% | 8.00% | 7.20% |
| Word Information Preservation Rate | 94.59% | 94.01% | 88.86% | 92.38% | 92.27% | 92.00% | 92.80% |
| Deletion Rate | 1.53% | 1.18% | 2.15% | 1.92% | 1.18% | 1.25% | 0.62% |
| Hallucination Rate | 0.94% | 1.76% | 2.60% | 2.16% | 2.79% | 2.96% | 3.35% |
| Substitution Rate | 1.51% | 1.60% | 3.40% | 1.90% | 1.99% | 2.04% | 1.76% |

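Webinar: Webinar recordings

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 4.86% | 6.44% | 10.91% | 10.84% | 10.19% | 10.13% | 9.98% |
| Word Accuracy Rate | 95.14% | 93.56% | 89.09% | 89.16% | 89.81% | 89.87% | 90.02% |
| Match Error Rate | 4.75% | 6.21% | 10.21% | 10.09% | 9.57% | 9.37% | 9.21% |
| Word Information Loss | 6.28% | 8.01% | 13.34% | 13.01% | 12.19% | 11.50% | 11.18% |
| Word Information Preservation Rate | 93.72% | 91.99% | 86.66% | 87.59% | 88.87% | 88.50% | 88.82% |
| Deletion Rate | 1.42% | 1.36% | 1.51% | 0.82% | 1.00% | 0.90% | 0.63% |
| Hallucination Rate | 1.88% | 3.22% | 6.03% | 6.71% | 5.88% | 6.94% | 7.24% |
| Substitution Rate | 1.56% | 1.86% | 3.37% | 3.17% | 2.80% | 2.29% | 2.11% |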

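Noisy: Audio files from noisy environments

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 11.94% | 13.12% | 21.54% | 27.53% | 16.44% | 17.17% | 27.53% |
| Word Accuracy Rate | 88.06% | 86.88% | 78.46% | 72.47% | 83.56% | 82.83% | 72.47% |
| Match Error Rate | 11.20% | 12.40% | 20.26% | 26.30% | 15.14% | 15.76% | 26.30% |
| Word Information Loss | 14.49% | 16.38% | 27.02% | 31.42% | 15.14% | 20.75% | 31.42% |
| Word Information Preservation Rate | 85.51% | 83.62% | 72.98% | 68.44% | 79.52% | 79.25% | 68.44% |
| Deletion Rate | 4.54% | 3.49% | 7.72% | 15.30% | 2.99% | 3.79% | 15.30% |
| Hallucination Rate | 3.84% | 5.33% | 6.18% | 6.49% | 7.67% | 7.77% | 6.49% |
| Substitution Rate | 3.56% | 4.29% | 7.64% | 5.73% | 5.77% | 5.61% | 5.73% |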

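Meanwhile: 64 Segments from the Late Show with Stephen Colbert

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 6.75% | 9.28% | 10.01% | 7.29% | 6.23% | 6.49% | 7.73% |
| Word Accuracy Rate | 93.25% | 90.72% | 89.99% | 92.71% | 93.77% | 93.51% | 92.27% |
| Match Error Rate | 6.67% | 9.01% | 9.87% | 7.18% | 6.16% | 6.41% | 7.62% |
| Word Information Loss | 10.43% | 13.33% | 15.78% | 11.46% | 10.08% | 9.79% | 12.46% |
| Word Information Preservation Rate | 89.57% | 86.67% | 84.22% | 88.53% | 89.79% | 90.21% | 87.54% |
| Deletion Rate | 1.68% | 2.33% | 2.28% | 1.47% | 1.25% | 1.92% | 1.39% |
| Hallucination Rate | 1.09% | 2.30% | 1.28% | 1.18% | 4.19% | 0.98% | 1.10% |
| Substitution Rate | 3.98% | 4.65% | 6.46% | 4.59% | 4.19% | 3.59% | 5.23% |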

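Rev16: A collection of 16 podcasts from Rev.AI’s Podcast Transcription Benchmark

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 10.77% | 9.23% | 11.47% | 8.68% | 9.50% | 10.39% | 8.46% |
| Word Accuracy Rate | 89.23% | 90.77% | 88.53% | 91.32% | 90.50% | 89.61% | 91.54% |
| Match Error Rate | 10.59% | 9.01% | 11.24% | 8.49% | 9.26% | 9.99% | 8.09% |
| Word Information Loss | 13.72% | 11.83% | 15.90% | 11.70% | 12.82% | 13.64% | 11.44% |
| Word Information Preservation Rate | 86.28% | 88.17% | 84.10% | 87.46% | 86.83% | 86.36% | 88.56% |
| Deletion Rate | 6.20% | 4.56% | 4.77% | 3.94% | 3.96% | 3.63% | 2.17% |
| Hallucination Rate | 1.30% | 1.73% | 1.73% | 1.78% | 1.89% | 2.87% | 2.70% |
| Substitution Rate | 3.27% | 2.94% | 4.98% | 3.37% | 3.77% | 3.89% | 3.58% |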

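LibriSpeech (Test, Clean): English speech that is acoustically clean or relatively free from background noise and distortion

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 3.58% | 2.49% | 5.85% | 2.89% | 2.82% | 3.34% | 5.21% |
| Word Accuracy Rate | 96.42% | 97.51% | 94.15% | 97.11% | 97.18% | 96.66% | 94.79% |
| Match Error Rate | 3.45% | 2.41% | 5.69% | 2.75% | 2.75% | 3.28% | 5.03% |
| Word Information Loss | 5.50% | 3.73% | 9.21% | 4.44% | 4.40% | 5.16% | 8.16% |
| Word Information Preservation Rate | 94.50% | 96.27% | 90.79% | 95.59% | 95.64% | 94.84% | 91.84% |
| Deletion Rate | 0.60% | 0.31% | 0.85% | 0.28% | 0.42% | 0.63% | 0.60% |
| Hallucination Rate | 0.43% | 0.55% | 0.64% | 0.50% | 0.34% | 0.32% | 0.80% |
| Substitution Rate | 2.54% | 1.62% | 4.36% | 2.09% | 2.05% | 2.40% | 3.81% |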

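LibriSpeech (Test, Other): English speech that is more acoustically challenging, with higher levels of background noise, varying accents, or other acoustic peculiarities

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 6.72% | 5.03% | 12.60% | 6.62% | 6.51% | 8.63% | 10.19% |
| Word Accuracy Rate | 93.28% | 94.97% | 87.40% | 93.38% | 93.49% | 91.37% | 89.81% |
| Match Error Rate | 6.51% | 4.84% | 12.11% | 6.31% | 6.27% | 8.33% | 9.87% |
| Word Information Loss | 10.43% | 7.60% | 19.00% | 10.11% | 10.07% | 12.35% | 15.39% |
| Word Information Preservation Rate | 89.57% | 92.40% | 81.00% | 89.86% | 89.90% | 87.65% | 84.61% |
| Deletion Rate | 0.94% | 0.51% | 1.66% | 0.59% | 0.69% | 2.08% | 1.66% |
| Hallucination Rate | 0.76% | 0.92% | 1.42% | 0.95% | 0.84% | 0.84% | 1.15% |
| Substitution Rate | 5.01% | 3.61% | 9.53% | 5.10% | 4.99% | 5.71% | 7.38% |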

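Audio: A broad set of audio data including phone calls, radio shows, & more

| Metrics | AssemblyAI Conformer-2 | Whisper Large V3 | Google Latest | AWS Transcribe | Azure Cognitive Services STT | Deepgram Nova 2 | Rev AI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Word Error Rate | 6.91% | 8.20% | 12.54% | 10.21% | 9.57% | 9.94% | 9.40% |
| Word Accuracy Rate | 93.09% | 91.80% | 87.46% | 89.79% | 90.43% | 90.06% | 90.60% |
| Match Error Rate | 6.73% | 7.89% | 11.98% | 9.75% | 9.06% | 9.33% | 8.82% |
| Word Information Loss | 8.75% | 9.95% | 16.42% | 12.53% | 11.69% | 12.04% | 11.24% |
| Word Information Preservation Rate | 91.25% | 90.05% | 83.58% | 87.63% | 88.49% | 87.96% | 88.76% |
| Deletion Rate | 2.58% | 2.45% | 3.37% | 2.85% | 1.64% | 1.46% | 1.00% |
| Hallucination Rate | 2.26% | 3.63% | 4.39% | 4.33% | 5.01% | 5.59% | 5.80% |
| Substitution Rate | 2.07% | 2.12% | 4.78% | 2.91% | 2.79% | 2.90% | 2.60% |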

Benchmark Report Methodology

85+ hours of audio data

5000+ audio files

7 datasets

Datasets

This benchmark was performed using 3 open-source datasets (LibriSpeech, Rev16, and Meanwhile) and 4 in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio data covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse set of English audio that spans phone calls, broadcasts, accented speech, and heavy jargon.

Methodology

We measured the performance of each provider on 7 datasets. For each vendor, we made API calls to their most accurate model for each file, and for Whisper we generated outputs using a self-hosted instance of Whisper-Large-V3. After receiving results for each file from each vendor, we normalized both the prediction from the model and the ground truth transcript using the open-source Whisper Normalizer. From there, we calculated the average metrics for each file across datasets to measure performance.
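
The pipeline described above (normalize both transcripts, align them, then average per-file metrics) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the benchmark: `normalize` is a simplified stand-in for the open-source Whisper Normalizer, the formulas are the standard definitions of WER, MER, WIL, and WIP, the insertion rate is assumed to correspond to what this page calls Hallucination Rate, and the function names and sample sentences are hypothetical.

```python
import re
from statistics import mean


def normalize(text: str) -> list[str]:
    """Simplified stand-in for the Whisper Normalizer: lowercase, drop punctuation."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def align_counts(ref: list[str], hyp: list[str]) -> dict[str, int]:
    """Count hits, substitutions, deletions, insertions via word-level edit distance."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the end of both sequences to classify each step.
    counts = {"hits": 0, "sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["hits" if ref[i - 1] == hyp[j - 1] else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


def metrics(reference: str, prediction: str) -> dict[str, float]:
    """Per-file metrics as fractions (multiply by 100 for percentages)."""
    ref, hyp = normalize(reference), normalize(prediction)
    if not ref:
        raise ValueError("reference transcript is empty after normalization")
    c = align_counts(ref, hyp)
    h, s, d, i = c["hits"], c["sub"], c["del"], c["ins"]
    n_ref, n_hyp = h + s + d, h + s + i
    wip = (h * h) / (n_ref * n_hyp) if n_hyp else 0.0
    return {
        "wer": (s + d + i) / n_ref,
        "mer": (s + d + i) / (h + s + d + i),
        "wil": 1.0 - wip,
        "wip": wip,
        "deletion_rate": d / n_ref,
        "insertion_rate": i / n_ref,  # assumed to match "Hallucination Rate"
        "substitution_rate": s / n_ref,
    }


# Hypothetical (ground truth, model prediction) pairs standing in for a dataset.
files = [
    ("The quick brown fox jumps over the lazy dog.", "the quick brown fox jumped over the lazy dog"),
    ("Please call me back tomorrow.", "please call me back"),
]

per_file = [metrics(ref, hyp) for ref, hyp in files]
average_wer = mean(m["wer"] for m in per_file)
print(f"Average WER: {average_wer:.2%}")
```

Averaging per file, as here, weights every file equally regardless of length. In the per-dataset tables above, Word Accuracy Rate equals 100% minus Word Error Rate.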

AssemblyAI combines industry-leading accuracy with a robust feature set

Language Detection

Speaker Labels

Word Timings

Real-time Streaming

Custom Vocabulary

Dual Channel

LLM Text Generation

Profanity Filtering

Speech Threshold

Advanced PII Redaction

Extract maximum value from voice data

AssemblyAI’s models deliver the highest accuracy and an expansive set of capabilities, making it easy to build on top of voice data.

Rapid innovation, constant iteration

AssemblyAI ships new features, model improvements, and product updates weekly to ensure you have access to cutting-edge Speech AI capabilities.



Trusted by thousands of developers in businesses across every industry

"With AssemblyAI, we get a trusted partner that supports us in enhancing the value we deliver for our customers."

Ryan Johnson, Chief Product Officer, CallRail

It’s fast and free to start building with AssemblyAI

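import json

import assemblyai as aai

# Placeholders: supply your own API key and a publicly reachable audio URL.
aai.settings.api_key = "YOUR_API_KEY"
URL = "https://example.com/audio.mp3"
config = aai.TranscriptionConfig()  # default transcription options

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(URL, config)

# A Transcript object is not JSON-serializable; dump its raw JSON response instead.
print(json.dumps(transcript.json_response, indent=2))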