Universal-2 vs OpenAI's Whisper: Comparing Speech-to-Text models in real-world use cases

Comparing Universal-2, Universal-1, and the Whisper models at proper noun and alphanumeric detection tasks, text formatting, and hallucinations.

In this blog post, we'll compare Universal-2, Universal-1, and two Whisper variants (large-v3 and turbo) in terms of their fitness for real-world Speech-to-Text scenarios.

While all models show impressive Speech-to-Text accuracy overall, this comparison focuses on their performance regarding the finer details that are crucial for readable transcripts and downstream tasks:

  • Proper nouns (e.g. person names, places, brand names)
  • Alphanumerics (e.g. digits, years, phone numbers)
  • Text formatting (e.g. upper/lower case, punctuation)
  • Hallucinations

Compared models

We'll compare the following models:

| Model | Parameters | Required VRAM |
| --- | --- | --- |
| Universal-2 | 600 M | - |
| Universal-1 | 600 M | - |
| Whisper large-v3 | 1550 M | ~10 GB |
| Whisper turbo | 809 M | ~6 GB |

Universal-2 is AssemblyAI's latest Speech-to-Text model, showing substantial improvements over its predecessor Universal-1, and achieving best-in-class accuracy.

Whisper large-v3 is a popular open-source model created by OpenAI. The Whisper turbo model is a new optimized version offering faster transcription speed with minimal degradation in accuracy compared to large-v3.

Code to run Universal-2 and Whisper

Before we look at the evaluation, let’s quickly see how you can run Universal-2 and the Whisper models in case you want to run your own evaluations. All models can be easily run via their respective SDKs.

Run Universal-2:

# pip install assemblyai
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Transcribe a local file (publicly accessible URLs also work)
transcript = aai.Transcriber().transcribe("./filename.mp3")

print(transcript.text)

Run Whisper:

# pip install openai-whisper
import whisper

whisper_v3 = whisper.load_model("large-v3")
# or
whisper_turbo = whisper.load_model("turbo")

result = whisper_v3.transcribe("./filename.mp3")

print(result["text"])

Note that Universal-2 is the new default model in AssemblyAI's API for English audio, replacing Universal-1. To run it, you'll need a free API key.

The open-source Whisper models require a GPU with sufficient VRAM (see requirements). A free option is Google Colab; make sure to change the runtime type and select a GPU.

💡
You can use this Google Colab to run all models with a few example audio files.

Evaluation datasets

A detailed performance analysis with a breakdown of all evaluation datasets can be found in the Universal-2 research report. The report also includes additional metrics and comparisons against other model providers.

You can read more about how to evaluate speech recognition models in our blog.

Results

Standard ASR accuracy

We begin by measuring the overall word-level accuracy of each model to get a general indicator of its performance.

The metric used is WER (Word Error Rate), which counts the total number of mistakes the model makes at the word level (substitutions, insertions, and deletions) and reports it as a proportion of the total number of words in the "ground truth" transcript. A lower WER is better; for example, a model with a 10% WER makes on average 1 mistake every 10 words.
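As a concrete illustration, WER can be computed with a word-level edit distance. Here is a minimal sketch; the evaluation pipeline behind the numbers below also normalizes text before scoring, which is omitted here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for the Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hyp_word in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,                       # deletion
                d[j - 1] + 1,                   # insertion
                prev + (ref_word != hyp_word),  # substitution (0 if words match)
            )
            prev = cur
    return d[-1] / len(ref)

print(wer("she sells sea shells", "she sells shells"))  # one deletion in four words -> 0.25
```

Libraries such as `jiwer` implement the same metric (plus normalization options) if you want to score your own transcripts.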

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| WER | 6.68% | 6.88% | 7.88% | 7.75% |

Universal-2 leads with the lowest WER at 6.68%, a relative WER reduction of 3% compared to Universal-1. Both Whisper models also perform well, making around one more word error per hundred words on average than Universal-2. Interestingly, the optimized turbo variant performed slightly better than large-v3. All four models outperform the other model providers we tested on this metric. Note that these results are for English ASR datasets.

Proper nouns

Next, let’s see how well the models recognize proper nouns such as names, cities, and brands. Proper nouns carry greater information than common words like articles (“the”, “a”, “an”), so it is important to evaluate a model's performance on proper nouns in isolation.

To measure proper noun recognition accuracy, we calculated PNER (Proper Noun Error Rate), a metric based on the Jaro-Winkler distance between the proper nouns extracted from ASR outputs and those extracted from the ground-truth transcripts. A lower PNER is better.
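The Jaro-Winkler measure rewards matching prefixes, which makes it well suited to near-miss spellings of names. A minimal sketch of the similarity is below (the distance is 1 minus the similarity; PNER additionally involves proper-noun extraction, which is omitted here):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0  # matching characters found within the search window
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # classic textbook pair -> 0.961
```

Because the score is character-based, a transcript that renders "Colebrookdale" as "Coalbrookdale" is penalized only lightly, while a completely wrong name is penalized heavily.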

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| PNER | 13.87% | 18.17% | 15.41% | 18.18% |

Universal-2 shows the best proper noun recognition, with a 24% relative reduction in error rate compared to Universal-1. Whisper large-v3 is the second-best tested model, with an 11% relative error increase compared to Universal-2. Universal-1 and Whisper turbo struggle the most with proper nouns.

Example

audio-thumbnail
Proper nouns
0:00
/163.526531

Transcript Universal-2

Hello and welcome to HistoryPod. On 25 August 1875, Captain Matthew Webb became the first person to successfully swim across the English Channel. Webb learnt to swim in the River Severn near the industrial village of Colebrookdale, near the modern English town of Telford. When he was 12 years old, he began training on board the school ship HMS Conway, leading to a career as a sailor on board merchant and passenger vessels. It was while sailing from New York to Liverpool that he first showed his strength as a swimmer by diving into the Atlantic Ocean to save a man who had gone overboard. Although the man was never found, Webb was celebrated for his courage. Webb later became Captain of the Emerald and it was during this time that he read of J.B. johnson's failed attempt to swim across the Channel. Quitting his job, Webb began preparations to make his own crossing, which he did in August 1875. On 12 August, Webb covered himself in porpoise oil to act as insulation against the cold water and entered the Channel at Dover's Admiralty pier. He was forced to abandon this first attempt after a storm at sea made the swim particularly difficult, but he tried again 12 days later. Shortly before 1pm on 24 August, accompanied by three boats, Webb finally stepped ashore near Calais in France. After swimming for approximately 21 hours and 45 minutes. His crossing had seen strong currents regularly move him off course, while a painful jellyfish sting was dealt with by drinking a glass of brandy. Altogether, he had swum 64 kilometres. Matthew Webb's Cross Channel swim secured his position as a Victorian celebrity and he soon found fortune by undertaking other extreme water based challenges. He died in 1883 during a failed attempt to swim across the whirlpool rapids below Niagara Falls.

Transcript Whisper large-v3

Hello, and welcome to HistoryPod. On 25 August 1875, Captain Matthew Webb became the first person to successfully swim across the English Channel. Webb learnt to swim in the River Severn near the industrial village of Colebrookdale near the modern English town of Telford. When he was 12 years old, he began training on board the school ship HMS Conway, leading to a career as a sailor on board merchant and passenger vessels. It was while sailing from New York to Liverpool that he first showed his strength as a swimmer by diving into the Atlantic Ocean to save a man who had gone overboard. Although the man was never found, Webb was celebrated for his courage. Webb later became captain of the Emerald, and it was during this time that he read of J.B. Johnson's failed attempt to swim across the Channel. Quitting his job, Webb began preparations to make his own crossing, which he did in August 1875. On 12 August, Webb covered himself in porpoise oil to act as insulation against the cold water and entered the Channel at Dover's Admiralty Pier. He was forced to abandon this first attempt after a storm at sea made the swim particularly difficult, but he tried again 12 days later, shortly before 1pm on 24 August. Accompanied by three boats, Webb finally stepped ashore near Calais in France after swimming for approximately 21 hours and 45 minutes. His crossing had seen strong currents regularly move him off course, while a painful jellyfish sting was dealt with by drinking a glass of water. The first time Webb was ever seen in the sea was in the early 1850s, when he was caught by a boat. He was caught in a river, and he was caught swimming a glass of brandy. All together he had swum 64 kilometres. Matthew Webb's cross-channel swim secured his position as a Victorian celebrity, and he soon found fortune by undertaking other extreme water-based challenges. He died in 1883 during a failed attempt to swim across the Whirlpool Rapids, below Niagara Falls. 
He was given 75 years ofenced per!!!!!! Flag 9!!! 9!!! 5!

Alphanumerics

Another factor critical to practical usage scenarios is alphanumerics recognition accuracy. Many real-world audio samples contain sequences of spoken letters and numbers, e.g. phone numbers, ticket numbers, or years. Without accurate transcriptions, incorrect information can get forwarded for downstream processing, e.g., analyzing the audio content with an LLM.
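To make the downstream risk concrete, here is an illustrative sketch of the kind of post-processing that breaks when alphanumerics are mistranscribed: extracting digit-heavy spans (years, phone numbers, ticket IDs) from a transcript with a regular expression. The pattern is a simplified example, not a production-grade extractor:

```python
import re

def extract_alphanumeric_spans(transcript: str) -> list[str]:
    # Match runs that start and end with a digit and may contain
    # digits, spaces, parentheses, '+' or '-' in between.
    return re.findall(r"\b(?:\d[\d\s()+-]*\d|\d)\b", transcript)

print(extract_alphanumeric_spans(
    "Call 555-0123 about ticket 42. The bridge opened in 2005."
))  # ['555-0123', '42', '2005']
```

If the model transcribes "555-0123" as "five five five oh one two three", this extractor (and any LLM prompt built on the transcript) receives the wrong information, which is why the Alphanumerics WER below matters.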

To measure alphanumerics recognition accuracy, we calculated Alphanumerics WER, which is the WER based on a 10-hour dataset created by sampling audio clips rich in alphanumeric content.

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| Alphanumerics WER | 4.00% | 5.06% | 3.84% | 4.18% |

All models transcribe alphanumerics reliably and outperform all other tested model providers, with Whisper large-v3 having a slight edge in this category.

Example

audio-thumbnail
Alphanumerics
0:00
/60.447347

Transcript Universal-2

In this pattern, 3, 8, 6, 11, 9, and so on. Notice that we add 5 to get from 3 to 8. Then we subtract 2 to get from 8 to 6. Then we add 5 to get from 6 to 11. Then we subtract 2 to get From 11 to 9. So the pattern is adding 5 followed by subtracting 2. Continuing with this pattern, 9 plus 5 is 14, 14 minus 2 is 12, and 12 plus 5 is 17. So the next three numbers in the pattern are 14, 12, and 17.

Transcript Whisper large-v3

In this pattern, 3, 8, 6, 11, 9, and so on, notice that we add 5 to get from 3 to 8. Then, we subtract 2 to get from 8 to 6. Then, we add 5 to get from 6 to 11. Then, we subtract 2 to get from 11 to 9. So, the pattern is adding 5 followed by subtracting 2. Continuing with this pattern, 9 plus 5 is 14. 14 minus 2 is 12. And 12 plus 5 is 17. So, the next three numbers in the pattern are 14, 12, and 17. So, we'll get Then, the different patterns will be like this. If you count us with a scale in the diagram, we'll go

Formatting

Formatting is also important for producing transcripts that are easy to read: specifically, correct punctuation, capitalization, and Inverse Text Normalization (ITN), which converts spoken forms like "eighteen seventy five" into written forms like "1875".

To measure formatting accuracy, we calculated U-WER (Unpunctuated Word Error Rate). U-WER is the word error rate computed over formatted outputs from which punctuation marks are deleted. This metric takes into account Truecasing and ITN accuracy, on top of standard ASR accuracy. We also calculated F-WER (Formatted WER), which is similar to U-WER except it additionally measures punctuation accuracy. Note that F-WER tends to fluctuate more than U-WER, given that correct punctuation is not always uniquely determined.
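The difference between the two metrics comes down to how the texts are tokenized before the word error rate is computed. A simplified sketch of the two tokenizations follows; the exact normalization rules used in the evaluation are not reproduced here, so treat this as illustrative:

```python
import re

def tokens_for_uwer(text: str) -> list[str]:
    # U-WER: delete punctuation marks but keep casing and written forms,
    # so truecasing errors ("webb" vs "Webb") and ITN errors
    # ("eighteen seventy five" vs "1875") still count against the model.
    return re.sub(r"[.,!?;:]", "", text).split()

def tokens_for_fwer(text: str) -> list[str]:
    # F-WER: keep punctuation, scoring each mark as its own token.
    return re.findall(r"[\w'$%/-]+|[.,!?;:]", text)

print(tokens_for_uwer("It opened in 2005."))  # ['It', 'opened', 'in', '2005']
print(tokens_for_fwer("It opened in 2005."))  # ['It', 'opened', 'in', '2005', '.']
```

Running a standard WER computation over these token lists yields U-WER and F-WER, respectively.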

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| U-WER | 10.04% | 11.78% | 12.01% | 12.83% |
| F-WER | 15.14% | 16.68% | 16.84% | 17.61% |

Universal-2 shows significant improvements over its predecessor and has a clear advantage in this category. Compared to Whisper large-v3 and turbo, Universal-2 achieves a 16% and 22% relative reduction in U-WER, respectively, and a similar lead in F-WER.

Example

audio-thumbnail
Formatting
0:00
/69.012

Transcript Universal-2

Welcome to another edition of Traveler TV. Today we're at the Arthur Ravenel Jr Bridge, located here. It opened in 2005 and is currently the longest cable stayed bridge in the Western Hemisphere. The design features two diamond shaped towers that span the Cooper river and connect downtown Charleston with Mount Pleasant. The bicycle pedestrian paths provide unparalleled views of the harbor and is the perfect spot to catch a sunrise or sunset. To walk or bike the bridge, you can park on either the downtown side here or on the Mount Pleasant side in Memorial Waterfront Park. To learn more about The Arthur Ravenel Jr. Bridge and other fun things to do in Charleston, SC. Visit our website at travelerofcharleston.com or download our free mobile app exploring Charleston SC.

Transcript Whisper large-v3

Welcome to another edition of Traveler TV. Today we're at the Arthur Ravenel Jr. Bridge located here. It opened in 2005 and is currently the longest cable-stayed bridge in the Western Hemisphere. The design features two diamond-shaped towers that span the Cooper River and connect downtown Charleston with Mount Pleasant. The bicycle or pedestrian paths provide unparalleled views of the harbor and is the perfect spot to catch a sunrise or sunset. To walk or bike the bridge, you can park on either the downtown side here or on the Mount Pleasant side in Memorial Waterfront Park. To learn more about the Arthur Ravenel Jr. Bridge and other fun things to do in Charleston, South Carolina, visit our website at TravelerofCharleston.com or download our free mobile app. Traveler of Charleston.com or download our free mobile app. Exploring Charleston SC.

Hallucinations

One key quirk that has been observed for Whisper is its propensity for hallucinations: plausible-sounding text that was never spoken in the audio, often manifesting in long contiguous blocks of consecutive transcription errors. The large-v3 model is particularly affected. In this recent report, a University of Michigan researcher studying public meetings found hallucinations in 8 out of every 10 audio transcriptions.

If you paid close attention to the Whisper transcripts of the above examples (or examined the outputs in the accompanying Google Colab), you'd have noticed that both the alphanumerics and the proper nouns audio examples contain hallucinations towards the end of the transcript. Despite the fact that Whisper shows strong overall standard ASR accuracy, these hallucinations significantly impact its suitability for real-world use cases.

In our evaluations, the Universal models showed a 30% reduction in hallucination rates compared to Whisper large-v3, making them a more reliable choice for many practical Speech-to-Text applications.
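Hallucinations often surface as looping repetitions of the same phrase, which makes a crude automated screen possible. The heuristic below flags transcripts with an unusually high share of repeated word n-grams; it is a rough screening tool of our own devising, not how the hallucination rates above were measured:

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 5) -> float:
    """Fraction of word n-grams that occur more than once in the transcript."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A looping phrase scores high; normal prose scores near zero.
print(repeated_ngram_ratio("see you in the next video " * 4))  # 1.0
```

Scores well above typical prose levels are worth a manual listen before the transcript is passed to downstream systems.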

Example

audio-thumbnail
Hallucination example
0:00
/27.09

Transcript Universal-2

All right, friends. So you can see I have raised my balance to 37.7 US dollar. You can see.

Transcript Whisper large-v3

All right friends, so you can see I have raised my balance to 37.7 US dollar. Thank you for watchingINDONESIA Kennel, see you in the next video here CAR interaction.

Conclusion

Universal-2 emerges as the leading model in most categories:

  • Best overall accuracy (6.68% WER)
  • Superior proper noun handling (13.87% PNER)
  • Best formatting accuracy (10.04% U-WER)
  • 30% reduction in hallucination rates compared to Whisper

It shows significant improvements over its predecessor, which is qualitatively demonstrated by a human preference test in which 73% of users (nearly 3 out of 4 people) preferred Universal-2's output over Universal-1's.

Whisper large-v3 shows some notable strengths and limitations:

  • Best alphanumeric transcription accuracy (3.84% WER)
  • Decent performance across other categories
  • Requires careful consideration due to documented hallucination issues

Whisper turbo offers a balanced trade-off:

  • Notable weakness in proper noun detection (18.18% PNER)
  • Performance close to large-v3 in the other metrics, making it a good choice over large-v3 when prioritizing speed over accuracy
  • Ideal for local deployments with limited resources (~6GB VRAM)

For a more in-depth evaluation including additional metrics and comparisons against other model providers, read the Universal-2 research report.

If you're interested in learning more about the process of properly evaluating models in an objective, scientific way, you can read this blog post.