Universal-2 vs OpenAI's Whisper: Comparing Speech-to-Text models in real-world use cases
Comparing Universal-2, Universal-1, and the Whisper models on proper noun and alphanumeric detection tasks, text formatting, and hallucinations.



In this blog post, we'll compare Universal-2, Universal-1, and two Whisper variants (large-v3 and turbo) in terms of their fitness for real-world Speech-to-Text scenarios.
While all models show impressive Speech-to-Text accuracy overall, this comparison focuses on their performance regarding the finer details that are crucial for readable transcripts and downstream tasks:
- Proper nouns (e.g. person names, places, brand names)
- Alphanumerics (e.g. digits, years, phone numbers)
- Text formatting (e.g. upper/lower case, punctuation)
- Hallucinations
Compared models
We'll compare the following models:
Universal-2 is AssemblyAI's latest Speech-to-Text model, showing substantial improvements over its predecessor Universal-1, and achieving best-in-class accuracy.
Whisper large-v3 is a popular open-source model created by OpenAI. The Whisper turbo model is a newer, optimized version offering faster transcription speed with minimal degradation in accuracy compared to large-v3.
Code to run Universal-2 and Whisper
Before we look at the evaluation, let’s quickly see how you can run Universal-2 and the Whisper models in case you want to run your own evaluations. All models can be easily run via their respective SDKs.
Run Universal-2:
# pip install assemblyai
import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"
transcript = aai.Transcriber().transcribe("./filename.mp3")
print(transcript.text)
Run Whisper:
# pip install openai-whisper
import whisper
whisper_v3 = whisper.load_model("large-v3")
# or
whisper_turbo = whisper.load_model("turbo")
result = whisper_v3.transcribe("./filename.mp3")
print(result["text"])
Note that Universal-2 is the new default model in AssemblyAI's API for English audio, replacing Universal-1. To run it, you'll need a free API key.
The open-source Whisper models require a GPU with sufficient VRAM (see requirements). A free solution you can use is a Google Colab. Make sure to change the runtime type and select a proper GPU or TPU.
You can use this Google Colab to run all models with a few example audio files.
Evaluation datasets
A detailed performance analysis with a breakdown of all evaluation datasets can be found in the Universal-2 research report. The report also includes additional metrics and comparisons against other model providers.
You can read more about how to evaluate speech recognition models in our blog.
Results
Standard ASR accuracy
We begin by measuring each model's overall word-level accuracy to get a general indicator of its performance.
The metric used is WER (Word Error Rate), which counts the total number of mistakes the model makes at the word level and reports it as a proportion of the total number of words in the "ground truth" transcript. A lower WER is better, for example, a model with a 10% WER makes on average 1 mistake every 10 words.
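To make the metric concrete, WER can be computed as a word-level Levenshtein (edit) distance divided by the reference length. The sketch below is a minimal illustration, not the exact scoring pipeline used in the evaluation, which typically also applies text normalization before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution over 6 words ≈ 0.167
```

In practice you would use an established library rather than rolling your own, but the definition above is all there is to the core metric.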
Proper nouns
Universal-2 shows the best proper noun recognition, with a 24% relative reduction in error rate compared to Universal-1. Whisper large-v3 is the second-best tested model, with an 11% relative error increase compared to Universal-2. Universal-1 and Whisper turbo struggle the most with proper nouns.
Alphanumerics
Another factor critical to practical usage scenarios is alphanumerics recognition accuracy. Many real-world audio samples contain sequences of spoken letters and numbers, e.g. phone numbers, ticket numbers, or years. Without accurate transcriptions, incorrect information can get forwarded for downstream processing, e.g., analyzing the audio content with an LLM.
To measure alphanumerics recognition accuracy, we calculated Alphanumerics WER, which is the WER based on a 10-hour dataset created by sampling audio clips rich in alphanumeric content.
All models transcribe alphanumerics confidently and outperform all other tested model providers, with Whisper large-v3 having an edge in this category.
Formatting
For transcripts that are correct and easy to read, formatting is also important: specifically, correct punctuation, capitalization, and Inverse Text Normalization (ITN).
To measure formatting accuracy, we calculated U-WER (Unpunctuated Word Error Rate). U-WER is the word error rate computed over formatted outputs from which punctuation marks are deleted. This metric takes into account Truecasing and ITN accuracy, on top of standard ASR accuracy. We also calculated F-WER (Formatted WER), which is similar to U-WER except it additionally measures punctuation accuracy. Note that F-WER tends to fluctuate more than U-WER, given that correct punctuation is not always uniquely determined.
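The relationship between the two metrics can be illustrated with a small sketch. The punctuation set and example sentences below are illustrative assumptions, not the exact normalization used in the report; the point is that U-WER deletes punctuation marks before scoring (so it still penalizes casing and ITN errors), while F-WER scores the fully formatted text:

```python
import re

def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return word_edit_distance(ref.split(), hyp.split()) / len(ref.split())

def strip_punctuation(text):
    # Delete punctuation marks only; casing and ITN forms (e.g. "3 PM") are
    # kept, so the stripped metric still reflects truecasing and ITN accuracy.
    return re.sub(r"[.,!?;]", "", text)

ref = "Meet me at 3 PM, Dr. Smith."
hyp = "Meet me at 3 PM Dr Smith"  # words correct, punctuation missing

f_wer = wer(ref, hyp)                                        # formatted text → F-WER-style score
u_wer = wer(strip_punctuation(ref), strip_punctuation(hyp))  # punctuation deleted → U-WER-style score
print(f_wer, u_wer)  # punctuation errors inflate the first score but not the second
```

Here the hypothesis has every word right but no punctuation, so the punctuation-stripped score is zero while the formatted score is not; this is also why F-WER fluctuates more, since several punctuation choices can be equally valid.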
Universal-2 shows significant improvements over its predecessor and has a clear advantage in this category. Compared to Whisper large-v3 and turbo, Universal-2 shows a 16% and 22% reduction in error rate, respectively. Universal-2 achieves a similar lead in F-WER.
Hallucinations
One key quirk that has been observed with Whisper is its increased propensity for hallucinations.
The Whisper large-v3 model, in particular, often produces long contiguous blocks of consecutive transcription errors. In a recent report, a University of Michigan researcher studying public meetings found hallucinations in 8 out of every 10 audio transcriptions.
If you paid close attention to the Whisper transcripts of the above examples (or examined the outputs in the accompanying Google Colab), you'd have noticed that both the alphanumerics and the proper nouns audio examples contain hallucinations towards the end of the transcript. Despite the fact that Whisper shows strong overall standard ASR accuracy, these hallucinations significantly impact its suitability for real-world use cases.
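Hallucinated blocks often show up as repeated phrase loops at the end of a transcript, so a simple repeated-n-gram check can flag suspicious outputs for review. This is an illustrative heuristic under that assumption, not the detection method used in the evaluation:

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are repeats; looping text scores high."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)

clean = "thanks for calling please hold while I transfer you"
looped = "thank you for watching " * 6  # the kind of loop Whisper can emit

print(repeated_ngram_ratio(clean))   # low: no repeated trigrams
print(repeated_ngram_ratio(looped))  # high: mostly repeated trigrams
```

A threshold on this ratio (combined with, say, comparing transcript length against audio duration) gives a cheap first-pass filter, though it cannot catch hallucinations that read as fluent, non-repetitive text.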
In our evaluations, the Universal models showed a 30% reduction in hallucination rates compared to Whisper large-v3, making them a more reliable choice for many practical Speech-to-Text applications.
Conclusion
Universal-2 emerges as the leading model in most categories:
- Best overall accuracy (6.68% WER)
- Superior proper noun handling (13.87% PNER)
- Best formatting accuracy (10.04% U-WER)
- 30% reduction in hallucination rates compared to Whisper
It shows significant improvements over its predecessor, which is qualitatively demonstrated by a human preference test in which 73% of users (nearly 3 out of 4 people) preferred Universal-2's output over Universal-1's.
Whisper large-v3 shows some notable strengths and limitations:
- Best alphanumeric transcription accuracy (3.84% WER)
- Decent performance across other categories
- Requires careful consideration due to documented hallucination issues
Whisper turbo offers a balanced trade-off:
- Notable weakness in proper noun detection (18.18% PNER)
- Performance close to large-v3 in other metrics; turbo is a good choice over large-v3 when prioritizing speed over accuracy
- Ideal for local deployments with limited resources (~6GB VRAM)
For a more in-depth evaluation including additional metrics and comparisons against other model providers, read the Universal-2 research report.
If you're interested in learning more about the process of properly evaluating models in an objective, scientific way, you can read this blog post.