Universal-2 vs OpenAI's Whisper: Comparing Speech-to-Text models in real-world use cases

Comparing Universal-2, Universal-1, and the Whisper models at proper noun and alphanumeric detection tasks, text formatting, and hallucinations.

In this blog post, we'll compare Universal-2, Universal-1, and two Whisper variants (large-v3 and turbo) in terms of their fitness for real-world Speech-to-Text scenarios.

While all models show impressive Speech-to-Text accuracy overall, this comparison focuses on their performance regarding the finer details that are crucial for readable transcripts and downstream tasks:

  • Proper nouns (e.g. person names, places, brand names)
  • Alphanumerics (e.g. digits, years, phone numbers)
  • Text formatting (e.g. upper/lower case, punctuation)
  • Hallucinations

Compared models

We'll compare the following models:

| Model | Parameters | Required VRAM |
| --- | --- | --- |
| Universal-2 | 600 M | - |
| Universal-1 | 600 M | - |
| Whisper large-v3 | 1550 M | ~10 GB |
| Whisper turbo | 809 M | ~6 GB |

Universal-2 is AssemblyAI's latest Speech-to-Text model, showing substantial improvements over its predecessor Universal-1, and achieving best-in-class accuracy.

Whisper large-v3 is a popular open-source model created by OpenAI. The Whisper turbo model is a new optimized version offering faster transcription speed with minimal degradation in accuracy compared to large-v3.

Code to run Universal-2 and Whisper

Before we look at the evaluation, let’s quickly see how you can run Universal-2 and the Whisper models in case you want to run your own evaluations. All models can be easily run via their respective SDKs.

Run Universal-2:

# pip install assemblyai
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Transcribe a local file (publicly accessible URLs also work)
transcript = aai.Transcriber().transcribe("./filename.mp3")

print(transcript.text)

Run Whisper:

# pip install openai-whisper
import whisper

whisper_v3 = whisper.load_model("large-v3")
# or
whisper_turbo = whisper.load_model("turbo")

result = whisper_v3.transcribe("./filename.mp3")

print(result["text"])

Note that Universal-2 is the new default model in AssemblyAI's API for English audio, replacing Universal-1. To run it, you'll need a free API key.

The open-source Whisper models require a GPU with sufficient VRAM (see requirements). A free option is Google Colab; make sure to change the runtime type and select a GPU.

💡
You can use this Google Colab to run all models with a few example audio files.

Evaluation datasets

A detailed performance analysis with a breakdown of all evaluation datasets can be found in the Universal-2 research report. The report also includes additional metrics and comparisons against other model providers.

You can read more about how to evaluate speech recognition models in our blog.

Results

Standard ASR accuracy

We begin by measuring the overall word-level accuracy of each model to get a general indicator of its performance.

The metric used is WER (Word Error Rate), which counts the total number of mistakes the model makes at the word level (substitutions, insertions, and deletions) and reports it as a proportion of the total number of words in the "ground truth" transcript. A lower WER is better; for example, a model with a 10% WER makes on average 1 mistake every 10 words.
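As a concrete illustration, WER can be computed with a word-level edit distance. Here is a minimal sketch; the evaluation pipeline behind the numbers below also normalizes text before scoring, which is omitted here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for the Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hyp_word in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,                       # deletion
                d[j - 1] + 1,                   # insertion
                prev + (ref_word != hyp_word),  # substitution (0 if words match)
            )
            prev = cur
    return d[-1] / len(ref)

print(wer("she sells sea shells", "she sells shells"))  # one deletion in four words -> 0.25
```

Libraries such as `jiwer` implement the same metric (plus normalization options) if you want to score your own transcripts.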

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| WER | 6.68% | 6.88% | 7.88% | 7.75% |

Universal-2 leads with the lowest WER at 6.68%, a relative WER reduction of 3% compared to Universal-1. Both Whisper models also perform well, making around one more word error per hundred words on average than Universal-2. Interestingly, the optimized turbo variant performed slightly better than large-v3. All four models outperform the other model providers we tested on this metric. Note that these results are for English ASR datasets.

Proper nouns

Next, let’s see how well the models recognize proper nouns such as names, cities, and brands. Proper nouns carry greater information than common words like articles (“the”, “a”, “an”), so it is important to evaluate a model's performance on proper nouns in isolation.

To measure proper noun recognition accuracy, we calculated PNER (Proper Noun Error Rate), a metric based on the Jaro-Winkler distance between the proper nouns extracted from ASR outputs and those extracted from the ground-truth transcripts. A lower PNER is better.
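The Jaro-Winkler measure rewards matching prefixes, which makes it well suited to near-miss spellings of names. A minimal sketch of the similarity is below (the distance is 1 minus the similarity; PNER additionally involves proper-noun extraction, which is omitted here):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0  # matching characters found within the search window
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # classic textbook pair -> 0.961
```

Because the score is character-based, a transcript that renders "Colebrookdale" as "Coalbrookdale" is penalized only lightly, while a completely wrong name is penalized heavily.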

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| PNER | 13.87% | 18.17% | 15.41% | 18.18% |

Universal-2 shows the best proper noun recognition, with a 24% relative reduction in error rate compared to Universal-1. Whisper large-v3 is the second-best tested model, with an 11% relative error increase compared to Universal-2. Universal-1 and Whisper turbo struggle the most with proper nouns.

Example

audio-thumbnail
Proper nouns
0:00
/163.526531

Transcript Universal-2

Hello and welcome to HistoryPod. On 25 August 1875, Captain Matthew Webb became the first person to successfully swim across the English Channel. Webb learnt to swim in the River Severn near the industrial village of Colebrookdale, near the modern English town of Telford. When he was 12 years old, he began training on board the school ship HMS Conway, leading to a career as a sailor on board merchant and passenger vessels. It was while sailing from New York to Liverpool that he first showed his strength as a swimmer by diving into the Atlantic Ocean to save a man who had gone overboard. Although the man was never found, Webb was celebrated for his courage. Webb later became Captain of the Emerald and it was during this time that he read of J.B. johnson's failed attempt to swim across the Channel. Quitting his job, Webb began preparations to make his own crossing, which he did in August 1875. On 12 August, Webb covered himself in porpoise oil to act as insulation against the cold water and entered the Channel at Dover's Admiralty pier. He was forced to abandon this first attempt after a storm at sea made the swim particularly difficult, but he tried again 12 days later. Shortly before 1pm on 24 August, accompanied by three boats, Webb finally stepped ashore near Calais in France. After swimming for approximately 21 hours and 45 minutes. His crossing had seen strong currents regularly move him off course, while a painful jellyfish sting was dealt with by drinking a glass of brandy. Altogether, he had swum 64 kilometres. Matthew Webb's Cross Channel swim secured his position as a Victorian celebrity and he soon found fortune by undertaking other extreme water based challenges. He died in 1883 during a failed attempt to swim across the whirlpool rapids below Niagara Falls.

Transcript Whisper large-v3

Hello, and welcome to HistoryPod. On 25 August 1875, Captain Matthew Webb became the first person to successfully swim across the English Channel. Webb learnt to swim in the River Severn near the industrial village of Colebrookdale near the modern English town of Telford. When he was 12 years old, he began training on board the school ship HMS Conway, leading to a career as a sailor on board merchant and passenger vessels. It was while sailing from New York to Liverpool that he first showed his strength as a swimmer by diving into the Atlantic Ocean to save a man who had gone overboard. Although the man was never found, Webb was celebrated for his courage. Webb later became captain of the Emerald, and it was during this time that he read of J.B. Johnson's failed attempt to swim across the Channel. Quitting his job, Webb began preparations to make his own crossing, which he did in August 1875. On 12 August, Webb covered himself in porpoise oil to act as insulation against the cold water and entered the Channel at Dover's Admiralty Pier. He was forced to abandon this first attempt after a storm at sea made the swim particularly difficult, but he tried again 12 days later, shortly before 1pm on 24 August. Accompanied by three boats, Webb finally stepped ashore near Calais in France after swimming for approximately 21 hours and 45 minutes. His crossing had seen strong currents regularly move him off course, while a painful jellyfish sting was dealt with by drinking a glass of water. The first time Webb was ever seen in the sea was in the early 1850s, when he was caught by a boat. He was caught in a river, and he was caught swimming a glass of brandy. All together he had swum 64 kilometres. Matthew Webb's cross-channel swim secured his position as a Victorian celebrity, and he soon found fortune by undertaking other extreme water-based challenges. He died in 1883 during a failed attempt to swim across the Whirlpool Rapids, below Niagara Falls. 
He was given 75 years ofenced per!!!!!! Flag 9!!! 9!!! 5!

Alphanumerics

Another factor critical to practical usage scenarios is alphanumerics recognition accuracy. Many real-world audio samples contain sequences of spoken letters and numbers, e.g. phone numbers, ticket numbers, or years. Without accurate transcriptions, incorrect information can get forwarded for downstream processing, e.g., analyzing the audio content with an LLM.
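To make the downstream risk concrete, here is an illustrative sketch of the kind of post-processing that breaks when alphanumerics are mistranscribed: extracting digit-heavy spans (years, phone numbers, ticket IDs) from a transcript with a regular expression. The pattern is a simplified example, not a production-grade extractor:

```python
import re

def extract_alphanumeric_spans(transcript: str) -> list[str]:
    # Match runs that start and end with a digit and may contain
    # digits, spaces, parentheses, '+' or '-' in between.
    return re.findall(r"\b(?:\d[\d\s()+-]*\d|\d)\b", transcript)

print(extract_alphanumeric_spans(
    "Call 555-0123 about ticket 42. The bridge opened in 2005."
))  # ['555-0123', '42', '2005']
```

If the model transcribes "555-0123" as "five five five oh one two three", this extractor (and any LLM prompt built on the transcript) receives the wrong information, which is why the Alphanumerics WER below matters.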

To measure alphanumerics recognition accuracy, we calculated Alphanumerics WER, which is the WER based on a 10-hour dataset created by sampling audio clips rich in alphanumeric content.

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| Alphanumerics WER | 4.00% | 5.06% | 3.84% | 4.18% |

All models transcribe alphanumerics reliably and outperform all other tested model providers, with Whisper large-v3 having a slight edge in this category.

Example

audio-thumbnail
Alphanumerics
0:00
/60.447347

Transcript Universal-2

In this pattern, 3, 8, 6, 11, 9, and so on. Notice that we add 5 to get from 3 to 8. Then we subtract 2 to get from 8 to 6. Then we add 5 to get from 6 to 11. Then we subtract 2 to get From 11 to 9. So the pattern is adding 5 followed by subtracting 2. Continuing with this pattern, 9 plus 5 is 14, 14 minus 2 is 12, and 12 plus 5 is 17. So the next three numbers in the pattern are 14, 12, and 17.

Transcript Whisper large-v3

In this pattern, 3, 8, 6, 11, 9, and so on, notice that we add 5 to get from 3 to 8. Then, we subtract 2 to get from 8 to 6. Then, we add 5 to get from 6 to 11. Then, we subtract 2 to get from 11 to 9. So, the pattern is adding 5 followed by subtracting 2. Continuing with this pattern, 9 plus 5 is 14. 14 minus 2 is 12. And 12 plus 5 is 17. So, the next three numbers in the pattern are 14, 12, and 17. So, we'll get Then, the different patterns will be like this. If you count us with a scale in the diagram, we'll go

Formatting

Formatting is also important for producing transcripts that are easy to read: specifically, correct punctuation, capitalization, and Inverse Text Normalization (ITN), which converts spoken forms like "eighteen seventy five" into written forms like "1875".

To measure formatting accuracy, we calculated U-WER (Unpunctuated Word Error Rate). U-WER is the word error rate computed over formatted outputs from which punctuation marks are deleted. This metric takes into account Truecasing and ITN accuracy, on top of standard ASR accuracy. We also calculated F-WER (Formatted WER), which is similar to U-WER except it additionally measures punctuation accuracy. Note that F-WER tends to fluctuate more than U-WER, given that correct punctuation is not always uniquely determined.
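The difference between the two metrics comes down to how the texts are tokenized before the word error rate is computed. A simplified sketch of the two tokenizations follows; the exact normalization rules used in the evaluation are not reproduced here, so treat this as illustrative:

```python
import re

def tokens_for_uwer(text: str) -> list[str]:
    # U-WER: delete punctuation marks but keep casing and written forms,
    # so truecasing errors ("webb" vs "Webb") and ITN errors
    # ("eighteen seventy five" vs "1875") still count against the model.
    return re.sub(r"[.,!?;:]", "", text).split()

def tokens_for_fwer(text: str) -> list[str]:
    # F-WER: keep punctuation, scoring each mark as its own token.
    return re.findall(r"[\w'$%/-]+|[.,!?;:]", text)

print(tokens_for_uwer("It opened in 2005."))  # ['It', 'opened', 'in', '2005']
print(tokens_for_fwer("It opened in 2005."))  # ['It', 'opened', 'in', '2005', '.']
```

Running a standard WER computation over these token lists yields U-WER and F-WER, respectively.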

| | Universal-2 | Universal-1 | Whisper large-v3 | Whisper turbo |
| --- | --- | --- | --- | --- |
| U-WER | 10.04% | 11.78% | 12.01% | 12.83% |
| F-WER | 15.14% | 16.68% | 16.84% | 17.61% |

Universal-2 shows significant improvements over its predecessor and has a clear advantage in this category. Compared to Whisper large-v3 and turbo, Universal-2 achieves a 16% and 22% relative reduction in U-WER, respectively, and a similar lead in F-WER.

Example

audio-thumbnail
Formatting
0:00
/69.012

Transcript Universal-2

Welcome to another edition of Traveler TV. Today we're at the Arthur Ravenel Jr Bridge, located here. It opened in 2005 and is currently the longest cable stayed bridge in the Western Hemisphere. The design features two diamond shaped towers that span the Cooper river and connect downtown Charleston with Mount Pleasant. The bicycle pedestrian paths provide unparalleled views of the harbor and is the perfect spot to catch a sunrise or sunset. To walk or bike the bridge, you can park on either the downtown side here or on the Mount Pleasant side in Memorial Waterfront Park. To learn more about The Arthur Ravenel Jr. Bridge and other fun things to do in Charleston, SC. Visit our website at travelerofcharleston.com or download our free mobile app exploring Charleston SC.

Transcript Whisper large-v3

Welcome to another edition of Traveler TV. Today we're at the Arthur Ravenel Jr. Bridge located here. It opened in 2005 and is currently the longest cable-stayed bridge in the Western Hemisphere. The design features two diamond-shaped towers that span the Cooper River and connect downtown Charleston with Mount Pleasant. The bicycle or pedestrian paths provide unparalleled views of the harbor and is the perfect spot to catch a sunrise or sunset. To walk or bike the bridge, you can park on either the downtown side here or on the Mount Pleasant side in Memorial Waterfront Park. To learn more about the Arthur Ravenel Jr. Bridge and other fun things to do in Charleston, South Carolina, visit our website at TravelerofCharleston.com or download our free mobile app. Traveler of Charleston.com or download our free mobile app. Exploring Charleston SC.

Hallucinations

One key quirk that has been observed for Whisper is its propensity for hallucinations: plausible-sounding text that was never spoken in the audio, often manifesting in long contiguous blocks of consecutive transcription errors. The large-v3 model is particularly affected. In this recent report, a University of Michigan researcher studying public meetings found hallucinations in 8 out of every 10 audio transcriptions.

If you paid close attention to the Whisper transcripts of the above examples (or examined the outputs in the accompanying Google Colab), you'd have noticed that both the alphanumerics and the proper nouns audio examples contain hallucinations towards the end of the transcript. Despite the fact that Whisper shows strong overall standard ASR accuracy, these hallucinations significantly impact its suitability for real-world use cases.

In our evaluations, the Universal models showed a 30% reduction in hallucination rates compared to Whisper large-v3, making them a more reliable choice for many practical Speech-to-Text applications.
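Hallucinations often surface as looping repetitions of the same phrase, which makes a crude automated screen possible. The heuristic below flags transcripts with an unusually high share of repeated word n-grams; it is a rough screening tool of our own devising, not how the hallucination rates above were measured:

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 5) -> float:
    """Fraction of word n-grams that occur more than once in the transcript."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A looping phrase scores high; normal prose scores near zero.
print(repeated_ngram_ratio("see you in the next video " * 4))  # 1.0
```

Scores well above typical prose levels are worth a manual listen before the transcript is passed to downstream systems.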

Example

audio-thumbnail
Hallucination example
0:00
/27.09

Transcript Universal-2

All right, friends. So you can see I have raised my balance to 37.7 US dollar. You can see.

Transcript Whisper large-v3

All right friends, so you can see I have raised my balance to 37.7 US dollar. Thank you for watchingINDONESIA Kennel, see you in the next video here CAR interaction.

Conclusion

Universal-2 emerges as the leading model in most categories:

  • Best overall accuracy (6.68% WER)
  • Superior proper noun handling (13.87% PNER)
  • Best formatting accuracy (10.04% U-WER)
  • 30% reduction in hallucination rates compared to Whisper

It shows significant improvements over its predecessor, which is qualitatively demonstrated by a human preference test in which 73% of users (nearly 3 out of 4 people) preferred Universal-2's output over Universal-1's.

Whisper large-v3 shows some notable strengths and limitations:

  • Best alphanumeric transcription accuracy (3.84% WER)
  • Decent performance across other categories
  • Requires careful consideration due to documented hallucination issues

Whisper turbo offers a balanced trade-off:

  • Notable weakness in proper noun detection (18.18% PNER)
  • Performance close to large-v3 in the other metrics, making it a good choice over large-v3 when prioritizing speed over accuracy
  • Ideal for local deployments with limited resources (~6GB VRAM)

For a more in-depth evaluation including additional metrics and comparisons against other model providers, read the Universal-2 research report.

If you're interested in learning more about the process of properly evaluating models in an objective, scientific way, you can read this blog post.