Announcement

Beyond Word Error Rate: Universal-2 Delivers Accuracy Where It Matters

Our most advanced speech-to-text model captures the real-world complexity of human speech.

Universal-2 Comprehensive Speech-to-Text for solving last-mile challenges

Building AI applications with speech recognition should be straightforward: process audio, get structured data, take action. Yet despite the industry's claims of 90%+ accuracy, developers face a persistent challenge: the gap between raw audio files and reliable, structured outputs.

The hidden cost of "good enough" speech-to-text

Consider a simple example: Your application needs to parse "sarah.johnson@acme-corp.com" from an audio stream. Instead, you get "Sarah dot Johnson at acme hyphen corp dot com" – technically accurate, but programmatically useless.

Or take a phone number: "555-555-5555" becomes "five five five five five five five five five five" – good luck implementing that regex.

These aren't just formatting issues. They're fundamental problems that impact:

  • Data Structure: Email addresses that don't validate
  • Program Flow: Phone numbers that can't be dialed
  • API Integration: Dates and times in inconsistent formats
  • User Experience: Poor quality output that erodes customer trust
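
To make the cost concrete, here is a minimal Python sketch that runs the spoken-form and formatted versions of the examples above through the kind of validation an application might apply. The regular expressions and helper names are assumptions chosen for illustration; they are deliberately simple and are not production-grade validators.

```python
import re

# Simplified patterns for illustration only; real validators are stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


def is_email(text: str) -> bool:
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(text) is not None


def is_phone(text: str) -> bool:
    """Return True if the whole string looks like NNN-NNN-NNNN."""
    return PHONE_RE.fullmatch(text) is not None


# Spoken-form transcripts: every word is right, nothing validates.
spoken_email = "Sarah dot Johnson at acme hyphen corp dot com"
spoken_phone = " ".join(["five"] * 10)

# Formatted transcripts: ready for downstream code.
formatted_email = "sarah.johnson@acme-corp.com"
formatted_phone = "555-555-5555"

for label, value, check in [
    ("spoken email", spoken_email, is_email),
    ("formatted email", formatted_email, is_email),
    ("spoken phone", spoken_phone, is_phone),
    ("formatted phone", formatted_phone, is_phone),
]:
    print(f"{label}: {'valid' if check(value) else 'rejected'}")
```

Both spoken-form strings are rejected even though every word was transcribed correctly, which is exactly the gap between word accuracy and usable output.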

The problem isn't necessarily word accuracy — it's the mismatch between audio inputs and what applications actually need for automation. This is why we built Universal-2.

Instead of optimizing for Word Error Rate (WER), we focused on delivering immediately usable data: properly formatted emails, validated phone numbers, and structured timestamps – the kind of output that lets you build reliable, production-ready applications.

The new standard for speech recognition

While the industry fixates on WER, real-world applications demand more. Consider this: Universal-2 builds on Universal-1's industry-leading WER (6.88%) with just a 3% relative improvement, to 6.68%.

Yet in blind human evaluation tests, 73% of users – nearly 3 out of 4 people – preferred Universal-2's output. Why? Because in today's AI-driven environment, the true measure of speech recognition lies in what your applications can do with the output:

  • Can your AI notetaker distinguish between "Dave from Product" and "Dave from Marketing" and capture action items?
  • Can your sales intelligence platform show "2:30 PM EST" instead of "two thirty p.m. eastern standard time"?
  • Can your call center agents see "1-555-0123" instead of "one five five five zero one two three" in their customer records?
  • Can your conversation analytics distinguish between "Sarah from Salesforce" and "Sara from StateForce" in pipeline reports?
Results of the side-by-side human preference test between Universal-2 and Universal-1. Only samples where at least two-thirds of the judges agreed on their ratings were included, which accounted for 93% of all test samples.

Critical last-mile challenges solved

These real-world demands have pushed us to rethink how we measure and deliver speech recognition accuracy. While traditional metrics paint a broad picture, the true test of speech-to-text value lies in how it handles the critical details that power business applications.

Universal-2 addresses these challenges with breakthrough improvements:

  1. A 24% improvement in the recognition of rare words such as names, brands, and locations, for more personalized customer-facing communications, more intuitive automated systems, and cleaner integration processes.
  2. A 21% increase in accuracy in alphanumerics across critical data like phone numbers, zip codes, and other numerical identifiers for smoother customer experiences, better critical data management, and clearer escalation and reporting.
  3. A 15% improvement in text formatting with proper punctuation and casing across things like emails, dates, and dollar amounts for faster information navigation and more natural transcripts in customer products.
This bar chart compares Universal-2, Universal-1, commercial ASR providers, and Whisper large-v3 across several categories. Word Error Rate (WER) is used for Standard ASR, Alphanumerics, and Accented Speech. Proper Noun Error Rate (PNER) is used for Proper Nouns. The performance of “ASR + ITN + Truecasing” is evaluated using Unpunctuated WER (U-WER). To improve readability, bars are truncated at 25% to minimize the impact of outliers.

Here's how these improvements transform three crucial business scenarios:

Sales intelligence that drives revenue

A sales team reviews a discovery call where a prospect says: "We're currently using Zoom.ai for our European team of 250 people, but we're looking to consolidate vendors by Q2." With Universal-2, every critical detail is captured precisely: the competitor name, team size, location, and timeline. This level of accuracy means sales teams can:

  • Track competitive opportunities accurately
  • Size deals based on precise user counts
  • Prioritize opportunities based on accurate timelines

Customer support that gets it right the first time

During a support call, a customer explains: "I've been trying to activate my iPhone 15 Pro. The IMEI is 35-824919-198374-1, and I'm getting error code AX-2103." Universal-2 captures and formats the digits and codes, enabling support teams to:

  • Look up exact product details immediately
  • Reference correct error codes without callbacks
  • Update customer records accurately the first time

Telehealth platforms that reduce administrative burden

In a telehealth consultation, a provider mentions: "We'll schedule your follow-up for October 30th at 2:30 PM EST. I'm prescribing amoxicillin, 500mg, NDC 43063-0545-30." Universal-2 ensures:

  • Appointments are scheduled with correct dates and times
  • Medication details are captured with perfect accuracy
  • Insurance codes and patient information are properly formatted

Improving conversation intelligence

By solving critical last-mile challenges, Universal-2 isn't just improving accuracy – it's enabling the next generation of AI-native applications. You can now build sophisticated conversation intelligence systems that turn raw audio data into sharper insights, faster workflows, and best-in-class product experiences.

This enables your applications to deliver:

  1. Real-time intelligence: Capture competitive insights and customer signals as they happen.
  2. Automated workflows: Trigger actions based on accurately captured details without manual verification.
  3. Structured data: Transform raw conversations into properly formatted, immediately usable business data.
  4. Scalable analysis: Process millions of hours of conversations with confidence in the details.

This fundamental shift in accuracy and reliability means you can build AI applications that don't just transcribe conversations, but understand and act on them in real time.

Technical breakthroughs driving real-world improvements

While most speech-to-text providers focus solely on reducing WER, Universal-2's architecture was designed to solve the complex challenges of modern business communication. Our approach focused on three key technical innovations that directly address the last-mile challenges in speech recognition.

Innovation #1: tokenization for real-world speech

One of the most challenging aspects of speech recognition is handling repeated sequences – think phone numbers, product codes, or any sequence of repeated characters. Traditional models often struggle with these patterns, leading to dropped digits or mangled codes.

Universal-2 introduces a breakthrough approach using a special <repeat_token> in its tokenization scheme. When processing a sequence like a phone number (555-555-5555), the model no longer needs to predict the same digit multiple times in succession. Instead, it can recognize patterns of repetition, leading to:

  • Up to 90% improvement in accuracy for repeated sequences
  • Precise capture of phone numbers and product codes
  • Accurate handling of any repetitive patterns in speech
WER on synthetic datasets with consecutively repeating digits/words obtained by Universal-2, in comparison to Universal-1.

The impact: A 21% improvement in alphanumeric accuracy that captures critical information like phone numbers, product codes, and customer IDs.
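
To illustrate the idea, here is a toy Python sketch of one way a repeat marker could work. This is an assumption made for illustration, not Universal-2's actual tokenizer: the `encode` and `decode` helpers are hypothetical, and the scheme shown (placing a marker between consecutive identical tokens, then stripping it after decoding) is just one plausible reading of the description above.

```python
REPEAT = "<repeat_token>"


def encode(tokens: list[str]) -> list[str]:
    """Insert REPEAT between consecutive identical tokens.

    After encoding, the same token never appears twice in a row, so a
    model trained on these targets never has to emit an immediate repeat.
    Assumes REPEAT itself never occurs in the input.
    """
    out: list[str] = []
    for tok in tokens:
        if out and out[-1] == tok:
            out.append(REPEAT)
        out.append(tok)
    return out


def decode(tokens: list[str]) -> list[str]:
    """Drop the REPEAT markers to recover the original sequence."""
    return [tok for tok in tokens if tok != REPEAT]


digits = list("5555555555")  # the ten digits of 555-555-5555
encoded = encode(digits)

print(encoded[:5])
print(decode(encoded) == digits)
```

In this toy scheme the ten identical digits become an alternating sequence of digit and marker, and decoding restores the original exactly.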

Innovation #2: enhanced proper noun recognition

For business conversations, accuracy with proper nouns – company names, products, people – is crucial. Universal-2 achieves this through two key technical advances:

Expanded training data

  • Doubled supervised training data from 150,000 to 300,000 hours
  • Enhanced data cleaning pipeline for higher quality training
  • Focused proper noun coverage across industries

Advanced neural architecture

  • Improved token-level prediction for complex proper nouns
  • Enhanced context understanding for brand and product names
  • Better handling of industry-specific terminology
Performance on proper noun test set, obtained by Universal-2, in comparison to Universal-1 and other open-source and commercial ASR systems.

The impact: A 24% improvement in recognition accuracy for proper nouns like names, brands, and locations, which are essential for maintaining context in conversations.

Innovation #3: neural text formatting pipeline

Perhaps the most visible improvement in Universal-2 is its ability to produce properly formatted output. We've completely reimagined our text formatting pipeline with an all-neural architecture that includes two models:

Multi-objective tagging model

  • Unified transformer architecture for punctuation and casing
  • Context-aware formatting decisions
  • Improved handling of mixed-case words and special formats

Text span conversion model

  • Selective application of formatting to relevant sections
  • Enhanced numerical and date formatting
  • Reduced computational overhead through smart targeting

The impact: A 15% improvement in text formatting accuracy, delivering immediately readable and actionable outputs while maintaining real-time processing speed.
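
The division of labor between the two models can be sketched in a few lines of Python. Everything here is a stand-in: the `TokenTag` and `Span` types, the tiny lookup table, and the hand-written tags are hypothetical, and simple rules replace what the neural models would predict. The sketch only shows the shape of the pipeline, with per-token punctuation and casing tags applied everywhere and span conversion applied only to the marked section.

```python
from dataclasses import dataclass


@dataclass
class TokenTag:
    casing: str  # "lower", "capitalize", or "upper"
    punct: str   # punctuation appended after the token ("" for none)


@dataclass
class Span:
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    kind: str    # what the span represents, e.g. "time"


# Tiny lookup standing in for a learned span conversion model.
NUMBER_WORDS = {"two": 2, "thirty": 30}


def apply_tag(token: str, tag: TokenTag) -> str:
    """Stage 1: apply the casing and punctuation tag to one token."""
    if tag.casing == "capitalize":
        token = token.capitalize()
    elif tag.casing == "upper":
        token = token.upper()
    return token + tag.punct


def convert_span(words: list[str], kind: str) -> str:
    """Stage 2: rewrite a marked span into its written form."""
    if kind == "time" and len(words) == 3:
        hour, minute, meridiem = words
        return f"{NUMBER_WORDS[hour]}:{NUMBER_WORDS[minute]:02d} {meridiem.upper()}"
    return " ".join(words)  # unknown span kinds pass through unchanged


def format_transcript(tokens, tags, spans) -> str:
    span_at = {span.start: span for span in spans}
    out, i = [], 0
    while i < len(tokens):
        span = span_at.get(i)
        if span is not None:
            text = convert_span(tokens[span.start:span.end], span.kind)
            # Keep the punctuation predicted for the span's last token.
            out.append(text + tags[span.end - 1].punct)
            i = span.end
        else:
            out.append(apply_tag(tokens[i], tags[i]))
            i += 1
    return " ".join(out)


tokens = "we will meet at two thirty pm".split()
tags = [TokenTag("capitalize", "")] + [TokenTag("lower", "")] * 5 + [TokenTag("lower", ".")]
spans = [Span(start=4, end=7, kind="time")]

formatted = format_transcript(tokens, tags, spans)
print(formatted)
```

Only the three tokens inside the marked span go through conversion; the rest of the transcript needs nothing more than its tags, which is the "selective application" idea in miniature.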

F-WER (Formatted WER) and U-WER (Unpunctuated WER) on Universal-2, Universal-1, and popular speech-to-text models.

Performance without compromise

These improvements don't come at the cost of performance — Universal-2 maintains the speed and efficiency needed for real-time applications while delivering these substantial accuracy improvements.

Beyond word error rate

When evaluating speech recognition providers, looking at Word Error Rate (WER) alone leads to a costly reality: your users end up seeing "sarah dot johnson at acme hyphen corp dot com" instead of proper email addresses, "state force" instead of "Salesforce," and "one five five five" instead of formatted phone numbers.

Every mangled email, every spelled-out number, every garbled company name erodes user trust and creates frustrating product experiences – no matter how impressive the accuracy claim might be.

Measuring what actually matters

The fixation on WER has created a dangerous blind spot in how we evaluate speech recognition. Getting 90% of words right means little when the critical elements – email addresses, product codes, company names, speaker attribution – come back in formats that frustrate users and break automated workflows. 
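
A short Python sketch shows why WER alone hides this. WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words, so every word counts the same. The sentences below are made-up examples, and the `wer` helper is a minimal implementation written for illustration rather than a benchmark-grade scorer (it does no text normalization).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the renewal quote to sarah at salesforce before friday"

# One harmless error: an article is wrong.
harmless = "please send a renewal quote to sarah at salesforce before friday"

# One damaging error: the company name is wrong.
damaging = "please send the renewal quote to sarah at stateforce before friday"

print(f"harmless error WER: {wer(reference, harmless):.3f}")
print(f"damaging error WER: {wer(reference, damaging):.3f}")
```

Both hypotheses score the same WER, one error in eleven words, yet only one of them breaks a CRM lookup.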

Universal-2 proves that speech recognition evaluation must go beyond WER. It's about measuring accuracy where it matters most:

  1. 24% better proper noun accuracy - significantly improving capture of company names, locations, and customer names
  2. 21% improvement in alphanumerics - delivering more reliable phone numbers, credit cards, and product codes
  3. 15% enhancement in formatting accuracy - delivering consistently better formatted dates, times, and prices

By solving critical last-mile challenges, Universal-2 enables a new generation of AI-native applications that can:

  • Transform raw speech into structured business data
  • Power real-time decision-making from voice interactions
  • Enable sophisticated AI analysis of customer conversations
  • Drive automated workflows from spoken interactions

This isn't just better speech recognition — it's reliable data quality for your applications. It's why 73% of users preferred Universal-2 in blind tests over Universal-1. They experienced what we've known all along: accuracy only matters when it captures the details that drive business value.

Start building on Universal-2 today

Universal-2 is available now, ready to power your next generation of AI applications. Get started with our API for free and experience accuracy where it matters most.