Build & Learn
November 21, 2025

Why evals in voice AI are so hard (and how to fix them)

Traditional metrics like WER often miss what matters in voice AI. Learn how to build custom evaluation frameworks that align with real user outcomes and product goals.

Ryan Seams
VP, Customer Solutions

Voice AI evals are hard. Harder than they look.

Humans communicate meaning through tone, pacing, and context, not just words. This creates a fundamental evaluation problem: the metrics that look good in your dashboards often miss what really matters to users.

When evals misalign, teams pick the "wrong" model, or one that performs well on paper but feels off in real conversations. The result? You might ship a voice agent that technically works but frustrates users in ways your metrics never caught.

Here's how to fix that.

For example, consider these two transcriptions from a customer support call:

Transcript A:
"Hello. Ryan from AssemblyAI support. Can I help?"

Transcript B:
"Hello, this is Adam from Assembled AI support. How can I help?"

Run these through a traditional Word Error Rate (WER) evaluation and Transcript B scores significantly better. It's grammatically complete, includes proper articles, and follows conversational conventions.

But any human reading this would choose Transcript A. It gets the critical words right: name and company. That’s what the end user cares about most when they read the text.

There's a practical benefit, too: because Transcript A captures the critical details correctly, like the agent's name (Ryan) and the company name (AssemblyAI), downstream AI workflows that depend on those key terms are more likely to succeed.

This disconnect reveals something fundamental about voice AI evaluation: humans prioritize semantic correctness and clarity of intent over grammatical or lexical accuracy. Your evaluation needs to measure that, or it’ll mislead you.
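
To make the gap concrete, here's a minimal sketch that scores both transcripts two ways: raw WER (using the jiwer package, assuming it's installed) and recall of a hand-picked list of critical entities. The reference transcript is an assumption for illustration.

```python
import jiwer

# Illustrative ground truth for the call opening (assumed for this sketch).
reference = "Hello, this is Ryan from AssemblyAI support. How can I help?"

transcript_a = "Hello. Ryan from AssemblyAI support. Can I help?"
transcript_b = "Hello, this is Adam from Assembled AI support. How can I help?"

# The entities a support user actually cares about.
critical_entities = ["Ryan", "AssemblyAI"]

def entity_recall(text: str, entities: list[str]) -> float:
    """Fraction of critical entities that appear verbatim (case-insensitive)."""
    found = sum(1 for e in entities if e.lower() in text.lower())
    return found / len(entities)

for name, hyp in [("A", transcript_a), ("B", transcript_b)]:
    print(
        f"Transcript {name}: "
        f"WER={jiwer.wer(reference, hyp):.2f}, "
        f"entity recall={entity_recall(hyp, critical_entities):.2f}"
    )
```

With these illustrative strings, Transcript B should come out ahead on WER while scoring zero on entity recall, which is exactly the disconnect described above.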

The case for use-case-specific evals

"Right" isn't universal in voice AI. It’s more nuanced and context-specific.

What matters in customer support differs fundamentally from what matters in medical dictation or voice agent conversations. Your evaluation framework should reflect these differences.

Customer support: Entity accuracy above all

In support contexts, the first utterance sets the tone for the entire interaction. Users need to know immediately that they're talking to the right person at the right company.

Priority metrics include:

  • Entity recall: Did the model capture names, product IDs, and company identifiers correctly?
  • First-utterance accuracy: How often does the opening statement contain the critical context?
  • Issue classification accuracy: Can downstream systems route the call correctly based on the transcript?

WER matters here, but it's secondary. A transcript that perfectly captures filler words but mangles the product name has failed regardless of its WER score.
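
As a rough illustration, here's a sketch of a first-utterance check: given a diarized call, does the opening agent turn contain the entities that matter? The turn format and the required-entity list are hypothetical.

```python
# Sketch: first-utterance accuracy for a support call.

def first_utterance_ok(turns: list[dict], required: list[str]) -> bool:
    """True if the first agent turn contains every required entity."""
    first_agent_turn = next(
        (t["text"] for t in turns if t["speaker"] == "agent"), ""
    )
    return all(r.lower() in first_agent_turn.lower() for r in required)

# Hypothetical diarized call.
call = [
    {"speaker": "agent", "text": "Hello. Ryan from AssemblyAI support. Can I help?"},
    {"speaker": "customer", "text": "Hi, my account is locked."},
]

print(first_utterance_ok(call, required=["Ryan", "AssemblyAI"]))  # True
```

Aggregating this check over a test set gives you the first-utterance accuracy metric from the list above.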

Dictation and transcription: Verbatim accuracy

Medical dictation, legal transcription, and court reporting require different priorities. Here, you actually do want verbatim accuracy, including the difference between "affect" and "effect" or capturing exact phrasing for legal purposes.

Priority metrics:

  • WER remains critical: Literal word accuracy directly impacts the output's usefulness
  • Punctuation accuracy: Proper sentence boundaries change meaning in legal contexts
  • Speaker attribution: In multi-party dictation, who said what matters
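
To see why normalization choices matter here, this sketch (again assuming the jiwer package, with illustrative sentences) compares verbatim WER against a punctuation-stripped WER. The stripped score hides exactly the errors a legal reviewer would care about.

```python
import string
import jiwer

reference = "The defendant, who was present, signed the agreement."
hypothesis = "The defendant who was present signed the agreement"

def strip_punct(text: str) -> str:
    """Remove punctuation so only the bare words are compared."""
    return text.translate(str.maketrans("", "", string.punctuation))

print("Verbatim WER:       ", round(jiwer.wer(reference, hypothesis), 2))
print("Punct-stripped WER: ", round(jiwer.wer(strip_punct(reference),
                                              strip_punct(hypothesis)), 2))
```

For dictation and legal use cases, the verbatim score is the one that reflects the output's real usefulness.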

Voice agents: Intent plus emotional tone

Conversational AI agents need to understand what users want and respond appropriately to emotional cues. A user who says, "I've been on hold for 20 minutes," isn't just stating a fact; they're probably expressing frustration at the wait time.

Priority metrics:

  • Intent classification accuracy: Does the model understand what action the user wants?
  • Conversational flow: Is turn-detection triggered correctly?
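
Here's a minimal sketch of the intent side, assuming you have gold intent labels for a set of utterances and the intents your agent predicted from them. The labels are hypothetical, and scikit-learn is assumed to be installed.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels vs. the agent's predictions for the same utterances.
gold      = ["cancel_order", "check_status", "escalate", "check_status"]
predicted = ["cancel_order", "check_status", "check_status", "check_status"]

print("Accuracy:", accuracy_score(gold, predicted))
print("Macro F1:", round(f1_score(gold, predicted, average="macro"), 2))
```

Macro F1 is worth tracking alongside accuracy because rare intents (like "escalate") are often the ones where mistakes hurt most.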

The "vibe eval" approach

Sometimes the best evaluation method is the most subjective: pretending you're the end user and noting what feels off.

Vibe evals capture what metrics can't: the intangible qualities that make AI interactions feel natural or awkward. Does the agent sound confident or hesitant? Does it pick up on conversational cues? Does the pacing feel right?

Here are some tips for running effective vibe evals:

Read transcripts as if you're the user
Don't analyze them as an engineer; read them as someone trying to solve a problem. What jumps out? What would confuse you? Where would you lose patience?

Talk directly with the voice agent
Metrics can't capture the experience of actually interacting with the system. Have real conversations. Try edge cases. Push the boundaries of what it should handle.

Document what feels wrong
"This feels off" isn't actionable, but you can translate vibes into signals:

  • Long pauses that feel unnatural → measure silence duration between turns
  • Robotic tone → analyze prosody patterns and pitch variation
  • Mismatched empathy → compare detected sentiment vs. appropriate response

Convert observations into quantitative signals
Once you know what feels wrong, you can measure it. If agents feel overly formal, measure formality scores. If they interrupt users, measure turn-taking violations.
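
For example, here's a minimal sketch of turning "the pauses feel unnatural" into a number by measuring the silence between turn boundaries. The turn timestamps and the 1.5-second threshold are illustrative assumptions.

```python
# Hypothetical turn timestamps (in seconds) from a diarized conversation.
turns = [
    {"speaker": "user",  "start": 0.0, "end": 3.2},
    {"speaker": "agent", "start": 5.9, "end": 9.1},   # 2.7 s gap: feels slow
    {"speaker": "user",  "start": 9.4, "end": 11.0},
]

# Gap between the end of one turn and the start of the next.
gaps = [later["start"] - earlier["end"] for earlier, later in zip(turns, turns[1:])]

slow_responses = [g for g in gaps if g > 1.5]  # threshold is a judgment call
print(f"Max gap: {max(gaps):.1f}s, gaps over 1.5s: {len(slow_responses)}")
```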

The takeaway: metrics catch trends; vibes catch bad experiences before they reach users.

Run vibe evals alongside quantitative testing. Use them to generate hypotheses about what metrics you should be tracking. Let them guide which models advance to production testing.

From metrics to meaning: A better voice eval process

Building an effective evaluation system for voice AI requires balancing multiple signal types. Here's a framework that works:

Step 1: Define your product goal

Start with the outcome you want to drive, not the features you want to build.

Weak goal: "Reduce WER to below 5%"
Good goal: "Enable users to resolve common issues without human escalation"

Weak goal: "Improve transcription accuracy"
Good goal: "Allow medical professionals to complete documentation 30% faster"

Your product goal determines which metrics matter and which are just noise.

Step 2: Choose your eval stack

Structure your evaluation in three layers:

Regression metrics catch when you've broken something that previously worked:

  • Word Error Rate (WER)
  • Real-Time Factor (RTF) / Latency
  • Hallucination rate (when models generate words not in the audio)

Custom metrics measure progress toward your product goal:

  • Entity extraction accuracy for your specific domain
  • Intent classification F1 score
  • Conversation success rate (user resolved issue without escalation)

Vibe checks catch UX problems metrics miss:

  • Internal team conversations with the agent
  • Small-scale user testing (10-20 conversations)
  • Review of edge cases and failure modes
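
One way to wire these layers together is a simple gate that a candidate model must pass before promotion. This is only a sketch; the metric names and thresholds are hypothetical and should come from your own product goal.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    wer: float              # regression layer
    latency_p95_ms: float   # regression layer
    entity_recall: float    # custom layer
    intent_f1: float        # custom layer
    vibe_flags: int         # issues raised during manual review

def ready_for_production(r: EvalReport) -> bool:
    """A candidate must clear all three layers to advance."""
    regression_ok = r.wer <= 0.12 and r.latency_p95_ms <= 800
    custom_ok = r.entity_recall >= 0.95 and r.intent_f1 >= 0.85
    vibes_ok = r.vibe_flags == 0
    return regression_ok and custom_ok and vibes_ok

report = EvalReport(wer=0.09, latency_p95_ms=640, entity_recall=0.97,
                    intent_f1=0.88, vibe_flags=1)
print(ready_for_production(report))  # False: a vibe flag blocks promotion
```

The point isn't the specific thresholds; it's that a vibe flag can block a model that looks fine on paper.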

Step 3: Iterate and calibrate

Your first eval framework won't be perfect. That's expected.

Run your eval stack on historical data where you know the outcomes. Did high-scoring models correlate with better user experience? If not, adjust the weights or add new metrics.
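
A quick way to run that check is to correlate your composite eval score with a known outcome, such as resolution rate, across models you've already evaluated on the same historical calls. The numbers below are hypothetical, and scipy is assumed.

```python
from scipy.stats import spearmanr

# One entry per candidate model, evaluated on the same historical calls.
composite_scores = [0.71, 0.78, 0.83, 0.90]   # your eval stack's score
resolution_rates = [0.52, 0.61, 0.58, 0.74]   # known production outcome

corr, p_value = spearmanr(composite_scores, resolution_rates)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.2f})")
# A weak or negative correlation suggests the eval weights the wrong things.
```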

Compare quantitative scores with vibe eval results. When they disagree, investigate why. Often you'll find a metric you should be tracking but aren't.

Create feedback loops between production performance and eval design. When users report issues that your evals didn't catch, update your eval criteria.

The outcome: You'll pick models that feel right to users, not just look right in dashboards.

Measure what moves the user

Your voice AI should move users toward your product goal, and your evals should measure whether it's working. Most teams evaluate voice AI using academic metrics, not product outcomes.

Build evals around what users value, not what's easiest to measure. Combine standard quantitative metrics with custom ones aligned to your goals, and run vibe evals to catch the UX issues metrics miss.

Stop asking "which model has the lowest WER?" Start asking "which model helps users accomplish their goals?"

Test Voice AI for free

Test Voice AI on your own audio in our free, no-code playground.

Test in playground