Evaluating Pre-recorded STT models

Introduction

The high-level objective of a pre-recorded STT model evaluation is to answer one question: which speech-to-text model is best for my product? This guide provides a step-by-step framework for evaluating and benchmarking pre-recorded speech-to-text models, with specific guidance for evaluating Universal-3.5 Pro and its prompting capabilities.

Need help with evaluations or prompt optimization? Contact our Sales team — we can help you design an evaluation, optimize prompts for your audio, and benchmark against your ground truth data.

Evaluation metrics

Traditional metrics

Word Error Rate (WER)

WER = \frac{S + D + I}{N}

This formula takes the number of Substitutions (S), Deletions (D), and Insertions (I), and divides their sum by the Total Number of Words in the ground truth transcript (N).

While the WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data. WER tells you how different the automatic transcription is from the human transcription, and it is generally a reliable indicator of transcription quality. For more on WER as a metric, read Dylan Fox’s blog post on Word Error Rate.
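As a rough illustration of the formula above, the following sketch aligns two transcripts with a word-level edit distance and counts S, D, and I. The function name `wer_counts` and the sample sentences are made up for this example, and the inputs are assumed to be already normalized.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Align two transcripts at the word level and count S, D, and I."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dp[i][j] holds (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # words match, no cost
                continue
            c, s, d, ins = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, ins)
            c, s, d, ins = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, ins)
            c, s, d, ins = dp[i][j - 1]
            insertion = (c + 1, s, d, ins + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    cost, s, d, ins = dp[-1][-1]
    return {"S": s, "D": d, "I": ins, "N": len(ref), "WER": cost / len(ref)}


result = wer_counts(
    "please call doctor smith tomorrow",
    "please call doctor smyth tomorrow morning",
)
print(result)
```

Here the hypothesis has one substitution (smith to smyth) and one insertion (morning) against a 5-word reference, so WER is 2 / 5 = 0.4.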

Concatenated minimum-Permutation Word Error Rate (cpWER)

\text{cpWER} = \frac{S_{\text{spk}} + D + I}{N}

cpWER is similar to WER, but it also counts words attributed to the wrong speaker as errors. The primary difference from standard WER is how S is calculated: S_{\text{spk}} counts both word substitutions and correctly transcribed words that are assigned to the wrong speaker. A correct word with an incorrect speaker label counts as an error, so cpWER penalizes both transcription and speaker diarization mistakes.
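The sketch below shows one common way to compute cpWER, as the name suggests: concatenate each speaker's words, try every mapping of hypothesis speakers to reference speakers, and keep the mapping with the fewest errors. The function names and sample dialogue are illustrative assumptions. In this formulation a misattributed word surfaces as an error in each affected speaker stream, which may differ in detail from how a particular toolkit counts speaker errors.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker: dict[str, str], hyp_by_speaker: dict[str, str]) -> float:
    """Minimum WER over all assignments of hypothesis speakers to reference speakers."""
    refs = [text.split() for text in ref_by_speaker.values()]
    hyps = [text.split() for text in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers
    size = max(len(refs), len(hyps))
    refs += [[]] * (size - len(refs))
    hyps += [[]] * (size - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference must contain at least one word")
    # Brute force over permutations: only practical for a handful of speakers
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


reference = {"A": "hello how are you", "B": "i am fine thanks"}
swapped_labels = {"spk0": "i am fine thanks", "spk1": "hello how are you"}
wrong_speaker = {"spk0": "hello how are you thanks", "spk1": "i am fine"}
print(cp_wer(reference, swapped_labels))
print(cp_wer(reference, wrong_speaker))
```

Swapping the speaker labels costs nothing because the best permutation undoes the swap. Moving "thanks" to the wrong speaker produces two errors over 8 reference words, so cpWER is 0.25 even though every word was transcribed correctly.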

Formatted WER (F-WER)

F-WER is similar to WER, but it does not apply text normalization, so formatting differences (such as punctuation and capitalization) count as errors in addition to word differences. Therefore, F-WER is always greater than or equal to WER.

Sentence Error Rate (SER)

\text{SER} = \frac{N_{\text{err}}}{N_{\text{sent}}}

The Sentence Error Rate (SER) is the ratio of the number of sentences with one or more errors to the total number of sentences.
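A minimal sketch of the ratio, assuming the reference and hypothesis have already been split into sentences that line up one-to-one (real transcripts usually need an alignment step first). The function name and sample sentences are illustrative.

```python
def sentence_error_rate(ref_sentences: list[str], hyp_sentences: list[str]) -> float:
    """Share of sentences whose word sequence differs from the reference."""
    if len(ref_sentences) != len(hyp_sentences) or not ref_sentences:
        raise ValueError("expected two non-empty, equally long sentence lists")
    errors = sum(
        ref.split() != hyp.split()  # any word difference marks the sentence as wrong
        for ref, hyp in zip(ref_sentences, hyp_sentences)
    )
    return errors / len(ref_sentences)


refs = ["how are you", "i am fine", "see you tomorrow", "goodbye"]
hyps = ["how are you", "i am fine", "see you tomorrow morning", "goodbye"]
print(sentence_error_rate(refs, hyps))
```

One of the four sentences contains an error, so SER is 0.25. A single wrong word and a completely wrong sentence count the same, which is why SER is usually read alongside WER.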

Diarization Error Rate (DER)

\text{DER} = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}}

This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speech attributed to the wrong speaker (confusion), and divides their sum by the total speech duration (total).
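As a worked example of the arithmetic, the sketch below computes DER from durations in seconds. The function name and the numbers are invented for illustration. Measuring the three error durations from real diarization output requires a segment-level comparison that is not shown here.

```python
def diarization_error_rate(
    false_alarm: float, missed_detection: float, confusion: float, total_speech: float
) -> float:
    """DER from error durations and total reference speech duration (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total speech duration must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# 3 s of false alarm, 5 s of missed speech, 4 s of speaker confusion, 120 s of speech
der = diarization_error_rate(3.0, 5.0, 4.0, 120.0)
print(f"DER = {der:.1%}")
```

With 12 seconds of errors over 120 seconds of speech, DER is 10%.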

Missed Entity Rate (MER)

\text{MER} = 1 - \frac{N_{\text{rec}}}{N_{\text{total}}}

Fundamentally, MER is the complement of recall (one minus recall) computed for specified target entities. Recall here is the number of correctly transcribed entities (N_{\text{rec}}) relative to their total occurrence count (N_{\text{total}}). MER accounts for multiple occurrences of the same entity and their positions within the hypothesis transcription. Our Research team proposes this as the best metric to measure the effectiveness of word boost.

For a simpler approach when evaluating high-stakes entities (such as credit card numbers, names, or dosages), consider a binary pass/fail metric per file: score each file as 1 if the model captured the target entity correctly, or 0 if it did not. Run this across 100 or more files for statistical reliability. This is especially useful when you need a quick signal on entity accuracy at scale without computing full MER breakdowns.
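The following sketch approximates both ideas with simple occurrence counting. It is a simplification: it ignores entity positions, which the full metric takes into account, and the function names, entity list, and sample text are assumptions made for the example.

```python
import re


def count_occurrences(text: str, entity: str) -> int:
    """Case-insensitive whole-word count of an entity in a transcript."""
    pattern = r"\b" + re.escape(entity.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def missed_entity_rate(reference: str, hypothesis: str, entities: list[str]) -> float:
    """1 - (recognized occurrences / total occurrences) over the target entities."""
    total = recognized = 0
    for entity in entities:
        n_ref = count_occurrences(reference, entity)
        n_hyp = count_occurrences(hypothesis, entity)
        total += n_ref
        recognized += min(n_ref, n_hyp)  # extra hypothesis hits do not add credit
    if total == 0:
        raise ValueError("none of the target entities occur in the reference")
    return 1 - recognized / total


def entity_pass(hypothesis: str, entity: str) -> int:
    """Binary per-file score: 1 if the entity was captured at least once, else 0."""
    return 1 if count_occurrences(hypothesis, entity) > 0 else 0


reference = "patient takes metformin and omeprazole metformin dose unchanged"
hypothesis = "patient takes metformin and omeprizole metformin dose unchanged"
print(missed_entity_rate(reference, hypothesis, ["metformin", "omeprazole"]))
print(entity_pass(hypothesis, "omeprazole"))
```

Two of the three entity occurrences are recognized, so MER is 1 - 2/3, roughly 0.33, and the file fails the binary check for omeprazole. Averaging `entity_pass` over many files gives the per-file pass rate described above.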

Metrics for Universal-3.5 Pro

Universal-3.5 Pro is significantly more capable than prior models, and traditional WER alone may not fully capture its performance. The following metrics provide a more complete picture.

Semantic WER

Traditional WER treats every difference between the model output and a reference transcript as an error—even when the difference is semantically equivalent. Semantic WER corrects this by normalizing equivalent words and phrases before calculating WER, so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors.

Rule-based normalization

At its simplest, Semantic WER is a preprocessing step. Before running standard WER, apply find-and-replace rules to both the reference and hypothesis transcripts:

Number formats: 1300 → thirteen hundred, $5 → five dollars
Abbreviations and titles: dr. → doctor, mr. → mister, govt → government
Contractions: gonna → going to, can't → cannot
Variant spellings: grey → gray, cancelled → canceled
Filler words: Remove um, uh, you know from both sides (or keep both—just be consistent)

This alone eliminates a significant portion of false errors and can be implemented in a few lines of Python. No model inference required.
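A minimal sketch of such a preprocessing step, using only the example rules listed above. The rule table is illustrative rather than exhaustive, the `normalize` name is an assumption, and a production rule set would need a general number-to-words step instead of hard-coded values.

```python
import re

# Ordered (pattern, replacement) pairs, applied to lowercased text.
RULES = [
    (r"\b(um|uh|you know)\b", " "),  # fillers; note this also drops a literal "you know"
    (r"\bdr\.", "doctor "),
    (r"\bmr\.", "mister "),
    (r"\bgovt\b", "government"),
    (r"\bgonna\b", "going to"),
    (r"\bcan't\b", "cannot"),
    (r"\bgrey\b", "gray"),
    (r"\bcancelled\b", "canceled"),
    (r"\b1300\b", "thirteen hundred"),
    (r"\$5\b", "five dollars"),
]


def normalize(text: str) -> str:
    """Apply the same find-and-replace rules to a reference or hypothesis transcript."""
    text = text.lower().replace("’", "'")  # unify curly and straight apostrophes
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^\w\s']", " ", text)  # drop remaining punctuation
    return " ".join(text.split())  # collapse whitespace


hypothesis = "Dr. Smith is gonna pay $5, um, for the grey one."
reference = "Doctor Smith is going to pay five dollars for the gray one"
print(normalize(hypothesis) == normalize(reference))
```

Both transcripts normalize to the same string, so standard WER computed afterwards reports no errors for this pair. The important design choice is to run the identical function on both sides.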

LLM-based scoring

For cases where simple rules can’t capture the nuance—was an omission meaningful? Is a proper noun misspelling close enough?—an LLM can perform word-level alignment and classify each difference by severity:

No penalty: Semantically equivalent forms (number formats, contractions, variant spellings)
Minor penalty: Single-character misspellings, minor grammatical markers
Major penalty: Incorrect substitutions, meaning-altering errors, significant omissions or additions of content words

These approaches are particularly valuable for Universal-3.5 Pro because the model often transcribes audio more accurately than human transcribers, producing differences that are correct but would be penalized by traditional WER. For a tool that scores prompts with Semantic WER while searching for the best prompt with Bayesian optimization, see prompt-seeker.

LASER score (LLM-based ASR Evaluation Rubric)

LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score:

\text{LASER} = 1 - \frac{\text{Total Penalty}}{\text{Reference Word Count}}

The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:

No penalty (0): Acceptable variations including numerical format differences, abbreviations, compound word splits, transliterations, alternate spellings, proper noun variants, and colloquial terms
Minor penalty (0.5): Small spelling errors (single character) or minor grammatical errors (gender, tense, number markers) that preserve sentence meaning
Major penalty (1.0): Incorrect word substitutions, significant omissions or additions, and reordering that changes meaning

LASER provides structured per-error feedback alongside the score. This makes it useful for prompt optimization workflows where you need to understand why a prompt performed poorly, not just how much error there was. For an implementation of LASER scoring, see aai-cli.
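To make the scoring arithmetic concrete, the sketch below turns a list of per-word-pair severity labels into a LASER score. It assumes the labels have already been produced by an LLM following the rubric above; the label names, the `laser_score` function, and the sample counts are hypothetical and are not taken from the published implementation.

```python
# Penalty weights from the rubric above
PENALTY = {"none": 0.0, "minor": 0.5, "major": 1.0}


def laser_score(labels: list[str], reference_word_count: int) -> float:
    """LASER = 1 - (total penalty / reference word count)."""
    if reference_word_count <= 0:
        raise ValueError("reference word count must be positive")
    total_penalty = sum(PENALTY[label] for label in labels)
    return 1 - total_penalty / reference_word_count


# A 10-word reference: 7 acceptable pairs, 2 minor errors, 1 major error
labels = ["none"] * 7 + ["minor"] * 2 + ["major"]
print(round(laser_score(labels, reference_word_count=10), 3))
```

The total penalty is 2 x 0.5 + 1 x 1.0 = 2.0, so the score is 1 - 2.0 / 10 = 0.8. Higher is better, which is the opposite direction from WER.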

Why new metrics matter

Traditional WER treats every difference between the model output and human transcription as an error. Universal-3.5 Pro’s contextual awareness means it will often transcribe words that human transcribers missed entirely. In traditional WER, these show up as insertions (penalized errors), even though the model is correct. This makes WER an unreliable metric when used alone.

WER is only as good as your ground truth labels. Human transcriptions contain systematic errors: missed filler words, incorrect proper nouns, simplified speech patterns, and translated code-switching. When Universal-3.5 Pro transcribes audio more accurately than the human label, those improvements show up as WER errors. Before reporting WER, manually audit at least 20 insertions to determine what percentage are true errors versus ground truth omissions. In our testing, the majority of insertions were cases where Universal-3.5 Pro correctly transcribed audio that the human transcriber missed.
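One way to collect insertions for a manual audit is sketched below. It uses Python's standard `difflib.SequenceMatcher` to align the two word sequences, which is a convenient approximation and not the same alignment a WER library would produce. The function name and sample sentences are illustrative.

```python
from difflib import SequenceMatcher


def list_insertions(reference: str, hypothesis: str, context: int = 2) -> list[dict]:
    """Return words present only in the hypothesis, with surrounding context."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    insertions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # hypothesis words with no counterpart in the reference
            insertions.append(
                {
                    "inserted": " ".join(hyp[j1:j2]),
                    "before": " ".join(hyp[max(0, j1 - context):j1]),
                    "after": " ".join(hyp[j2:j2 + context]),
                }
            )
    return insertions


for item in list_insertions(
    "so we should ship it on friday",
    "so um we should ship it on on friday",
):
    print(item)
```

Each returned entry can be checked against the audio and labeled as a true error or a ground truth omission. In this example the inserted "um" and the repeated "on" are the kind of disfluencies human transcribers often leave out.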

This is why Artificial Analysis, an independent AI benchmarking organization, had to create proprietary evaluation datasets with manually corrected ground truths when building their Speech-to-Text leaderboard. Existing public datasets contain systematic human transcription errors that penalize models which are actually more accurate.

The evaluation process

This section provides a step-by-step guide on how to run an evaluation. The evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to those models.

Step 1: Prepare your evaluation dataset

Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a set of meeting recordings. If you plan to transcribe phone calls, focus on finding phone calls that match your customer base’s language and region. We recommend using at least 25 files that are representative of your use case. Length is less important than diversity of audio conditions: a good evaluation set covers the range of speakers, accents, audio quality, and vocabulary your model will encounter in production.

Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that serves as the “correct answer” for your benchmark. Human-labeled data can be purchased from an external vendor or created manually.

Open-source audio corpora (for example, datasets on Hugging Face) can serve as a starting point for building ground truth, but they require review and correction before use in production evaluations. These datasets contain the same systematic human transcription errors described below — missing filler words, incorrect proper nouns, and simplified speech patterns — and should be audited against your actual audio before benchmarking.

Ground truth quality

The quality of your ground truth data directly affects the reliability of your evaluation. With Universal-3.5 Pro, this is more important than ever because the model frequently outperforms human transcribers. Common issues with ground truth data:

Missing filler words: Human transcribers often omit um, uh, like, and other disfluencies
Incorrect proper nouns: Rare names, technical terms, and domain vocabulary are often misspelled
Simplified speech patterns: Human transcribers tend to “clean up” speech, missing repetitions, false starts, and self-corrections
Code-switching errors: Multilingual segments are frequently translated to English rather than transcribed as spoken

Before running evaluations, audit a sample of your ground truth files by listening to the audio and comparing. If your ground truth contains systematic errors, your WER numbers will be misleading. To inspect and correct issues in your ground truth files, use the Truth File Corrector in the AssemblyAI Dashboard (found at the bottom of the left sidebar), which lets you listen back to audio and fix human transcription errors by clicking through differences.

See this article to learn more about why your word error rate (WER) benchmark might be lying to you.

Dataset diversity

A prompt that performs well overall may underperform on specific audio types. Include a diverse mix of audio in your evaluation set and track per-dataset breakdowns:

Audio type	Characteristics	Typical WER range
Earnings calls	Clean English, formal vocabulary	Low
Meeting recordings	Multi-speaker, informal	Moderate
Code-switching audio	Mixed languages (e.g., English/Spanish)	Higher (normalization affects scoring)
Medical consultations	Clinical vocabulary, accented speech	Moderate
Phone calls	Compression artifacts, background noise	Moderate to high

Step 2: Establish a baseline

Before optimizing prompts, measure your baseline performance by transcribing your evaluation set with Universal-3.5 Pro and no custom prompt. The built-in default prompt is already applied when the prompt parameter is omitted, and it outperforms most custom prompts. Record both WER and Semantic WER so you can track improvements as you layer instructions on top.

Step 3: Transcribe and evaluate with prompts

Transcribe your files using AssemblyAI’s API with Universal-3.5 Pro and your candidate prompts. When crafting evaluation prompts, use the prompting guide as a reference. Key principles:

Use authoritative language: The model responds better to Mandatory:, Required:, and Always: than soft language like try to or please
Be specific about speech patterns: Enumerate what you want preserved (disfluencies, filler words, hesitations, repetitions, stutters, false starts, colloquialisms)
Give instructions, not just context: This is a doctor-patient visit. Prioritize accurately transcribing medications and diseases wherever possible. is far more effective than This is a doctor-patient visit.
Start with fewer instructions, add one at a time: Every added instruction risks conflicting with another. Add a single instruction, evaluate it against your dataset, and only then add the next.
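The last principle can be sketched as a simple greedy loop. Everything here is hypothetical scaffolding: `evaluate` stands in for whatever function transcribes your evaluation set with a given prompt and returns an error score (lower is better), and the candidate instructions are invented examples, not recommended prompts.

```python
from typing import Callable


def add_instructions_one_at_a_time(
    base_prompt: str,
    candidates: list[str],
    evaluate: Callable[[str], float],
) -> tuple[str, list[tuple[str, float]]]:
    """Keep each candidate instruction only if it lowers the error score."""
    best_prompt = base_prompt
    best_score = evaluate(best_prompt)
    history = [(best_prompt, best_score)]
    for instruction in candidates:
        trial = f"{best_prompt} {instruction}".strip()
        score = evaluate(trial)  # one full pass over the evaluation set
        history.append((trial, score))
        if score < best_score:  # accept only measurable improvements
            best_prompt, best_score = trial, score
    return best_prompt, history


def fake_evaluate(prompt: str) -> float:
    """Stand-in scorer for demonstration; replace with a real evaluation run."""
    score = 0.20
    if "disfluencies" in prompt:
        score -= 0.05
    if "British" in prompt:
        score += 0.03
    return score


best, history = add_instructions_one_at_a_time(
    "",
    ["Required: Preserve disfluencies.", "Always: Use British spelling."],
    fake_evaluate,
)
print(best)
```

With the stand-in scorer, the first instruction is kept and the second is rejected because it raises the error. The returned history records every trial, which helps when tracing which instruction caused a regression.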

Step 4: Text normalization

Before calculating WER metrics, both reference (ground truth) and hypothesis (model generated) texts need to be normalized to ensure a fair comparison. This accounts for differences in:

Punctuation and capitalization
Number formatting (e.g., “twenty-one” vs. “21”)
Contractions and abbreviations
Other stylistic variations that don’t affect meaning

Normalization can be done with a library like Whisper Normalizer.

If you are prompting Universal-3.5 Pro to include [unclear] or [masked] tags for uncertain audio, ensure your normalizer strips these tags before computing WER. Otherwise, they will be counted as insertions.
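A minimal sketch of tag stripping combined with basic normalization. It is a stand-in for illustration and does not reproduce the behavior of Whisper Normalizer (for example, it does not convert between digits and spelled-out numbers). The function name is an assumption.

```python
import re

# Matches the uncertainty tags mentioned above, in any letter case
TAG_PATTERN = re.compile(r"\[(?:unclear|masked)\]", flags=re.IGNORECASE)


def normalize_for_wer(text: str) -> str:
    """Strip uncertainty tags, lowercase, and remove punctuation."""
    text = TAG_PATTERN.sub(" ", text)      # remove tags before anything else
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation and hyphens become spaces
    return " ".join(text.split())          # collapse repeated whitespace


print(normalize_for_wer("Okay, so the [unclear] dosage is Twenty-One milligrams."))
```

The tag is removed before punctuation stripping so that the brackets do not leave the word "unclear" behind as a spurious insertion. Apply the same function to the reference and the hypothesis.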

Step 5: Compare and calculate

Calculate the error rates using the formulas above or consider using a library like jiwer. For Semantic WER, apply text normalization replacements before calculating WER. For LASER scoring, use an LLM-based evaluator (see Open-source tools below). When reviewing results:

Check per-dataset breakdowns, not just aggregate WER
Audit insertions manually by listening to the audio
Compare both traditional WER and Semantic WER to get a full picture
Track which prompt components improve which audio types
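A per-dataset breakdown can be sketched as follows. The example pools errors and reference words within each dataset (corpus-level WER) rather than averaging per-file WERs, which keeps short files from dominating. The dataset names, sample transcripts, and function names are invented, and the hand-written `edit_distance` could be replaced by a library such as jiwer.

```python
from collections import defaultdict


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (S + D + I)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer_by_dataset(samples: list[tuple[str, str, str]]) -> dict[str, float]:
    """samples holds (dataset, reference, hypothesis) triples, already normalized."""
    errors, words = defaultdict(int), defaultdict(int)
    for dataset, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        errors[dataset] += edit_distance(ref, hyp)
        words[dataset] += len(ref)
    report = {name: errors[name] / words[name] for name in words}
    report["overall"] = sum(errors.values()) / sum(words.values())
    return report


samples = [
    ("earnings", "revenue grew ten percent", "revenue grew ten percent"),
    ("phone", "can you hear me now", "can you hear me"),
    ("phone", "i will call back later", "i will call you back later"),
]
for name, value in wer_by_dataset(samples).items():
    print(f"{name}: {value:.3f}")
```

Here the aggregate WER of about 0.143 hides the fact that all errors come from the phone dataset, which is the pattern a per-dataset breakdown is meant to expose.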

Qualitative analysis

Quantitative metrics don’t capture everything. Qualitative analysis helps you identify differences between STT providers that metrics might miss. For example, how certain key terms are transcribed can make or break a transcript, even if the rest of the transcript has a lower overall error rate. Qualitative analysis is also useful for tie-breaking when benchmarking metrics don’t clearly favor one model over another. Since you’re comparing models against each other, ground truth files aren’t required.

Side-by-side comparison: Have users compare and pick their preferred transcript between two formatted outputs from different STT providers. Tools like Diffchecker or any side-by-side interface work well for this.
LLM as judge: An LLM can automatically identify differences between two transcriptions and pick a winner. However, be cautious: an LLM judge can be misled by outputs that look correct but contain subtle errors (such as translated code-switching segments that read well in English but don’t reflect what was actually spoken). Always pair LLM-based judgments with spot-checking against the actual audio.
A/B testing in production: Serve transcripts from different providers to users and collect feedback. You can ask users to score transcripts directly, or track indirect signals like the number of support ticket complaints about transcription quality.

Domain-specific evaluation considerations

WER is not always the right primary metric. Some domains prioritize output qualities that traditional accuracy metrics do not capture:

Medical scribes: Customers often evaluate based on user preference rate and readability — whether clinicians prefer the transcript output for generating clinical notes. Formatting quality, medical terminology accuracy, and structured output can matter more than raw WER. See the Medical Scribe guides for domain-specific evaluation guidance.
Legal transcription: Verbatim accuracy including disfluencies and speaker attribution may be more important than clean, readable output.
Media and entertainment: Proper noun accuracy for names, places, and brands can outweigh overall WER.

When running evaluations for domain-specific use cases, define your success criteria before choosing metrics. If your end users care about readability and preference, include qualitative evaluation (side-by-side comparisons, user preference scoring) alongside quantitative metrics.

Iterating on prompts

Finding the optimal prompt for your use case is an iterative process. There are two main approaches:

Manual iteration

Start with the default system prompt or one of the reference prompts below
Transcribe a representative sample of your audio
Review the output against your ground truth, focusing on the types of errors that matter most for your use case
Adjust the prompt to address specific error patterns
Re-evaluate and compare

Automated optimization

For large-scale prompt optimization, consider using one of the open-source tools described below. These tools systematically test prompt component combinations and score them against your evaluation data, converging on the best prompt for your specific audio.

Reference prompts for evaluation

Use these prompts directly from the prompting guide as your evaluation prompts.

Evaluation prompt

Start with the built-in default (Best all around). Omit the prompt parameter to use it — you don’t need to set it explicitly:

Transcribe with context and proper nouns preserved, where speech is
present in the audio. Each language as spoken. English as English.
Non-native speakers.

For maximum verbatim capture and multilingual code-switching, use the Verbatim with multilingual support prompt instead. The trade-off is that the model may occasionally hallucinate disfluencies or language switches that don’t exist in the audio.

Comparison prompt (for identifying model uncertainty)

This is the Handling unclear audio with [unclear] prompt. Run it alongside the evaluation prompt on the same audio and diff the outputs to find where the model is guessing:

Always: Transcribe speech exactly as heard. If uncertain or audio is
unclear, mark as [unclear]. After the first output, review the transcript
again. Pay close attention to hallucinations, misspellings, or errors,
and revise them like a computer performing spell and grammar checks.
Ensure words and phrases make grammatical sense in sentences.

By comparing the two outputs, you can identify exactly which segments the model is least confident about. This is useful for:

Evaluating how the model handles unclear or noisy audio
Finding segments where the model’s guesses may be incorrect
Prioritizing which audio segments to manually review
Understanding whether WER differences are coming from genuine errors or uncertain segments
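The comparison can be automated with a word-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher` to pair each [unclear] span in the comparison output with the words the evaluation prompt produced at the same position. The function name and sample transcripts are illustrative assumptions.

```python
from difflib import SequenceMatcher


def uncertain_segments(evaluation_output: str, comparison_output: str) -> list[tuple[str, str]]:
    """Pair each [unclear] span with the evaluation prompt's words at that position."""
    eval_words = evaluation_output.split()
    comp_words = comparison_output.split()
    matcher = SequenceMatcher(a=eval_words, b=comp_words, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        comp_span = comp_words[j1:j2]
        # Keep only differing spans where the comparison output flagged uncertainty
        if tag != "equal" and any("[unclear]" in word for word in comp_span):
            segments.append((" ".join(eval_words[i1:i2]), " ".join(comp_span)))
    return segments


evaluation = "take two tablets of lisinopril every morning"
comparison = "take two tablets of [unclear] every morning"
print(uncertain_segments(evaluation, comparison))
```

In this example the only flagged segment is the drug name, which marks it as a word the model may be guessing and a good candidate for manual review against the audio.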

What works and what doesn’t

The authoritative list lives in the prompting guide — see What works / what to avoid. The same rules apply when building evaluation prompts: lead with Transcribe…, use authoritative language (Required:, Mandatory:, Always:), describe the pattern to watch for, and add instructions one at a time.

Listing specific word examples in your prompt causes hallucinations. The model becomes over-eager to insert those exact words into the transcript, even when they weren’t spoken. For example, Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) will cause the model to hallucinate those drug names. Instead, describe the pattern of entities to prioritize: Pharmaceutical accuracy required across all medications and drug names. If you already know the specific terms, use keyterms prompting instead — it’s optimized for term boosting and more reliable than describing terms in a free-form prompt. See the prompting guide for more details.

Open-source tools

aai-cli

aai-cli is a command-line tool for evaluating and optimizing transcription prompts. It supports:

Prompt evaluation: Score a prompt against datasets from Hugging Face or your own audio files using WER and LASER metrics
Prompt optimization: Automatically iterate on prompts using DSPy GEPA with LASER feedback, where an LLM reflects on transcription errors and proposes improved prompts
Dataset discovery: Search and load audio datasets from Hugging Face for benchmarking

# Evaluate a prompt
aai eval --prompt "Transcribe verbatim." --max-samples 50

# Optimize a prompt
aai optimize --starting-prompt "Transcribe verbatim." --iterations 5 --samples 50

prompt-seeker

prompt-seeker uses Bayesian optimization (Optuna TPE) to systematically find the best transcription prompt by testing component combinations across diverse audio datasets and scoring with Semantic WER. It supports:

Component-based optimization: Modular prompt pieces (language, disfluency, punctuation, etc.) are tested in combinations
Meta-optimization: An LLM designs new component spaces between optimization rounds based on accumulated findings
Per-dataset analysis: Breakdown of what works for each audio type in your evaluation set

# Run optimization (50 trials across your data)
uv run python -m prompt_seeker.cli optimize \
  --datasets "my_calls:50" \
  --trials 50 -c 20

# Run the meta-optimizer (Claude designs rounds autonomously)
uv run python -m prompt_seeker.cli meta-optimize \
  --datasets "my_calls:50" \
  --rounds 3 --trials 50 -c 10

Both tools require ground truth transcriptions for scoring. If you don’t have ground truth yet, transcribe a sample of your audio manually and use that as your starting point.

Conclusion

Evaluating Universal-3.5 Pro requires going beyond traditional WER. The model’s contextual awareness and prompting capabilities mean that evaluation is as much about finding the right prompt as it is about measuring accuracy. Use Semantic WER or LASER alongside traditional WER, audit your ground truth data carefully, and iterate on prompts systematically to find the best configuration for your audio.
