Evaluating Pre-recorded STT models
The high-level objective of a pre-recorded STT model evaluation is to answer one question: which speech-to-text model is best for my product?
This guide provides a step-by-step framework for evaluating and benchmarking pre-recorded Speech-to-text models, with specific guidance for evaluating Universal-3 Pro and its prompting capabilities.
Need help with evaluations or prompt optimization? Contact our Sales team — we can help you design an evaluation, optimize prompts for your audio, and benchmark against your ground truth data.
WER = (S + D + I) / N. This formula takes the number of Substitutions (S), Deletions (D), and Insertions (I), and divides their sum by the Total Number of Words in the ground truth transcript (N).
While WER calculation may seem simple, it requires a methodical, granular approach and reliable reference data. Word Error Rate tells you how “different” the automatic transcription is from the human transcription, and it is generally a reliable indicator of how “good” a transcription is. For more info on WER as a metric, read Dylan Fox’s blog post here.
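As a minimal sketch of the calculation, (S + D + I) equals the word-level edit distance between the reference and the hypothesis, so WER can be computed with a short dynamic program. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; both texts should be normalized before being passed in.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.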
cpWER is similar to WER, but words attributed to the wrong speaker are also counted as errors. The primary difference from standard WER is how S is calculated: it counts both word substitutions and correctly transcribed words that are assigned to the wrong speaker. A correct word with an incorrect speaker label counts as a substitution error, thereby penalizing both transcription and speaker diarization mistakes.
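The description above can be sketched by running the same edit-distance calculation over (speaker, word) pairs instead of bare words, so a correct word with the wrong speaker no longer matches. This is a simplified illustration, not a full cpWER implementation: it assumes the hypothesis speaker labels have already been mapped onto the reference labels, and the name `speaker_aware_wer` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (speaker label, word)


def speaker_aware_wer(reference: List[Token], hypothesis: List[Token]) -> float:
    """WER over (speaker, word) pairs; a wrong speaker makes the pair a mismatch."""
    if not reference:
        raise ValueError("reference must contain at least one token")

    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            # Tuples are equal only if both speaker and word agree.
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)
```

In practice speaker labels are arbitrary (for example `A`/`B` in one transcript and `0`/`1` in another), so a label-mapping step is needed before a comparison like this is meaningful.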
F-WER is similar to WER, but it does not apply text normalization, so all formatting differences are counted in addition to word differences. F-WER is therefore always greater than or equal to WER.
The Sentence Error Rate (SER) is the ratio of the number of sentences with one or more errors to the total number of sentences.
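A minimal sketch of this ratio is shown below. It assumes the reference and hypothesis have already been split into the same number of sentences, aligned one-to-one, and normalized; the function name is illustrative.

```python
from typing import Sequence


def sentence_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Fraction of aligned sentence pairs that differ by at least one word."""
    if len(reference) != len(hypothesis):
        raise ValueError("sentences must be aligned one-to-one")
    if not reference:
        raise ValueError("at least one sentence is required")

    # Compare word lists so that differences in spacing are ignored.
    wrong = sum(1 for r, h in zip(reference, hypothesis) if r.split() != h.split())
    return wrong / len(reference)
```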
DER = (false alarm + missed detection + confusion) / total speech duration. This formula takes the duration of non-speech incorrectly classified as speech (false alarm), the duration of speech incorrectly classified as non-speech (missed detection), and the duration of speaker confusion (confusion), and divides their sum by the total speech duration.
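Expressed as code, the calculation is a single ratio. The sketch below assumes the three error durations and the total speech duration have already been measured in the same unit (for example, seconds); the function and parameter names are illustrative.

```python
def diarization_error_rate(
    false_alarm: float,
    missed_detection: float,
    confusion: float,
    total_speech: float,
) -> float:
    """Sum of the three error durations divided by total speech duration."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech
```

For example, 2 seconds of false alarm, 3 seconds of missed detection, and 5 seconds of confusion over 100 seconds of speech gives a rate of 0.1.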
Fundamentally, MER is a negative recall rate computed for specified target entities. It is defined as the number of correctly transcribed entities relative to their total occurrence count. It accounts for multiple occurrences of the same entity and their positions within the hypothesis transcription. Our Research team proposes this as the best metric to measure the effectiveness of word boost.
For a simpler approach when evaluating high-stakes entities (such as credit card numbers, names, or dosages), consider a binary pass/fail metric per file: score each file as 1 if the model captured the target entity correctly, or 0 if it did not. Run this across 100 or more files for statistical reliability. This is especially useful when you need a quick signal on entity accuracy at scale without computing full MER breakdowns.
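The pass/fail approach could be sketched as follows. This example assumes a case-insensitive substring match is an acceptable definition of "captured correctly", which may be too strict or too loose for your entities; the function name and input shape are hypothetical.

```python
from typing import List, Tuple


def entity_pass_rate(files: List[Tuple[str, str]]) -> float:
    """files holds (target_entity, transcript) pairs; returns the share scored 1."""
    if not files:
        raise ValueError("at least one file is required")

    def squash(text: str) -> str:
        # Lowercase and collapse whitespace so spacing differences do not fail a file.
        return " ".join(text.lower().split())

    scores = [1 if squash(entity) in squash(transcript) else 0
              for entity, transcript in files]
    return sum(scores) / len(scores)
```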
Universal-3 Pro is significantly more capable than prior models, and traditional WER alone may not fully capture its performance. The following metrics provide a more complete picture.
Traditional WER treats every difference between the model output and a reference transcript as an error—even when the difference is semantically equivalent. Semantic WER corrects this by normalizing equivalent words and phrases before calculating WER, so that differences like dr. vs doctor or 1300 vs thirteen hundred aren’t counted as errors.
At its simplest, Semantic WER is a preprocessing step. Before running standard WER, apply find-and-replace rules to both the reference and hypothesis transcripts:
- 1300 → thirteen hundred, $5 → five dollars
- dr. → doctor, mr. → mister, govt → government
- gonna → going to, can't → cannot
- grey → gray, cancelled → canceled
- Remove um, uh, you know from both sides (or keep both; just be consistent)
This alone eliminates a significant portion of false errors and can be implemented in a few lines of Python. No model inference required.
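A minimal sketch of such a preprocessing step, using only the example rules listed above, might look like this. The rule table is deliberately tiny, and the final punctuation stripping is an added assumption so that leftover commas do not count as words; a real rule set would be built from your own data.

```python
import re

# (pattern, replacement) pairs taken from the examples above.
RULES = [
    (r"\b1300\b", "thirteen hundred"),
    (r"\$5\b", "five dollars"),
    (r"\bdr\.", "doctor"),
    (r"\bmr\.", "mister"),
    (r"\bgovt\b", "government"),
    (r"\bgonna\b", "going to"),
    (r"\bcan't\b", "cannot"),
    (r"\bgrey\b", "gray"),
    (r"\bcancelled\b", "canceled"),
    (r"\b(?:um|uh|you know)\b", ""),  # drop fillers on both sides
]


def semantic_normalize(text: str) -> str:
    """Apply the same replacements to reference and hypothesis before WER."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^\w\s']", " ", text)  # assumption: strip remaining punctuation
    return " ".join(text.split())
```

Apply `semantic_normalize` to both transcripts, then compute standard WER on the results. Note that removing "you know" everywhere also removes non-filler uses of the phrase, which is one reason to review rules against real transcripts.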
For cases where simple rules can’t capture the nuance—was an omission meaningful? Is a proper noun misspelling close enough?—an LLM can perform word-level alignment and classify each difference by severity:
These approaches are particularly valuable for Universal-3 Pro because the model often transcribes audio more accurately than human transcribers, producing differences that are correct but would be penalized by traditional WER. For an implementation of Semantic WER using Bayesian optimization, see prompt-seeker.
LASER is a published LLM-based evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses an LLM prompt with detailed examples to classify ASR errors and compute a score:
The LLM aligns each word in the ASR output against the reference transcription and assigns a penalty per word pair:
LASER provides structured per-error feedback alongside the score. This makes it useful for prompt optimization workflows where you need to understand why a prompt performed poorly, not just how much error there was. For an implementation of LASER scoring, see aai-cli.
Traditional WER treats every difference between the model output and human transcription as an error. Universal-3 Pro’s contextual awareness means it will often transcribe words that human transcribers missed entirely. In traditional WER, these show up as insertions (penalized errors), even though the model is correct. This makes WER an unreliable metric when used alone — your evaluation is only as good as your ground truth labels.
WER is only as good as your ground truth labels. Human transcriptions contain systematic errors — missed filler words, incorrect proper nouns, simplified speech patterns, and translated code-switching. When Universal-3 Pro transcribes audio more accurately than the human label, those improvements show up as WER errors.
Before reporting WER, manually audit at least 20 insertions to determine what percentage are true errors versus ground truth omissions. In our testing, the majority of insertions were cases where Universal-3 Pro correctly transcribed audio that the human transcriber missed.
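One way to collect insertions for a manual audit is sketched below using Python's standard `difflib`. This is an approximation: `SequenceMatcher` produces a reasonable alignment but not necessarily the same one a WER tool uses, and the function names and the context window size are assumptions.

```python
import difflib
import random
from typing import Dict, List


def find_insertions(reference: str, hypothesis: str, context: int = 3) -> List[Dict[str, str]]:
    """Return hypothesis-only word runs, each with a few surrounding words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)

    insertions = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            insertions.append({
                "inserted": " ".join(hyp[j1:j2]),
                "context": " ".join(hyp[max(0, j1 - context):j2 + context]),
            })
    return insertions


def sample_for_audit(insertions: List[Dict[str, str]], k: int = 20, seed: int = 0) -> List[Dict[str, str]]:
    """Pick up to k insertions at random for a listen-and-check review."""
    rng = random.Random(seed)
    return rng.sample(insertions, min(k, len(insertions)))
```

Each sampled item can then be checked against the audio and labeled as a true error or a ground truth omission.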
This is why Artificial Analysis, an independent AI benchmarking organization, had to create proprietary evaluation datasets with manually corrected ground truths when building their Speech-to-Text leaderboard. Existing public datasets contain systematic human transcription errors that penalize models which are actually more accurate.
This section provides a step-by-step guide on how to run an evaluation. The evaluation process should closely match your production environment, including the files you intend to transcribe, the model you intend to use, and the settings applied to those models.
Ensure that the files you use to benchmark are representative of the files you plan to use in production. For example, if you plan to transcribe meetings, gather a set of meeting recordings. If you plan to transcribe phone calls, focus on finding phone calls that match your customer base’s language and region.
We recommend using at least 25 files that are representative of your use case. Length is less important than diversity of audio conditions — a good evaluation set covers the range of speakers, accents, audio quality, and vocabulary your model will encounter in production.
Then, gather human-labeled data to act as your source of ground truth. Ground truth is accurately transcribed audio data that will serve as the “correct answer” for our benchmark. Human-labeled data can be purchased from an external vendor or created manually.
Open-source audio corpora (for example, datasets on Hugging Face) can serve as a starting point for building ground truth, but they require review and correction before use in production evaluations. These datasets contain the same systematic human transcription errors described below — missing filler words, incorrect proper nouns, and simplified speech patterns — and should be audited against your actual audio before benchmarking.
The quality of your ground truth data directly affects the reliability of your evaluation. With Universal-3 Pro, this is more important than ever because the model frequently outperforms human transcribers.
Common issues with ground truth data:
- Missing filler words such as um, uh, like, and other disfluencies
Before running evaluations, audit a sample of your ground truth files by listening to the audio and comparing. If your ground truth contains systematic errors, your WER numbers will be misleading. To inspect and correct issues in your ground truth files, use the Truth File Corrector in the AssemblyAI Dashboard (found at the bottom of the left sidebar), which lets you listen back to audio and fix human transcription errors by clicking through differences.
See this article to learn more about why your word error rate (WER) benchmark might be lying to you.
A prompt that performs well overall may underperform on specific audio types. Include a diverse mix of audio in your evaluation set and track per-dataset breakdowns:
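A per-dataset breakdown could be computed along these lines. The sketch pools errors and reference words within each dataset (rather than averaging per-file WER), which is one common choice; the dataset labels, function names, and input shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_dataset(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples holds (dataset, reference, hypothesis); returns pooled WER per dataset."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for dataset, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        errors[dataset] += edit_distance(ref, hyp)
        words[dataset] += len(ref)
    # Skip datasets whose references are all empty to avoid dividing by zero.
    return {name: errors[name] / words[name] for name in words if words[name]}
```

Running this once per candidate prompt makes it easier to spot a prompt that improves the overall number while regressing on one audio type.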
Before optimizing prompts, measure your baseline performance by transcribing your evaluation set with Universal-3 Pro and no custom prompt. The built-in default is already applied when prompt is omitted, and it outperforms most custom prompts. Record both WER and Semantic WER so you can track improvements as you layer instructions on top.
Transcribe your files using AssemblyAI’s API with Universal-3 Pro and your candidate prompts.
When crafting evaluation prompts, use the prompting guide as a reference. Key principles:
- Authoritative language such as Mandatory:, Required:, and Always: is more effective than soft language like try to or please
- "This is a doctor-patient visit. Prioritize accurately transcribing medications and diseases wherever possible." is far more effective than "This is a doctor-patient visit."
Before calculating WER metrics, both reference (ground truth) and hypothesis (model-generated) texts need to be normalized to ensure a fair comparison.
This accounts for differences in:
Normalization can be done with a library like Whisper Normalizer.
If you are prompting Universal-3 Pro to include [unclear] or [masked] tags
for uncertain audio, ensure your normalizer strips these tags before computing
WER. Otherwise, they will be counted as insertions.
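As a rough sketch, tag stripping can be folded into a small normalization function. This is not the Whisper Normalizer API; it is a hand-rolled stand-in that removes the two tags named above, lowercases, and drops punctuation, and the function name is illustrative.

```python
import re

# Matches the literal tags [unclear] and [masked], in any letter case.
TAG_PATTERN = re.compile(r"\[(?:unclear|masked)\]", flags=re.IGNORECASE)


def normalize_for_wer(text: str) -> str:
    """Strip uncertainty tags, lowercase, and remove punctuation."""
    text = TAG_PATTERN.sub(" ", text)
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes inside words
    return " ".join(text.split())
```

Apply the same function to both the reference and the hypothesis so that neither side is normalized differently.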
Calculate the error rates using the formulas above or consider using a library like jiwer. For Semantic WER, apply text normalization replacements before calculating WER. For LASER scoring, use an LLM-based evaluator (see Open-source tools below).
When reviewing results:
Quantitative metrics don’t capture everything. Qualitative analysis helps you identify differences between STT providers that metrics might miss — for example, how certain key terms are transcribed can make or break a transcript, even if the rest of the transcript has a lower overall error rate.
Qualitative analysis is also useful for tie-breaking when benchmarking metrics don’t clearly favor one model over another. Since you’re comparing models against each other, ground truth files aren’t required.
Side-by-side comparison: Have users compare and pick their preferred transcript between two formatted outputs from different STT providers. Tools like Diffchecker or any side-by-side interface work well for this.
LLM as judge: An LLM can automatically identify differences between two transcriptions and pick a winner. However, be cautious: an LLM judge can be misled by outputs that look correct but contain subtle errors (such as translated code-switching segments that read well in English but don’t reflect what was actually spoken). Always pair LLM-based judgments with spot-checking against the actual audio.
A/B testing in production: Serve transcripts from different providers to users and collect feedback. You can ask users to score transcripts directly, or track indirect signals like the number of support ticket complaints about transcription quality.
WER is not always the right primary metric. Some domains prioritize output qualities that traditional accuracy metrics do not capture:
When running evaluations for domain-specific use cases, define your success criteria before choosing metrics. If your end users care about readability and preference, include qualitative evaluation (side-by-side comparisons, user preference scoring) alongside quantitative metrics.
Finding the optimal prompt for your use case is an iterative process. There are two main approaches:
For large-scale prompt optimization, consider using one of the open-source tools described below. These tools systematically test prompt component combinations and score them against your evaluation data, converging on the best prompt for your specific audio.
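To illustrate what testing component combinations means, here is a small exhaustive search, which is a simpler technique than the optimization those tools perform. The `score` callable is a placeholder you would implement yourself (for example, transcribe the evaluation set with the prompt and return mean Semantic WER, lower being better); all names here are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def rank_prompts(
    base: str,
    components: Sequence[str],
    score: Callable[[str], float],
    max_components: int = 2,
) -> List[Tuple[float, str]]:
    """Score the base prompt plus every combination of up to max_components additions."""
    candidates = [base]
    for size in range(1, max_components + 1):
        for combo in combinations(components, size):
            candidates.append(" ".join((base,) + combo))

    # Lower score is better, so the best candidate comes first.
    scored = [(score(prompt), prompt) for prompt in candidates]
    return sorted(scored, key=lambda item: item[0])
```

Exhaustive search grows quickly with the number of components, which is why a guided search becomes attractive for larger prompt spaces.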
Use these prompts directly from the prompting guide as your evaluation prompts.
Start with the built-in default (Best all around). Omit the prompt parameter to use it — you don’t need to set it explicitly:
For maximum verbatim capture and multilingual code-switching, use the Verbatim with multilingual support prompt instead. The trade-off is that the model may occasionally hallucinate disfluencies or language switches that don’t exist in the audio.
This is the Handling unclear audio with [unclear] prompt. Run it alongside the evaluation prompt on the same audio and diff the outputs to find where the model is guessing:
By comparing the two outputs, you can identify exactly which segments the model is least confident about. This is useful for:
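One way to diff the two outputs is sketched below with the standard `difflib` module: it aligns the words of the regular transcript against the transcript produced with the [unclear] prompt and reports which words in the former line up with an [unclear] tag. The function name is an assumption, and the alignment is approximate.

```python
import difflib
from typing import List


def guessed_segments(baseline: str, with_unclear: str) -> List[str]:
    """Return baseline word runs that the second transcript marked as [unclear]."""
    a = baseline.split()
    b = with_unclear.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)

    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A non-equal block whose second side contains the tag marks a spot
        # where the baseline committed to words the other run was unsure of.
        if tag != "equal" and any("[unclear]" in word for word in b[j1:j2]):
            segments.append(" ".join(a[i1:i2]))
    return segments
```

An empty string in the result means the tag appeared where the baseline transcript had no words at all.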
The authoritative list lives in the prompting guide — see What works / what to avoid. The same rules apply when building evaluation prompts: lead with Transcribe…, use authoritative language (Required:, Mandatory:, Always:), describe the pattern to watch for, and add instructions one at a time.
Listing specific word examples in your prompt causes hallucinations. The model becomes over-eager to insert those exact words into the transcript, even when they weren’t spoken. For example, Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) will cause the model to hallucinate those drug names. Instead, describe the pattern of entities to prioritize: Pharmaceutical accuracy required across all medications and drug names. If you already know the specific terms, use keyterms prompting instead; it’s optimized for term boosting and more reliable than describing terms in a free-form prompt. See the prompting guide for more details.
aai-cli is a command-line tool for evaluating and optimizing transcription prompts. It supports:
prompt-seeker uses Bayesian optimization (Optuna TPE) to systematically find the best transcription prompt by testing component combinations across diverse audio datasets and scoring with Semantic WER. It supports:
Both tools require ground truth transcriptions for scoring. If you don’t have ground truth yet, transcribe a sample of your audio manually and use that as your starting point.
Evaluating Universal-3 Pro requires going beyond traditional WER. The model’s contextual awareness and prompting capabilities mean that evaluation is as much about finding the right prompt as it is about measuring accuracy. Use Semantic WER or LASER alongside traditional WER, audit your ground truth data carefully, and iterate on prompts systematically to find the best configuration for your audio.