How to evaluate AI models and systems: Why objective benchmarks are important

In this article, hear from AssemblyAI’s CEO and members of its research and solutions teams about the role of objective benchmarks in fairly evaluating AI models, and learn how to evaluate AI systems on your own.

The artificial intelligence industry is expected to become a trillion-dollar market in less than a decade, changing the way we learn, work, and interact with technology and people every day. 

As new AI technology continues to impact our businesses and daily lives, it’s clear there’s little guidance available on how to evaluate AI systems and choose the best option for your needs. Knowing how to conduct evaluations properly helps ensure that the AI models you choose perform as close to human level as possible.

At AssemblyAI, we believe that AI systems are only as good as the benchmarks and evaluations they are measured against. Without consistent benchmarks, evaluators may see skewed results and have no confidence range for how a model will perform in production, which leads to a poor user experience. Objective evaluations and benchmarks validate a system’s performance, so users know the AI models they’re using will solve real-world challenges.

Objective benchmarks provide a clear, unbiased yardstick for comparing different AI solutions, so individuals can understand how AI models perform across different tasks. This transparency demystifies AI capabilities, aligns expectations, and promotes informed decision-making.

Objective third-party organizations are needed to conduct reproducible evaluations and benchmarks of AI systems. These independent bodies share their methodology and provide an unbiased, apples-to-apples comparison of technologies, helping buyers, builders, and investors validate claims and adopt the technology that aligns with their goals.

"There are a lot of inconsistencies with methodology (how models are evaluated against different data sets and how the data sets are curated) between public datasets and private, real-world data," shares Dylan Fox, AssemblyAI Founder & CEO. "We always risk optimizing toward our internal datasets, inadvertently overfitting, so having an independent body to oversee and verify AI benchmarks using open-source data sets is essential."

What is an objective organization?

Objective organizations ensure that evaluations are conducted impartially and with scientific integrity so that the public has access to datasets they can trust. 

An objective organization performs assessments with an impartial, independent approach, free from commercial pressures and conflicts of interest. It provides a fair, unbiased evaluation of AI systems based on their performance in standardized tests rather than promotional or internal data.

According to research lead Luka Chkhetiani, "An objective organization for this would need to be competent with no reason to be subjective. They'd need to be willing to contribute to the growth of the domain by providing truthful evaluation results."

Characteristics of an objective organization

  • Independence: The organization should have no financial or collaborative ties with AI developers whose systems they evaluate. This independence prevents any potential conflicts of interest that could bias the evaluation results.
  • Expertise: The organization should be staffed by experts in AI and related fields to guarantee the evaluations are based on the latest scientific standards and methodologies.
  • Transparency: The organization should operate transparently and provide clear documentation of its testing methodologies, data handling practices, and evaluation criteria. This transparency helps stakeholders understand and trust the evaluation process and results.

Why don't third-party organizations already exist?

Setting up third-party evaluations isn't straightforward. It demands regular updates to keep pace with how quickly AI is evolving.

"It's not easy to set up benchmarking pipelines that conduct fair evaluations. Models change, API schemas change, and you really need to be on top of the space to do this right," says Sam Flamini, former senior solutions architect at AssemblyAI. "You also need to be smart about how you select your datasets during evaluation."

Another barrier is funding. "Ideally, you would want this type of organization to be backed and run by objective and experienced scientists," says Chkhetiani. But expert AI scientists and the computing power to run these evaluations both require resources.

"Providing the resources necessary to incentivize building and running this type of organization would be the first step to attracting such individuals," Chkhetiani adds.

While only a few unbiased third-party organizations, like Artificial Analysis, evaluate AI systems today, more are likely to emerge.

"There is demand for this as a service, and I expect that we'll see organizations pop up that effectively serve as the 'G2' for AI models," says Flamini. "The difference between traditional SaaS reviews and AI model analysis is that you can get more objective data for AI models and need to do more continuous evaluation. You don't need to source reviews—you just need to be smart about dataset selection, build good pipelines, and update things as new models are released."

In the meantime, it’s helpful to look at evaluations from existing objective organizations as much as possible or run the evaluations yourself to make the most informed decision.

Evaluating AI models and systems: metrics to consider

Many developers and teams continue to examine a wide range of metrics and run their own benchmarks to evaluate AI systems as a whole or to evaluate specific AI models within a broader system—like automatic speech recognition (or speech-to-text) models. 

"Speech recognition systems are just tools for different organizations with different needs," explains Chkhetiani. "Measuring system performance with various metrics provides users with meaningful information to decide if a specific provider meets their needs."

With so many metrics to look at, what should users prioritize? 

It’s important to note that the metrics to consider depend on what you’re evaluating.

Here are some examples:

How to evaluate speech-to-text AI models

When evaluating ASR or speech-to-text models, for example, here are some metrics to consider:

  1. Word Error Rate (WER): The most common metric for evaluating the accuracy of speech recognition models. It measures the percentage of errors in transcription compared to a reference transcript. WER is important in scenarios (like legal or medical transcription services) where precise, verbatim transcription is non-negotiable.
  2. Character Error Rate (CER): This metric is useful for languages where character-level precision is important—such as Chinese or Japanese—which don't use spaces between words. CER measures the percentage of individual character errors to provide insights into the transcription accuracy at a more granular level than WER.
  3. Real-Time Factor (RTF): RTF measures how quickly a speech recognition system processes audio relative to the audio's duration (processing time divided by audio length, so lower is faster). Rapid response times matter for applications such as interactive voice response (IVR) systems used in customer service.
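
To make these definitions concrete, here is a minimal, self-contained sketch of how WER, CER, and RTF can be computed. It uses a plain Levenshtein edit distance rather than any particular library, and the example transcripts and timings are made up for illustration.

```python
# Minimal sketch of WER, CER, and RTF (illustrative only; production pipelines
# usually also normalize text, e.g., lowercasing and stripping punctuation).
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

reference = "call me at five five five one two three four"
hypothesis = "call me at five five five one too three for"
print(f"WER: {wer(reference, hypothesis):.2f}")   # 2 substitutions / 10 words = 0.20
print(f"CER: {cer(reference, hypothesis):.2f}")

# RTF: time spent processing divided by the audio's duration (lower is faster).
processing_seconds, audio_seconds = 1.8, 12.0     # made-up numbers
print(f"RTF: {processing_seconds / audio_seconds:.2f}")
```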

Learn more about how to evaluate Speech AI models here. 

Different applications might prioritize different aspects of a speech recognition system's performance. For instance, "an organization that is using speech recognition results to create tickets from customer service phone calls would care about how well the models perform on alphanumerics (phone numbers, flight numbers, etc.) and proper nouns (USA, Mercedes, etc.)," says Chkhetiani.

For these cases, a metric that reports error rates on proper nouns and alphanumerics provides valuable information that WER might not fully capture.
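
As an illustration, here is one simple way such a metric could be sketched: take every alphanumeric token in the reference transcript and check whether it survives intact in the hypothesis. The regular expression and the recall-style formulation are assumptions for this example, not a standardized metric definition.

```python
# Illustrative sketch: recall of alphanumeric tokens (phone numbers, flight
# numbers, etc.) from the reference transcript. Not a standardized metric.
import re

ALNUM_TOKEN = re.compile(r"\b\w*\d\w*\b")  # word-like tokens containing a digit

def alphanumeric_recall(reference: str, hypothesis: str) -> float:
    targets = ALNUM_TOKEN.findall(reference.lower())
    if not targets:
        return 1.0  # nothing to get wrong
    hyp_tokens = set(hypothesis.lower().split())
    found = sum(1 for token in targets if token in hyp_tokens)
    return found / len(targets)

# The hypothesis splits "ua1234" into two tokens, so recall drops to 0.5.
print(alphanumeric_recall("flight UA1234 departs at 5pm",
                          "flight ua 1234 departs at 5pm"))
```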

Beyond quantitative metrics, qualitative evaluations can also help understand the performance of speech recognition models. These evaluations can capture nuances that quantitative metrics may overlook, such as how natural and understandable the transcribed text feels to human listeners.

How to evaluate LLMs

When evaluating Natural Language Processing (NLP) models, specifically Large Language Models (LLMs), it’s important to:

  1. Use automatic metrics to target the specific end task you’re evaluating. (See more on this in the “Example evaluation metrics” section below.) 
  2. Follow the quantitative evaluation with a QA process in which humans directly examine LLM outputs and assess their accuracy. This step is necessary because automatic metrics often correlate only weakly with human judgment, especially when the task you’re evaluating is more complex.

By doing both a quantitative and qualitative analysis, you can be more confident in the accuracy of the overall evaluation.  

More recently, the research community has started using LLMs (e.g., GPT-4, Claude 3) to run qualitative evaluations quantitatively. The LLM acts as a proxy judge: a person feeds the evaluation criteria to the LLM so that its verdicts align more closely with real human judgment.

For example: 

You will be given a question and two answers to the same question, produced
by two different systems.

Your task is to choose which answer best fulfills the user concerns
expressed in the question. The two answers must be judged based on the
following aspects:

 - Semantic correctness: how well the answer satisfies all user concerns
 - Syntactic correctness: how well the answer is structured from a
 syntactic perspective
 - Formatting goodness: how good the answer looks and how well the system
 uses punctuation marks and casing to emphasize the concepts expressed.
 
Make sure you only answer with the most appropriate answer, with no
additional text: Answer 1 or Answer 2.

Now here are the questions and answers:
Question: {question}
Answer 1: {answer 1}
Answer 2: {answer 2}

Please choose the most appropriate answer.

Research shows that this approach correlates much better with human evaluation than legacy automatic metrics do.
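
As a sketch of how this might be wired up in practice, the snippet below sends the prompt above to an LLM through the OpenAI Python SDK. The model name is illustrative, and prompt_template is assumed to hold the prompt shown above with its placeholders renamed to valid Python format fields; this is one possible setup, not a prescribed pipeline.

```python
# Sketch of an LLM-as-judge call (assumes the OpenAI Python SDK and an API key
# in the OPENAI_API_KEY environment variable; the model name is illustrative).
from openai import OpenAI

client = OpenAI()

def judge(prompt_template: str, question: str, answer_1: str, answer_2: str) -> str:
    """Ask the judge model to pick the better of two answers."""
    prompt = prompt_template.format(
        question=question, answer_1=answer_1, answer_2=answer_2
    )
    response = client.chat.completions.create(
        model="gpt-4o",   # any capable judge model
        temperature=0,    # deterministic output keeps the evaluation reproducible
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt instructs the model to reply with only "Answer 1" or "Answer 2".
    return response.choices[0].message.content.strip()
```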

Example evaluation metrics for specific LLM tasks

| Task | Application | Evaluation Metrics |
| --- | --- | --- |
| Text Classification | Sentiment Analysis, Topic Classification, Multiple-choice Question/Answering | Precision, Recall, F1 score, Accuracy |
| Text Generation | Summarization, Question/Answering, Translation | BLEU, ROUGE, BERTScore |
| Token Classification | Named Entity Recognition, Part-of-Speech Tagging | Span-based Precision, Recall, F1 score |

*This table is not an exhaustive list of evaluation metrics to consider.

Remember that when evaluating NLP models like LLMs, the metrics to consider are specific to the task you’re examining; the table above shows example metrics for three common tasks.
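
For classification-style tasks, for instance, the standard metrics are straightforward to compute with scikit-learn; the labels below are toy data used only to show the mechanics.

```python
# Sketch: computing accuracy, precision, recall, and F1 with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "positive", "neutral", "negative"]  # reference labels
y_pred = ["positive", "negative", "neutral", "neutral", "positive"]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```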

How to evaluate AI models & systems on your own

If you choose to conduct an independent analysis and evaluate AI models on your own, here’s what to consider. 

Decide what matters most to you

Start by defining the key performance indicators (KPIs) most relevant to your business needs. Fox recommends that organizations "first ask what they care about and what they want to optimize for." Some businesses might prioritize accuracy and speed rather than the ability to handle specific jargon or dialects. When building speech-to-text applications, for example, you might care more about capturing proper nouns than real-time transcription.

Set up a testing framework

Test each of the solutions to see which performs best according to your needs. To do this fairly, you'll need a testing framework. "You want to collect data, write a testing framework, and calculate metrics that capture the quality that you care about," says Fox. This involves gathering a dataset that reflects the real-world scenarios in which the AI system will operate.
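
As a rough sketch of what such a framework might look like for speech-to-text, the harness below runs two providers over the same test set and reports average WER and real-time factor. The provider functions and the test set are hypothetical placeholders, and the open-source jiwer package is assumed for WER.

```python
# Rough sketch of a provider comparison harness for speech-to-text.
# transcribe_with_provider_a / transcribe_with_provider_b are hypothetical
# placeholders for whatever SDKs or APIs you actually call.
import time
from jiwer import wer  # open-source WER implementation

def benchmark(transcribe, test_set):
    """test_set: list of (audio_path, reference_transcript, audio_seconds) tuples."""
    total_wer, total_rtf = 0.0, 0.0
    for audio_path, reference, audio_seconds in test_set:
        start = time.time()
        hypothesis = transcribe(audio_path)   # provider call
        elapsed = time.time() - start
        total_wer += wer(reference, hypothesis)
        total_rtf += elapsed / audio_seconds
    n = len(test_set)
    return {"avg_wer": total_wer / n, "avg_rtf": total_rtf / n}

results = {
    "provider_a": benchmark(transcribe_with_provider_a, test_set),
    "provider_b": benchmark(transcribe_with_provider_b, test_set),
}
print(results)
```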

Evaluate and A/B test different models

A/B testing can help compare different AI models or versions in a live environment. This provides clear insights into which model performs better in real-world conditions (which is ultimately what you care about).
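
One common pattern, sketched below, is to hash each user ID into a bucket so every user is consistently routed to the same model, then log quality metrics per variant; the split ratio and metric names here are illustrative.

```python
# Sketch of deterministic A/B assignment for comparing two models on live traffic.
import hashlib

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Hash the user ID into [0, 1) and route to model A or B."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    return "model_a" if bucket < split else "model_b"

# Log the assigned variant alongside your quality metrics (accuracy, latency,
# user ratings) so the two models can be compared on real-world requests.
print(assign_variant("user-42"))
```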

Watch out for common pitfalls

Using testing data that is not relevant to your application domain can skew results. You should also avoid relying on public datasets that AI models are frequently trained on; because these datasets tend to be academic, results on them are often a poor proxy for real-world performance.

To accurately evaluate AI systems, objective benchmarks are key to ensuring that performance is as close to human level as possible. Users can confidently make informed decisions by referring to transparent, unbiased assessments from third-party organizations or to internal assessments made with a consistent evaluation methodology.

In the absence of these options, it's important to examine organizations' self-reported numbers carefully. Pay close attention to self-reported evaluation pipelines and stay educated on what proper, objective evaluations look like, so you can determine whether an organization is objective, whether it is transparent about its evaluation method, and whether its reported numbers are accurate.

We stress the importance of independent evaluations at AssemblyAI, emphasizing the necessity of standardized, reproducible methodologies. As AI technology continues to evolve rapidly, the demand for reliable, impartial benchmarks will only increase. Embracing these practices drives the AI industry toward greater innovation and accountability. Ultimately, objective benchmarks empower stakeholders to choose the best AI solutions, driving meaningful advancements in any domain that uses these solutions.

Explore AssemblyAI's latest benchmarks here.

Disclaimer: This article focuses on how to evaluate Speech AI systems and is not a comprehensive resource on how to evaluate all AI systems. We acknowledge that the term “AI systems” is applicable across modalities including speech, text, image, video, etc., and each space has its own ways of evaluating models.