Article·AI Engineering & Research·May 3, 2024

Lies, damn lies, and benchmarks

By Jose Nicholas Francisco and Josh Fox
Published May 3, 2024
Updated Jun 13, 2024

Much like doping in sports, benchmark cheating has become common practice in today’s high-tech world (and, in a number of well-publicized cases, outside it). Competitors jockey for position on benchmarking leaderboards in endless pursuit of any marketable edge that might tip the scale, even if it means putting a thumb on said scale, and boost their business fortunes.

From CPUs to GPUs to the entire smartphone industry, back to...CPUs, benchmark gamesmanship has a long and storied history that now extends well beyond traditional hardware and software domains into the present and looming next big thing—artificial intelligence.

While much has been written of late concerning the utility and reliability of LLM benchmarking leaderboards, the same questions apply in other AI domains such as voice AI. Can we trust an AI model’s benchmark? Is it a reliable evaluation metric that allows us to make critical decisions, such as which model is best for a particular use case? Or do we risk falling victim to Goodhart’s Law, and REALLY need to consider a bigger, multi-faceted picture to answer such questions?

We’ve personally talked a lot about benchmarks, from benchmarking open-source ASR models, to evaluating various LLM benchmarks, to comparing our own AI models with others.

What we’ve found, however, is that while there is tremendous utility in employing benchmark evaluations in the development and/or selection of different AI models, it’s critically important to understand their limitations. It’s fairly easy to game the system in order to produce the highest benchmark score possible, and as the competitive landscape grows along with the total addressable market size of the industry, the pressure to find and market a differentiable advantage increases by the day.

And so, because benchmark scores seemingly provide an easily understandable way to market whose model is best (See this single quantifiable number that’s bigger than that other single quantifiable number!), the temptation to “game the system” or “stack the deck” becomes all too often irresistible.

But if we can’t trust benchmarks due to data manipulation or other forms of mathematical sleight-of-hand, then how can we really know which AI models work best in practice? How can we tell which of the graphs below are trustworthy?

Let’s walk through this problem together:

How to lie when benchmarking

We’ll say it up front: Benchmarking data should closely resemble real-life data as much as possible since customers will use AI models in real-world settings. 

As many a statistician will tell you, it’s rather simple to lie about statistics. All you have to do is play around with a few numbers, or use metrics that are incongruent with what you’re measuring.

We won’t go too deep down the benchmark-cheating rabbit hole, but here is a small sample of notable examples in the world of tech:

  • An arXiv paper on “benchmark leakage” and other LLM cheating techniques

  • A Forbes article on MediaTek, a wireless communication hardware company that artificially pumped up their benchmark results by 30-75%

  • A video on how LLM Leaderboards are full of cheaters trying to climb up the ranks

  • MIT’s take on the story of “Machine Learning’s First Cheating Scandal”

  • A news report from 2017 on how cheaters led Google to deprecate the Octane Javascript benchmark


In the field of AI, here are a few manipulative techniques to be on the lookout for as you carefully evaluate AI model evaluations.

Tactic 1: Cherry-picking the test data

A commonly employed tactic (even in academic circles) is the selective inclusion or exclusion (i.e. cherry-picking) of specific test data to purposefully influence the results of model evaluation. When carried out with willful intent, this form of manipulation can mislead end users and developers alike by exaggerating performance metrics or masking deficiencies, leading to evaluation results that may not represent real-world use cases.

In the context of evaluating speech recognition models, if you only benchmark with test data consisting of perfect audio—like an audiobook read by a professional voice actor in a soundproof room on a high-tech microphone—then you’re likely to achieve a low word error rate (WER) result. However, you’ll have absolutely no idea how the model actually performs on real-world data like Zoom calls, customer service phone calls, or live streamed conference events where background noise and varying environments may have a substantial impact on a given model’s performance. Thus, as mentioned earlier, benchmarking data should closely resemble real-life data as much as possible, since customers will use AI models in real-world settings.
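
For readers who have not computed WER by hand, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name and the two sample transcripts are hypothetical and chosen only for illustration; production evaluations would typically also normalize the text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical outputs for the same sentence under two recording conditions.
reference = "please transfer me to the billing department"
clean_audio_output = "please transfer me to the billing department"
noisy_call_output = "please transfer me to billing apartment"

clean_wer = word_error_rate(reference, clean_audio_output)
noisy_wer = word_error_rate(reference, noisy_call_output)
print(f"clean audio WER: {clean_wer:.2f}, noisy call WER: {noisy_wer:.2f}")
```

A benchmark built only from recordings like the first one would report the first number and never surface the second.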

Tactic 2: Reporting only averages

In middle school, we learn all about averages. Averages are simple. They’re comfortable. They’re easy to calculate. Just add up all the values in a list and divide by the total number of values you have.

How could this possibly be misleading?

Well, in certain contexts, taking the arithmetic mean doesn’t make sense. For example, the average atom has around 100 protons—meaning, on average, everything is made of radioactive Fermium.

Second of all, a lot gets lost in the average. Simple “summary” statistics don’t tell the whole story. Perhaps the most famous example of their shortcomings is Anscombe’s quartet: a group of four vastly different datasets whose descriptive statistics (mean, variance, line of best fit, etc.) are nearly identical.

And, of course, when data is skewed like in the graphs below, the average will be dragged toward the long tail of the distribution, away from where most of the data points sit. If the data is left-skewed, the mean will be dragged to the left. And if the data is right-skewed, the mean will be dragged to the right.
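
A small numeric sketch makes the effect of skew concrete. The per-file WER values below are invented for illustration: most files are easy, and two are very hard, which gives a right-skewed distribution.

```python
from statistics import mean, median

# Hypothetical per-file WER values (%) with a long right tail.
right_skewed_wer = [4, 5, 5, 6, 6, 7, 8, 35, 60]

avg = mean(right_skewed_wer)
mid = median(right_skewed_wer)

# Count how many files actually score better (lower) than the mean.
files_below_mean = sum(wer < avg for wer in right_skewed_wer)

print(f"mean={avg:.1f}  median={mid}  "
      f"files below the mean: {files_below_mean}/{len(right_skewed_wer)}")
```

In this toy set the mean sits well above the median, and most files score better than the "average" file, so the mean alone describes almost none of them.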

In the context of AI, when measuring word error rate (WER) on a speech-to-text model, only reporting average scores is misleading. First of all, we don’t know if the WER is normally distributed across various audio files. The average result can be easily gamed simply by selectively omitting files a model performs poorly on, or similarly, adding specific files you’ve found your competitors struggle with (see also: Tactic #1).
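
To see how little effort this kind of gaming takes, consider the sketch below. The file names, the WER values, and the 10% cutoff are all made up for illustration; the point is only that quietly dropping the hardest files changes the headline number dramatically.

```python
from statistics import mean

# Hypothetical per-file WER (%) for one model on a ten-file test set.
per_file_wer = {
    "audiobook_01": 3.0,
    "audiobook_02": 4.0,
    "podcast_01": 5.0,
    "podcast_02": 6.0,
    "webinar_01": 7.0,
    "webinar_02": 8.0,
    "meeting_01": 9.0,
    "phone_call_01": 22.0,
    "phone_call_02": 31.0,
    "street_interview_01": 45.0,
}

honest_average = mean(per_file_wer.values())

# "Curate" the test set by silently dropping every file above an arbitrary cutoff.
kept = {name: wer for name, wer in per_file_wer.items() if wer <= 10.0}
cherry_picked_average = mean(kept.values())

print(f"all {len(per_file_wer)} files: {honest_average:.1f}% WER")
print(f"after dropping {len(per_file_wer) - len(kept)} files: "
      f"{cherry_picked_average:.1f}% WER")
```

When a vendor reports a single average, it is worth asking how many files were in the test set and how they were selected.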

And second, if you only know the average number of words an ASR model transcribes incorrectly, you’re limiting the amount of information you learn about the expressive power of the underlying model. Which types of audio does it perform best on? Cell phone calls? Audiobooks? Movies? What makes the model perform poorly? Background noise? Accents?

You’ll need much more than a simple mean score to truly understand how the model will perform in practice.
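
One simple way to get past a single mean is to tag each test file with its recording condition and report a score per group. The sketch below assumes you have such tags; the condition names and WER values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (condition, WER %) results for individual test files.
results = [
    ("audiobook", 3.0), ("audiobook", 5.0),
    ("phone_call", 14.0), ("phone_call", 18.0),
    ("accented_speech", 21.0), ("accented_speech", 23.0),
]


def wer_by_condition(rows):
    """Group per-file WER by condition and average within each group."""
    grouped = defaultdict(list)
    for condition, wer in rows:
        grouped[condition].append(wer)
    return {condition: mean(values) for condition, values in grouped.items()}


overall = mean(wer for _, wer in results)
breakdown = wer_by_condition(results)

print(f"overall: {overall:.1f}%")
for condition, wer in breakdown.items():
    print(f"  {condition}: {wer:.1f}%")
```

The overall figure hides the fact that, in this made-up data, the model is several times worse on phone calls and accented speech than on audiobooks.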

Tactic 3: Training with the test data

An even more egregious form of manipulation can occur when vendors selectively include the benchmark test sets themselves in the training data they use to build their models. In such a scenario, a speech-to-text provider could literally train models that reach 0% WER on certain benchmarks by training on those benchmark data sets for long enough.

This is certainly harder to do with closed benchmark data sets, but trivial to do with open ones. Also known as benchmark leakage, it’s essentially on par with a teacher giving out part or all of the answer key to their students before their big final exam. 

Note: The arXiv paper mentioned above delves exactly into this problem and discusses how widespread it could really be, especially on LLM leaderboards.
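
If you do have access to both a training corpus and a benchmark (for example, when auditing your own models), one common heuristic for spotting leakage is to measure how many word n-grams of each test item also appear in the training text. The sketch below is an illustrative assumption, not the method of the paper mentioned above; the function names, the choice of 5-grams, and the sample sentences are all hypothetical.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All consecutive n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, training_corpus: list[str], n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the training corpus."""
    test_grams = word_ngrams(test_item, n)
    if not test_grams:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    train_grams: set[tuple[str, ...]] = set()
    for document in training_corpus:
        train_grams |= word_ngrams(document, n)
    return len(test_grams & train_grams) / len(test_grams)


# Hypothetical training documents and benchmark items.
training_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "speech recognition models are evaluated with word error rate",
]
leaked_item = "quick brown fox jumps over the lazy dog"
unseen_item = "customers often call from noisy train stations and airports"

leaked_score = overlap_fraction(leaked_item, training_docs)
unseen_score = overlap_fraction(unseen_item, training_docs)
print(f"leaked item overlap: {leaked_score:.2f}, unseen item overlap: {unseen_score:.2f}")
```

A high overlap does not prove intent, and outside evaluators rarely see a vendor's training data, which is why testing on your own private data remains the more practical safeguard.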

If a model reportedly produces absolutely incredible results on an open benchmark (or any benchmark really) but performs poorly on data you’ve collected yourself, it wouldn’t be unreasonable to suspect that something is awry. Whether it’s a case of willful deceit (e.g. training on and overfitting to the benchmark data in question to achieve inflated scores) or an example of a benchmark data set that is “out of distribution” for your intended problem domain, it’s imperative to ask questions when benchmark results don’t translate to real-world examples.

Spotting trustworthy benchmark results

Box Plots, not just averages

In the intricate world of data analysis and AI, the choice of statistical tools can significantly impact our understanding and interpretations. Boxplots emerge as a powerful ally, offering a richer, more detailed view of data compared to the traditional average. Here’s why:

  • Comprehensive Data Visualization: Boxplots illustrate the minimum, first quartile, median, third quartile, and maximum values of a dataset. This visualization provides a holistic view of the data's spread, central tendency, and variability at a glance, a dimension averages fail to capture.

  • Outlier Identification: One of the standout features of boxplots is their ability to make outliers explicitly visible. These points, which deviate significantly from the rest of the data, can skew averages and lead to misleading conclusions. Boxplots display these outliers as individual points, allowing for a more accurate interpretation of the data's true nature.

  • Five-Number Summary (as opposed to just one): At the heart of every boxplot is the five-number summary, a concise yet comprehensive portrayal of a dataset's distribution. This summary includes the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum values. These metrics provide a skeleton of the dataset, highlighting its range, central tendency, and variability.

  • Visual Representation of Spread: Unlike the average, which amalgamates data into a single value, boxplots visually depict the spread of the data. The distance between Q1 and Q3, known as the interquartile range (IQR), offers a clear view of the dataset's concentration, allowing for immediate identification of data dispersion.
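
The five-number summary and IQR described in the list above can be computed with the Python standard library. This is a minimal sketch on invented per-file WER values; note that plotting tools differ slightly in how they interpolate quartiles, so the "inclusive" method used here is one reasonable convention rather than the only one.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return min, Q1, median, Q3, max, and the IQR for a list of numbers."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
        "iqr": q3 - q1,
    }


# Hypothetical per-file WER values (%).
per_file_wer = [4, 5, 5, 6, 6, 7, 8, 35, 60]
summary = five_number_summary(per_file_wer)
print(summary)
```

Compared with the single mean of these values (about 15), the summary shows that half of the files fall within a narrow band while the maximum sits far above it.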

As much as possible, Deepgram always uses box plots to showcase accuracy. The data is more interpretable, and customers are able to glean a more complete picture of every ASR provider’s performance.

On the surface, boxplots may be neither as succinct nor as clean and crisp as a single number; however, when trying to pick an AI model that will work the best in practice, understanding real-world performance in a comprehensive way is critical.

Proper normalization strategy

Data normalization is an important practice frequently used in natural language processing (NLP) and speech recognition technologies. It's a process aimed at reducing the complexity and variability of text data, making it more digestible for algorithms and models.

By simplifying text data, normalization significantly reduces the workload on algorithms and models. Furthermore, by normalizing linguistic quirks such as non-standard words and filler words, we can ensure a proper apples-to-apples comparison across various AI models during benchmark evaluation.

For example, the precise handling of capitalization and punctuation marks is foundational for interpreting the semantic meaning in text data. This normalization aspect ensures clarity in distinguishing between proper nouns and common nouns, as well as understanding sentence boundaries in speech recognition tasks.

And in speech recognition itself, speech-to-text systems often grapple with the absence of visual cues for capitalization and punctuation. The strategy involves inferring these elements from the context and intonation, which adds a layer of complexity to accurate transcription.

After all, for your particular use case, it might not matter whether a speech-recognition model outputs “deep-fried” or “deep fried,” so there would be no reason to algorithmically punish the model that omits the hyphen.
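
As a rough sketch of what such a normalization step might look like before scoring, the function below lowercases the text, treats hyphens as spaces, strips punctuation, and drops a few filler words. The filler-word list and the specific rules are assumptions for illustration; the right rules depend on your use case (if capitalization or punctuation matters to you, do not strip it), and the same rules must be applied to every model's output and to the reference.

```python
import re

# Illustrative filler words only; a real list would be tuned to the domain.
FILLER_WORDS = {"um", "uh", "erm"}


def normalize_transcript(text: str) -> str:
    """Apply one consistent set of normalization rules to a transcript."""
    text = text.lower()
    text = text.replace("-", " ")         # "deep-fried" becomes "deep fried"
    text = re.sub(r"[^\w\s']", "", text)  # drop punctuation, keep apostrophes
    words = [word for word in text.split() if word not in FILLER_WORDS]
    return " ".join(words)


model_a_output = "Um, I'd like the Deep-Fried chicken."
model_b_output = "I'd like the deep fried chicken"

normalized_a = normalize_transcript(model_a_output)
normalized_b = normalize_transcript(model_b_output)
print(normalized_a)
print(normalized_a == normalized_b)
```

After normalization the two outputs are identical, so neither model is penalized for a formatting difference that does not matter to this hypothetical use case.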

Identifying outliers and interpreting data

Outliers—the data points that stand apart from the rest—often carry with them stories that could fundamentally alter our understanding of datasets. Boxplots often shine a spotlight on these outliers, marking them conspicuously as individual points that extend beyond the "whiskers." 

This visual cue is not merely an aesthetic choice; it serves a critical function in data analysis and AI modeling. By isolating these outliers, boxplots allow data scientists to scrutinize their impact on the dataset and consider potential biases that could skew interpretations or model predictions.
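
A widely used convention (Tukey's rule, and the default in many plotting libraries) draws the whiskers at 1.5 times the IQR beyond the quartiles and marks anything outside them as an outlier. The sketch below applies that convention to invented WER values; it is an assumption that any particular published boxplot uses the same multiplier.

```python
from statistics import quantiles


def find_outliers(values, whisker=1.5):
    """Return the values that fall beyond whisker * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical per-file WER values (%).
per_file_wer = [4, 5, 5, 6, 6, 7, 8, 35, 60]
outliers = find_outliers(per_file_wer)
print(outliers)
```

Once the outlying files are identified, the useful next step is to listen to them and find out what they have in common, such as heavy background noise or a particular accent.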

Note: Many companies may not show these outlier points on published boxplots since they can clutter the graphs, making the data too messy to read. Nevertheless, even without these outlier points, the boxplots contain significantly more information than a simple average.

  • Mitigating Misinterpretation: The influence of outliers on averages can lead to misleading conclusions. For instance, a few extremely high pollution readings could elevate the average, suggesting a more dire overall water quality situation than is typical. Boxplots, by distinguishing outliers, help prevent such over-generalizations.

  • Informed AI Modeling: In AI, understanding outliers is crucial for developing robust models. An outlier could represent an error in data collection, a rare event, or a new trend emerging. Identifying these through boxplots allows AI practitioners to make informed decisions—whether to include these outliers in model training or treat them as exceptions.

  • Bias Reduction: Recognizing and analyzing outliers is vital in reducing biases in AI models. For example, if an AI model for predicting river water quality were trained without accounting for outliers, it might fail to accurately predict extreme pollution events, which could be the most critical to detect.

  • Enhanced Data Preprocessing: The clear depiction of outliers in boxplots aids in data preprocessing for AI modeling. Decisions on data normalization, outlier removal, or feature engineering can all be informed by the patterns and anomalies highlighted in a box plot analysis.

Benchmarking on real-world data

The quest for achieving high-performing, reliable, and ethically sound AI systems has led to an increased emphasis on benchmarking these systems against real-world data, as opposed to solely relying on curated datasets. 

This shift is motivated by the need to ensure that AI technologies are capable of operating effectively and equitably in the complex, diverse, and often unpredictable environments they are deployed in.

Curated datasets often fall short in representing the complexity and variability of real-world scenarios. In domains like healthcare, not all patients are going to have ideal health. In speech recognition, not all audio will have the crisp, crystal-clear articulation you hear in movies. And with regard to self-driving cars, not all pedestrians or vehicles will behave perfectly in the real world.

Thus, testing on real-world data becomes critical.

Part of the reason LLMs like ChatGPT are so effective is that the datasets they trained on include vast volumes of real-world data. The Pile dataset, for example, does not only include perfectly written academic papers and Pulitzer Prize-winning novels. It also includes YouTube subtitles, which include linguistic quirks like filler words. It includes GitHub repositories, bugs and all. It includes web pages linked from Reddit posts up until 2020.

The point is this: We shouldn’t shun perfect audio recordings or well-written works when benchmarking. Rather, we should include them alongside real-world data, where everything is messy and noisy—just like the environments we’ll be deploying our ML models in.

Thus, each of the aspects above underscores the critical need for the AI research and development community to prioritize the integration of real-world data into their benchmarking processes. By doing so, they ensure that AI technologies are not only innovative but also ethically sound and reliable in the diverse environments they serve.

So how can we ensure that we’re getting our money’s worth with AI models?

There exist a few ways to ensure that the AI models you purchase for yourself or your company are reliable.

First of all, we strongly encourage benchmarking various models on a sample of data you’ve collected yourself. Whatever your personal use case is, take some of your collected data and input it into a few competing models of your choosing. BE CAUTIOUS WITH OPEN BENCHMARK DATA SETS AND ASSUME VENDORS MAY HAVE TRAINED ON THE VERY DATA THEY’RE TESTED AGAINST! Assuming the data you have on-hand isn’t already publicly available, you should get a much more legitimate image of how these models perform in the real world.
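
A mini-benchmark like this can be a few lines of code. The sketch below is deliberately generic: each "model" is just a function that takes an audio file path and returns a transcript, and the metric is any function that scores a hypothesis against a reference. The vendor names, canned outputs, file names, and the placeholder exact-match metric are all hypothetical; in practice you would wrap each provider's real SDK or API and plug in a proper WER function.

```python
def benchmark_models(models, samples, metric):
    """Score every model on every (audio_path, reference_transcript) sample."""
    scores = {}
    for name, transcribe in models.items():
        scores[name] = [
            metric(reference, transcribe(audio_path))
            for audio_path, reference in samples
        ]
    return scores


def exact_match_error(reference, hypothesis):
    """Placeholder metric: 0.0 for a perfect transcript, 1.0 otherwise."""
    return 0.0 if reference == hypothesis else 1.0


# Your own private samples: (audio file, human-verified transcript).
samples = [
    ("call_001.wav", "cancel my subscription"),
    ("call_002.wav", "what is my balance"),
]

# Stand-ins for real vendor calls, returning canned transcripts.
models = {
    "vendor_a": {"call_001.wav": "cancel my subscription",
                 "call_002.wav": "what is my balance"}.get,
    "vendor_b": {"call_001.wav": "cancel my prescription",
                 "call_002.wav": "what is my balance"}.get,
}

scores = benchmark_models(models, samples, exact_match_error)
for name, per_file in scores.items():
    print(name, per_file)
```

Keeping the per-file scores, rather than collapsing them into one number right away, lets you look at the full distribution for each vendor afterward.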

Most AI providers out there will at the very least provide a free trial of their models, so this mini-benchmarking method should be decently accessible to individual developers.

If you don’t see results similar to what’s implied by a particular vendor’s benchmarks, then that’s a sign their model may not deliver the performance you need.

The power of testing is in your hands, and any AI provider worth their salt should encourage you to put their models to the test free of charge using your own non-public, real-world data. That’s why, whenever we publish our internally conducted benchmarks, we encourage our audience to test for themselves and validate that they see similar results across vendors. Always remember: benchmarks don’t lie, but liars use benchmarks. The key is to know what to look for to detect such deception. Be a skeptic. Don’t just trust; always verify.
