Humans regularly take standardized tests. They’re imperfect, but they often point us in the right direction when it comes to understanding what a person might excel at, or at least give us an idea of how well people digest and retain information.

For example, the U.S. Scholastic Aptitude Test (SAT) and the American College Testing (ACT) exam aim to measure a high school student’s likelihood of success in a rigorous higher education environment. The Graduate Record Exam (GRE) seeks to assess a college graduate’s chances of doing well in grad school. The Bar Exam ensures that would-be practitioners of the legal arts are, well, sufficiently practiced to engage in the art and business of law. Doctors have to pass board exams, first upon certification and then on a routine basis, to ensure that they’re keeping up with current developments in their particular area of medicine. In fields ranging from financial advisory to cheese tasting, there are countless standardized ways for humans to test their mettle.

What else do these tests have in common? GPT-4 and other large language models can do quite well on them, better than most humans in some cases. 

Again, are any of these standardized tests of human knowledge and raw cognitive horsepower perfect? Of course not. Are better results on these tests, in general, correlated with a greater capacity to learn a new field or—indeed—to simply validate the base-level competence required to practice within high-stakes fields like medicine and law? Sure, mostly, albeit with some outliers. No test is perfect, and neither are we. 

But that hasn’t stopped folks from coming up with ever-more intricate standardized tests of AI performance in service of being able to objectively compare the capabilities of one piece of software to another. 

To be clear, we’re in the early stages of what promises to be a transformative Language AI boom, which prompts the following questions: What, exactly, are we expecting out of language models? And, given the expansive field of language models, which player is “the best”? Or, to put a finer point on it, is there even such a thing as “the best”?

The short answer to these questions is: Well, it depends. The longer answer is what follows. This article also serves as the central “hub” page, connecting (and collecting) an ever-growing number of reports and articles highlighting specific LLM benchmarks. That collection of articles can be found at the bottom of this page. 

Ready? Let’s go.

What Even Is an LLM Benchmark?

In this context, a benchmark is simply a standardized software performance test; it just happens that the software in question is an AI language model. It follows, then, that most language model benchmarks are based on completing specific natural language processing (NLP) tasks.

By creating a common set of tests and sample data to measure the performance of one language model versus another, developers and the broader user community get quantitative data on the capabilities of any given model, at least when it comes to tackling a particular set of tasks. 
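
To make this concrete, here is a minimal sketch, in Python, of what a benchmark harness boils down to: a fixed set of test cases with reference answers, the model under test, and a scoring rule. The model_answer function is a hypothetical placeholder for whichever model you happen to be evaluating.

```python
# A minimal benchmark harness: fixed test cases, a model under test, and a scoring rule.
# model_answer() is a hypothetical stand-in for a call to whatever LLM you're evaluating.

TEST_CASES = [
    {"prompt": "What is the capital of France?", "reference": "Paris"},
    {"prompt": "How many legs does a spider have?", "reference": "8"},
]

def model_answer(prompt: str) -> str:
    """Placeholder for the model under test (e.g., an API call)."""
    raise NotImplementedError

def exact_match(prediction: str, reference: str) -> bool:
    """A simple scoring rule: case-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark() -> float:
    """Return accuracy of the model on this (tiny) benchmark."""
    correct = sum(
        exact_match(model_answer(case["prompt"]), case["reference"])
        for case in TEST_CASES
    )
    return correct / len(TEST_CASES)
```

Real benchmarks swap in much larger test sets and more sophisticated scoring rules, but the skeleton is the same.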

Rather than relying on subjective factors like the “look and feel” of model outputs, a well-structured benchmark enables an objective assessment of model performance, thus removing a fair bit of human bias in the process. Put simply: developers and everyday users can pick the model(s) best suited to their particular needs based on an apples-to-apples comparison of competing model providers or even different versions of the same base model (e.g., 7B vs. 13B parameters, generic pre-trained vs. fine-tuned for chat, and so on).

Benchmarks are valuable to developers and consumers because they provide an objective measure to make buying and implementation decisions. But their value to the AI research community may be even greater. After all, if a company is in the business (or study) of moving the field forward, there must be some measurable consensus around the current state of the art (SOTA). Benchmarks provide insights into areas where a model excels and tasks where the model struggles. 

With the increasing use of LLMs in various sectors, from customer service to code generation, the need for clear, understandable performance metrics is paramount. Benchmarks offer this transparency, clearly showing what users can expect from a given model.

A Brief History of AI and LLM Benchmarks

The artificial intelligence field as we know it traces its intellectual roots all the way back to the mid-1950s. Keep in mind: the 1955 proposal for the Dartmouth Summer Research Project on Artificial Intelligence—in which the term “artificial intelligence” first appeared, alongside concepts like “neuron nets”, computational use of human language, and computational self-improvement—was written less than a decade after ENIAC ushered in the digital computing era.

It took decades for AI practice to catch up with AI theory. But, backed by a tidal wave of capital and astonishing advances in computing hardware, it eventually did. Before we can talk about the recent history of LLM benchmarks, though, we have to start with the earlier days of the NLP field.

1960s-1970s: Early Machine Translation

One of the earliest forms of NLP, machine translation, was an active area of research throughout the 1960s and 1970s. These early systems were rule-based, and their evaluation was often done manually. BLEU (BiLingual Evaluation Understudy), the now-standard metric for automatically evaluating machine translation, wasn’t introduced until 2002.
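
For a rough sense of how BLEU works, here is a sketch using NLTK’s implementation: the metric scores n-gram overlap between a candidate translation and one or more reference translations. The example sentences are made up for illustration.

```python
# Sketch: sentence-level BLEU with NLTK (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "sat", "on", "the", "mat"]]   # one or more reference translations
candidate = ["the", "cat", "is", "on", "the", "mat"]       # machine-translated output

# BLEU measures n-gram overlap (here up to 4-grams); smoothing avoids zero scores
# when short sentences have no matching higher-order n-grams.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```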

1980s-1990s: Bag-of-Words Models

As statistical methods began to be applied to NLP, benchmarks based on tasks like text classification and information retrieval emerged. These tasks often used a “bag of words” approach, ignoring word order and other syntactic information. Precision, recall, and F1 score were standard metrics.
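
If you want to see the arithmetic behind those three metrics, here is a quick sketch using scikit-learn on a toy set of gold and predicted labels (the labels are invented for illustration).

```python
# Sketch: the classic IR/classification metrics on toy labels (pip install scikit-learn).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # gold labels (e.g., "relevant" vs. "not relevant")
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # system predictions

precision = precision_score(y_true, y_pred)  # of predicted positives, how many were right
recall = recall_score(y_true, y_pred)        # of actual positives, how many were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```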

2000s: Sequence Models and Named Entity Recognition

More advanced statistical models, like Hidden Markov Models and later Conditional Random Fields, led to benchmarks for tasks like part-of-speech tagging and named entity recognition. The CoNLL shared tasks are a good example of the competitions that set benchmarks during this era.
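
To give a flavor of how CoNLL-style named entity recognition is scored, here is a short sketch using the seqeval library, which computes entity-level precision, recall, and F1 over BIO-tagged sequences; the tag sequences below are invented for illustration.

```python
# Sketch: entity-level evaluation of BIO-tagged sequences (pip install seqeval).
from seqeval.metrics import classification_report, f1_score

# Gold and predicted tags for two sentences, in CoNLL-style BIO format.
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"],     ["O", "B-ORG", "I-ORG", "O"]]

print(f"entity-level F1: {f1_score(y_true, y_pred):.2f}")
print(classification_report(y_true, y_pred))
```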

Early 2010s: Word Embeddings

The introduction of word embeddings (like Word2Vec in 2013 and GloVe in 2014) allowed for more sophisticated representations of word meaning. Benchmarks for these models included similarity and analogy tasks, measuring how well the embeddings captured semantic and syntactic relationships.
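
Here is a brief sketch of what those similarity and analogy probes look like in practice, using pretrained GloVe vectors fetched through Gensim’s downloader; the specific vector set (glove-wiki-gigaword-100) is just one convenient choice from the gensim-data repository.

```python
# Sketch: similarity and analogy probes on pretrained GloVe vectors (pip install gensim).
import gensim.downloader as api

# Downloads the pretrained vectors from the gensim-data repository on first run.
vectors = api.load("glove-wiki-gigaword-100")

# Similarity: cosine similarity between word vectors.
print(vectors.similarity("cat", "dog"))

# Analogy: king - man + woman should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```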

Mid 2010s: Attention Models and Question Answering

The advent of neural network-based models, culminating in the transformer architecture in 2017, led to a new era of NLP benchmarks. Tasks included machine translation (BLEU) and sentiment analysis (accuracy on IMDB or SST-2 datasets), among others.

As the size of training datasets and language model parameter counts increased, it became clear that LMs could field answers to questions posed by users. The Stanford Question Answering Dataset (SQuAD) was not the first question-answering benchmark, but it was comprehensive. To quote the original paper: SQuAD is a set of “100,000+ questions posed by crowd workers on a collection of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.” This wealth of standardized data helped SQuAD become a blue-chip benchmark for LMs’ ability to answer questions and served as the blueprint for future datasets aiming to measure proficiency at even more niche question-answering tasks.
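
As a quick illustration of what SQuAD-style question answering looks like in code, here is a sketch that pulls the dataset from the Hugging Face Hub and runs an off-the-shelf extractive QA pipeline on a single example; the pipeline’s default model is whatever the transformers library ships with at the time.

```python
# Sketch: extractive question answering on a SQuAD example
# (pip install datasets transformers).
from datasets import load_dataset
from transformers import pipeline

squad = load_dataset("squad", split="validation")
example = squad[0]

qa = pipeline("question-answering")  # uses the library's default extractive QA model
result = qa(question=example["question"], context=example["context"])

print("question: ", example["question"])
print("predicted:", result["answer"])
print("gold:     ", example["answers"]["text"])
```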

Late 2010s: GLUE and SuperGLUE

These benchmarks, introduced in 2018 and 2019, respectively, consist of multiple NLP tasks and aim to measure a model's overall language understanding capability. 

The General Language Understanding Evaluation (GLUE) benchmark consists of various natural language understanding tasks, including sentiment analysis, paraphrase detection, and more. Each task comes from a different existing dataset, and models are evaluated on their average performance across all tasks.

SuperGLUE, a more challenging benchmark, was introduced following the success of transformer-based models on the GLUE benchmark. SuperGLUE includes tasks that require more complex understanding, such as reading comprehension, commonsense reasoning, and more.
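
Both suites (and their official metrics) are straightforward to experiment with via the Hugging Face datasets and evaluate libraries. Here is a minimal sketch using MRPC, GLUE’s paraphrase-detection task, with dummy predictions standing in for a real model’s outputs.

```python
# Sketch: loading a GLUE task and computing its official metric
# (pip install datasets evaluate scikit-learn).
from datasets import load_dataset
import evaluate

mrpc = load_dataset("glue", "mrpc", split="validation")  # paraphrase detection
metric = evaluate.load("glue", "mrpc")                   # accuracy + F1 for this task

# Dummy predictions (always "paraphrase") in place of a real model's outputs.
predictions = [1] * len(mrpc)
print(metric.compute(predictions=predictions, references=mrpc["label"]))
```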

2020s: Expanded Capabilities, Ethics, and Explainability

And now we arrive at the contemporary era, one dominated by well-resourced, commercially available models—think: GPT-3 and its chatty cousins, Bard, Claude, and others—on one side, and, on the other, wave after successive wave of open-source contenders angling to give the likes of OpenAI, Google, and Anthropic a run for their money.

These models can be pretty big. GPT-4, for instance, is estimated to have over 1.7 trillion parameters… roughly ten times GPT-3’s parameter count. And as a general rule, larger models exhibit more capabilities than smaller ones. Suffice it to say that modern LLMs are capable of a lot more than basic named entity recognition, translation, and question-answering tasks. They’re writing code, passing professional examinations with flying colors, and helping people be more productive.

As language models have become more powerful, there's an increasing focus on benchmarks that measure both performance and ethical aspects like fairness and bias. There's also interest in explainability, or how well a model can provide understandable reasons for its outputs. The landscape of benchmarks covering these emerging areas of inquiry is rapidly evolving. 

It bears repeating that at the time of writing, we’re about a year into the Generative AI boom, and the pace of innovation is (almost) impossible to follow. Since the future capabilities of language models (large, small, or otherwise) are difficult to predict, making guesses about the future of benchmarking model performance is perhaps even more of a challenge. That said, as artificial intelligence grows increasingly “intelligent,” rest assured that for every new capability that emerges, there will probably be a new benchmark that sets the standard.

What Benchmarks Don’t Measure

The closest analogy to AI benchmarks is the standardized testing most folks reading this are all too familiar with. And much like humans, a language model will perform differently across a battery of benchmark tests. 

For instance, a model may perform exceptionally well on one benchmark emphasizing grammatical understanding but struggle with another requiring more advanced semantic knowledge. This discrepancy can arise due to the different focuses of each benchmark test, which are designed to assess various aspects of a model's capabilities. 

Performance is also a function of the model’s training data and the extent to which it has been fine-tuned to a particular domain of knowledge. One model may be exceedingly good at answering medical questions but barely get past “Hello World” on a Python programming test. A model trained on millions of pages’ worth of legal text may exhibit the capacity to extrapolate the logical structure of a contract but utterly fail at writing poetry in iambic pentameter.

This variation in performance across benchmarks highlights the importance of comprehensive benchmarking. No single test can fully capture an LLM's wide array of abilities and potential weaknesses. A combination of benchmarks, each testing different skills, provides a more complete picture of a language model's capabilities.

In sum, the role of benchmarks in assessing and improving LLMs cannot be overstated. They solve critical problems of performance evaluation, model development guidance, and transparency. They provide a multi-faceted perspective on LLMs, contributing significantly to advancing and understanding these complex AI systems.


Deepgram Articles Covering LLM Benchmarks

Below, we’ve included a list of articles published by Deepgram that explain the basics of various benchmarks. It’s alphabetical, based on the name of the particular benchmark. And as new benchmarks emerge, we’ll be sure to cover them and update the list below with new entries.

API-Bank: Benchmarking Language Models’ Tool Use

API-Bank introduces a novel approach to benchmarking Large Language Models' (LLMs) ability to integrate and utilize external tools, primarily APIs. Recognizing that foundational LLMs often lack specialized knowledge, API-Bank tests their capability to leverage APIs to enhance their performance. This benchmark evaluates LLMs on their decision-making in API calls, their proficiency in selecting the right tool for a given task, and their capacity to employ multiple APIs to fulfill broad user requests. Developed by Li et al. in April 2023, API-Bank sets the stage for future benchmarks, emphasizing the potential of LLMs when paired with external tools. Dive into the intricacies of this pioneering benchmark in Brad Nikkel's article.

The ARC Benchmark: Evaluating LLMs' Reasoning Abilities

The AI2 Reasoning Challenge (ARC) was conceived by Clark et al. in 2018 as a rigorous test for Large Language Models' (LLMs) question-answering capabilities. Unlike previous benchmarks that focused on simple fact retrieval, ARC emphasizes the integration of information across multiple sentences to answer complex questions. Comprising 7787 science questions designed for standardized tests, ARC is split into an "Easy Set" and a more challenging "Challenge Set." The benchmark aims to push LLMs beyond mere pattern matching, promoting human-like reading comprehension. Discover the significance of ARC and its role in advancing LLMs' question-answering abilities in Brad Nikkel's article on the topic.

HellaSwag: Understanding the LLM Benchmark for Commonsense Reasoning

HellaSwag, an acronym for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations, was introduced by Zellers et al. in 2019 to evaluate LLMs on commonsense natural language inference about physical situations. The benchmark employs "Adversarial Filtering" to generate challenging incorrect answers, making it tough for LLMs that rely heavily on probabilities. While humans consistently scored above 95% on HellaSwag, initial state-of-the-art models struggled, achieving accuracies below 50%. The benchmark emphasizes the importance of commonsense reasoning in LLMs and pushes for evolving benchmarks as LLMs improve. Delve into the intricacies of this commonsense reasoning benchmark in Brad Nikkel's guide.

HumanEval: Decoding the LLM Benchmark for Code Generation

The HumanEval dataset and pass@k metric have revolutionized LLM evaluation in code generation. Moving beyond traditional text similarity measures like BLEU, HumanEval focuses on the functional correctness of generated code. With 164 hand-crafted programming challenges, it offers a rigorous assessment of LLMs' coding capabilities. This benchmark underscores the shift towards evaluating AI's problem-solving efficiency over mere text mimicry. Explore this transformative approach in Zian (Andy) Wang's article.
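
For the curious, the unbiased pass@k estimator described in the HumanEval paper boils down to a few lines of NumPy. Here is a sketch, where n is the number of samples generated per problem, c is the number of those samples that passed the unit tests, and k is the budget being estimated.

```python
# Sketch: the unbiased pass@k estimator from the HumanEval/Codex paper (pip install numpy).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of the n generated samples were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 20 passed the tests.
print(pass_at_k(n=200, c=20, k=1))   # ≈ 0.10
print(pass_at_k(n=200, c=20, k=10))  # ≈ 0.66
```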

MMLU: Better Benchmarking for LLM Language Understanding

In response to LLMs quickly surpassing benchmarks like GLUE and SuperGLUE, Hendrycks et al. introduced the Measuring Massive Multitask Language Understanding (MMLU) benchmark. MMLU offers a broader evaluation of LLMs, assessing their understanding across diverse subjects, from the humanities to the hard sciences, at varying depths. Uniquely, MMLU tests specialized knowledge beyond the elementary level. With a repository of 15,908 questions spanning 57 subjects, MMLU provides a comprehensive assessment, revealing areas where specific LLMs excel or lag. Dive into this multifaceted benchmark and its implications in Brad Nikkel's detailed guide.

SuperGLUE: Understanding a Sticky Benchmark for LLMs

GLUE, launched in 2018, was a groundbreaking benchmark for measuring language model proficiency across varied tasks. As Large Language Models advanced, SuperGLUE emerged, elevating the evaluation standards with more challenging tasks. For AI researchers, SuperGLUE provides a rigorous tool to gauge model understanding and fairness, ensuring that AI systems not only understand language nuances but also mitigate biases—critical for both innovation and real-world application. Learn more about this important LLM metric in this article by Zian (Andy) Wang.

Parsing Fact From Fiction: Benchmarking LLM Accuracy With TruthfulQA

In the realm of LLMs, accuracy is paramount, but so is truthfulness. TruthfulQA, designed by Lin et al. in 2021, aims to benchmark the veracity of LLMs' answers. While LLMs can generate convincing responses, their propensity to produce falsehoods, especially in larger models, poses challenges. TruthfulQA's unique approach tests LLMs with questions designed to elicit imitative falsehoods, emphasizing the importance of truth over mere relevance. The benchmark reveals that while some progress has been made, LLMs still have a considerable journey ahead in consistently generating both truthful and informative answers. Dive into the nuances of this innovative benchmark in Brad Nikkel's comprehensive guide.
