New large language models (LLMs) spring up on a near-weekly basis, often with accompanying bold claims about their abilities. To test these claims’ veracity, researchers carefully craft evaluation tasks (benchmarks) designed to challenge current state-of-the-art (SOTA) LLMs.

Goodhart’s law, often summed up as “When a measure becomes a target, it ceases to be a good measure,” is widely applicable across domains but particularly salient for LLMs, since machines and algorithms are now adept at recognizing and memorizing patterns at massive scale. Given Goodhart’s law, we ought to take LLM benchmarks with a grain of salt, but they’re still useful.

Because language is multifaceted, different LLM benchmarks zero in on different aspects of it, including how well LLMs answer questions, summarize text, retrieve information, analyze sentiment, and model language (among many other capabilities). Since no single benchmark evaluates every aspect of language, testing LLMs on multiple benchmarks is common practice. It also reduces the incentive to optimize an LLM for a single benchmark, which would render that benchmark useless (by Goodhart’s law).

Hugging Face, an artificial intelligence (AI) company that champions open source, hosts a handy "Open LLM Leaderboard" that does just this, automatically evaluating open LLMs submitted to its Hub on several foundational benchmarks that measure various reasoning and knowledge tasks in zero- to 25-shot settings (a sketch of how such an evaluation might be run locally follows the list below). Hugging Face’s four chosen benchmarks are:

  1. AI2 Reasoning Challenge

  2. HellaSwag

  3. Massive Multitask Language Understanding

  4. TruthfulQA
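
Under the hood, the leaderboard runs these benchmarks with EleutherAI’s lm-evaluation-harness. Here’s a rough sketch of what a comparable local run might look like; the backend name, task identifier, and results schema are assumptions that shift between harness versions, so treat it as illustrative rather than the leaderboard’s exact pipeline.

```python
# A rough sketch of an Open-LLM-Leaderboard-style evaluation using EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Task names, backend names, and
# the results dictionary's layout differ between harness versions, so adjust
# accordingly.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                # Hugging Face Transformers backend
    model_args="pretrained=tiiuae/falcon-7b",  # any open model on the Hub (example)
    tasks=["arc_challenge"],                   # ARC's Challenge Set
    num_fewshot=25,                            # the leaderboard evaluates ARC 25-shot
)

# Prints the per-task metrics (e.g., accuracy and normalized accuracy)
print(results["results"]["arc_challenge"])
```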

Over a four-part series, we’ll dig into each of these benchmarks to get a sense of what exactly Hugging Face’s Open LLM Leaderboard aims to evaluate and learn about what goes into designing challenging LLM benchmarks. First up, we’ll tackle the AI2 Reasoning Challenge (ARC).

Benchmarking Question-Answering Prowess: ARC

Perhaps the bare minimum we ask of our LLMs is accurate, informative answers to our near-endless questions (if that’s not asking too much). Current LLMs accomplish this pretty well, but it wasn’t always so.

In 2018, Clark et al. conceived the AI2 Reasoning Challenge to be a more demanding “knowledge and reasoning” test than similar question-answering (QA) benchmarks of the time, like the Stanford Question Answering Dataset (SQuAD) and the Stanford Natural Language Inference (SNLI) corpus.

With ARC, Clark et al. sought to push the field beyond existing, relatively easy QA benchmarks—often focused on merely retrieving factoids from passages—toward benchmarks that measure more important QA capacities like the reasoning, commonsense knowledge, and deep comprehension skills needed to answer difficult, complex questions. Making headway on the former entails better pattern matching, while making headway on the latter entails engineering LLMs with more human-like reading comprehension—a more useful capability for many applications.

To pull this off, Clark et al. assembled more complex questions than those found in previous datasets. Their ARC dataset contains 7,787 non-diagram, 4-way multiple-choice science questions drawn from 3rd- through 9th-grade standardized tests. These questions—derived from numerous sources and targeting various knowledge types (e.g., spatial, experimental, algebraic, process, factual, structural, definition, and purpose)—are split into an “Easy Set” and a “Challenge Set.”
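
If you’d like to poke around the questions yourself, both sets live on the Hugging Face Hub. Here’s a minimal sketch using the `datasets` library; the "ai2_arc" dataset ID and field names reflect the Hub schema at the time of writing and could change.

```python
# A minimal sketch of inspecting ARC via Hugging Face's `datasets` library
# (pip install datasets). The "ai2_arc" dataset exposes "ARC-Easy" and
# "ARC-Challenge" configurations, each with train/validation/test splits.
from datasets import load_dataset

easy = load_dataset("ai2_arc", "ARC-Easy", split="test")
challenge = load_dataset("ai2_arc", "ARC-Challenge", split="test")

example = challenge[0]
print(example["question"])         # the question stem
print(example["choices"]["text"])  # the answer options (usually four)
print(example["answerKey"])        # the correct option's label, e.g. "B"
```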

The Challenge Set contains 2,590 questions that both a retrieval-based algorithm and a word co-occurrence algorithm—the Information Retrieval (IR) Solver and the Pointwise Mutual Information (PMI) Solver—answered incorrectly. Because IR and PMI rely heavily on word co-occurrence, Clark et al. assumed that the questions both solvers faltered on were tough enough to merit more advanced QA models, earning those questions their Challenge Set designation.
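
To make the word co-occurrence idea concrete, here’s a toy, hypothetical PMI-style scorer: it rewards answer options whose words frequently co-occur with the question’s words in a reference corpus. Clark et al.’s actual PMI Solver uses sliding windows over a large corpus plus n-gram handling that we’re glossing over here, so read this as an illustration of the underlying statistic, not their implementation.

```python
import math
from collections import Counter
from itertools import product

def pmi_score(question_words, option_words, corpus_windows):
    """Average pointwise mutual information between question and option words.

    `corpus_windows` is a list of word sets, one per co-occurrence window
    (e.g., every 10-word window in a reference corpus). This is a toy stand-in
    for the co-occurrence statistics a PMI-style solver computes.
    """
    n = len(corpus_windows)
    unigram, pair = Counter(), Counter()
    for window in corpus_windows:
        for w in window:
            unigram[w] += 1
        for x, y in product(window, window):
            pair[(x, y)] += 1

    scores = []
    for q, o in product(question_words, option_words):
        p_q, p_o, p_qo = unigram[q] / n, unigram[o] / n, pair[(q, o)] / n
        if p_q and p_o and p_qo:
            scores.append(math.log(p_qo / (p_q * p_o)))  # PMI = log p(x,y)/(p(x)p(y))
    return sum(scores) / len(scores) if scores else 0.0

# A PMI-style solver picks the option with the highest average PMI against the
# question, which is why the questions it fails on tend to require more than
# surface-level co-occurrence to answer.
```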

Below is an example Easy Set and an example Challenge Set question to give you a sense of their difference:

Image Source: An example Easy Set question from ARC (https://arxiv.org/pdf/1803.05457.pdf)

Image Source: An example Challenge Set question from ARC (https://arxiv.org/pdf/1803.05457.pdf)

Clark et al. also released the ARC Corpus, which contains 14 million science-related sentences relevant to ARC’s questions, for training or fine-tuning models in the science domain. The ARC Corpus offers distributed evidence, meaning it often contains most of the information needed to answer an ARC question, but that information is spread across many sentences. This way, the corpus can indirectly help LLMs solve ARC questions while preventing them from outright memorizing answers (a feat akin to the fact retrieval that Clark et al. wanted QA models to transcend).

For scoring, each correct answer gains one point, and any k-way tie that includes the correct answer receives 1/k points. The total score is the sum of points earned divided by the number of questions.
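
As a quick illustration, here’s a minimal sketch of that scoring rule (the function name and data layout are ours, not from the ARC codebase):

```python
def arc_score(predictions, answer_keys):
    """Score ARC-style predictions.

    `predictions` is a list of sets: each set holds the option label(s) a solver
    ranked highest (more than one label means a tie). `answer_keys` holds the
    correct label for each question. A lone correct answer earns 1 point; a
    k-way tie that includes the correct answer earns 1/k; anything else earns 0.
    """
    points = 0.0
    for predicted, correct in zip(predictions, answer_keys):
        if correct in predicted:
            points += 1.0 / len(predicted)
    return points / len(answer_keys)

# Example: question 1 answered correctly, question 2 a 2-way tie that includes
# the correct answer, question 3 wrong -> (1 + 0.5 + 0) / 3 ≈ 0.5
print(arc_score([{"B"}, {"A", "C"}, {"D"}], ["B", "C", "A"]))
```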

Clark et al. tested several models (including BiDAF and DecompAttn, then state-of-the-art (SOTA) models on SQuAD and SNLI, respectively) on the Easy Set and Challenge Set questions. No model tested scored significantly above random chance (25%) on the “Challenge Set” questions, while most models tested scored around 55% to 65% on the “Easy Set,” a disparity Clark et al. intended. You can see the scores for the models that Clark et al. tested below:

Image Source: Initial baseline performances on ARC questions (note that no model performed significantly above chance on the “Challenge Set” but most performed decently on the “Easy Set”)

ARC’s Contribution

Since 2018 is like decades ago in the ML world, we should check how current LLMs fare. ARC maintains its own leaderboard here, and you can check out Hugging Face’s Open LLM Leaderboard here. Falcon-40b currently leads the open LLMs at 61.9%, and Google Brain’s ST-MoE-32B leads the non-open LLMs at 86.5% (as of July 4th, 2023).

Though LLMs have seriously upped their QA game since ARC’s release, ARC continues pushing ML engineers to develop LLMs with better question-answering capabilities. What sets ARC apart and continues to make it a solid benchmark is that it shifted the focus away from testing simple fact retrieval toward evaluating how well LLMs integrate bits of information spread across multiple sentences to answer nuanced questions. But there’s so much more we can test.

Stay tuned for our next article, where we’ll dive into HellaSwag, the second benchmark on Hugging Face’s Open LLM Leaderboard. We’ll learn how HellaSwag evaluates an entirely different but equally crucial LLM capability—common sense reasoning.
