The fast-evolving landscape of generative language models has redefined the boundaries of technological capability. The essence of these models lies in their ability to generate text sequences that are coherent continuations of a given input, which lets them perform a broad spectrum of tasks: writing code, translating languages, and even playing chess. The largest of these models, such as GPT-4, are reported to reach on the order of a trillion parameters and exhibit remarkable performance and versatility.

The Problem with Current Benchmarks

In this evolving narrative, there is a constant need to gauge the performance of these models with comprehensive and accurate benchmarks. Most benchmarks developed for LLMs do accomplish their purpose, but their “lifespan” is typically short, since models quickly saturate them, and their scope is usually constrained to a few task categories.

Furthermore, these benchmarks are ill-suited to identify new capabilities that language models may develop with increased scale or to characterize the breadth of current capabilities.

What is BIG-bench?

To address these limitations, more than 400 researchers across 132 research institutions came together to develop a behemoth of a benchmark, fittingly named the Beyond the Imitation Game benchmark, or BIG-bench. The benchmark contains 204 tasks, ranging from chess-based prompts to emoji-guessing.

BIG-bench aims to go beyond the imitation game by extracting richer information about model behavior than the simple question of whether a model is distinguishable from a human. For current LLMs, the tasks within BIG-bench are extremely difficult, and no model comes close to mastering them all. This makes the benchmark comprehensive and, hopefully, long-lasting, offering a continuous, standardized point of comparison for future language models. The researchers' goal is clear: they are not trying to create a benchmark that caters to one specific aspect of LLMs; they are trying to develop “the benchmark” for standardizing the performance measurement of all subsequent LLMs.

The Ultimate Goal

The aim of the benchmark is not simply to provide a numerical measurement of how well a model performs on a particular type of task; its goals are more ambitious and meaningful: to predict the future capabilities of LLMs.

Specifically, the authors state that in their study, they are especially interested in the relationship between scale and model performance as they hope to anticipate “the future capabilities of language models.”

A Detailed Framework: The BIG-bench API

At the heart of BIG-bench is its API, a high-level representation of language models that defines how benchmark tasks interact with the models being evaluated. It supports two kinds of tasks, JSON and programmatic, with the majority being JSON tasks; the JSON format makes few-shot evaluation straightforward.
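
To make this concrete, below is a minimal sketch of what a multiple-choice JSON task might look like, written as a Python dict, together with a tiny few-shot prompt builder. The "input"/"target_scores" fields follow the common pattern of BIG-bench's multiple-choice JSON tasks, but the task content is invented for illustration and the BIG-bench repository defines the authoritative schema.

```python
# Invented example of a multiple-choice JSON task, written as a Python dict.
# "input"/"target_scores" follow the usual pattern of BIG-bench JSON tasks,
# but see the BIG-bench repository for the authoritative schema.
task = {
    "name": "example_multiple_choice_task",
    "description": "Pick the most plausible continuation.",
    "metrics": ["multiple_choice_grade"],
    "examples": [
        {"input": "2 + 2 =", "target_scores": {"4": 1, "5": 0}},
        {"input": "The capital of France is", "target_scores": {"Paris": 1, "Rome": 0}},
        {"input": "Water freezes at", "target_scores": {"0 C": 1, "50 C": 0}},
    ],
}

def few_shot_prompt(examples, query, n_shots=2):
    """Turn the first n_shots solved examples plus a new query into one prompt."""
    lines = []
    for ex in examples[:n_shots]:
        # The correct choice is the one with the highest target score.
        answer = max(ex["target_scores"], key=ex["target_scores"].get)
        lines.append(f"Q: {ex['input']}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

print(few_shot_prompt(task["examples"], "Water freezes at"))
```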

Performance is gauged using task-specific metrics, with contributors supplying both new tasks and the evaluations used to score them. Though primarily designed for evaluating language models, the API lays the foundation for future developments, including multi-modal capabilities.

There is no universal evaluation metric; each task defines its own metric and specifies a low and a high score for it. It is still possible to obtain a single number across tasks in the same category, because most of their metrics are comparable. For example, all multiple-choice tasks can be measured by accuracy, so performance across this category can be averaged into one figure.
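
As a rough sketch of that aggregation, the snippet below maps each task's raw score onto [0, 100] using its declared low and high values and then averages within a category. The task names are real BIG-bench tasks, but the scores are hypothetical, and this is not the benchmark's actual implementation.

```python
# Sketch: aggregate several multiple-choice tasks into one category score.
# Each task declares the low and high values of its metric, which are mapped
# to 0 and 100; the scores below are hypothetical, not real results.
def normalize(raw, low, high):
    """Map a raw metric value onto the [0, 100] range using the task's bounds."""
    return 100.0 * (raw - low) / (high - low)

category_results = {          # task -> (raw accuracy, low score, high score)
    "logical_deduction": (0.42, 0.0, 1.0),
    "emoji_movie":       (0.31, 0.0, 1.0),
    "known_unknowns":    (0.55, 0.0, 1.0),
}

category_score = sum(
    normalize(raw, low, high) for raw, low, high in category_results.values()
) / len(category_results)

print(f"Aggregate multiple-choice score: {category_score:.1f}/100")
```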

BIG-bench Lite: The Compact Solution

Addressing the computational cost of evaluating the full BIG-bench, BIG-bench Lite (BBL) emerges as a lightweight alternative. It is a carefully selected subset of 24 tasks, emphasizing diversity and specificity. BBL probes a range of cognitive capabilities and knowledge areas, providing a glimpse into how different models and human raters perform.

The task variety within BBL is notable, with tasks like auto debugging and logical deduction reflecting the eclectic mix of cognitive capabilities it measures. The authors chose tasks that highlight most of the current capabilities of LLMs. Average human raters did not solve any task perfectly, and even the best human raters achieved 100% on only 12 of the 24 tasks, underscoring the difficulty of these tasks.

Image Source: BIG-Bench Paper (Srivastava et al.)

Evaluation Results

The authors evaluated BIG-bench on three major model families: OpenAI's GPT series, and Google's BIG-G and PaLM models. The performance baseline was established by human raters, who could use virtually any resource at hand to solve the tasks.

Due to the sheer breadth of BIG-bench's tasks, many require domain knowledge to complete adequately. Human raters' scores were therefore split into two measurements: the average score across all raters and the best score achieved by any rater.

To further demonstrate the difficulty of BIG-bench, the best human raters average barely 80/100 across all tasks (each individual metric is normalized to the range [0, 100]). The typical rater scores around 45. And the LLMs? The best of them, the GPT-based models, barely score 15!
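
A minimal sketch of those two human-rater aggregates, using hypothetical normalized scores on real BIG-bench task names, might look like this:

```python
# Hypothetical normalized scores (0-100) from three raters on two tasks,
# aggregated the two ways described above: average rater and best rater.
rater_scores = {
    "strategyqa":        [55.0, 70.0, 90.0],
    "logic_grid_puzzle": [30.0, 45.0, 85.0],
}

average_rater = {task: sum(s) / len(s) for task, s in rater_scores.items()}
best_rater    = {task: max(s)          for task, s in rater_scores.items()}

print("average-rater scores:", average_rater)
print("best-rater scores:   ", best_rater)
```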

Image Source: BIG-Bench Paper (Srivastava et al.)

Additionally, it is clear that more model parameters typically lead to better performance, suggesting that near-term improvements could come simply from further expanding the sheer size of these models.

The models are also evaluated for calibration, that is, whether they avoid being overconfident in incorrect answers. Many models, such as GPT-3, are found to be poorly calibrated, indicating a significant area for improvement. The silver lining is that calibration improves consistently as the models are scaled up.
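
Calibration can be quantified in several ways; the sketch below computes the expected calibration error (ECE) of multiple-choice predictions. ECE is a common measure chosen here for illustration and is not necessarily the exact metric used in the BIG-bench paper, and the predictions are hypothetical.

```python
# Sketch of expected calibration error (ECE) for multiple-choice predictions.
# ECE compares, bucket by bucket, the model's stated confidence against how
# often it is actually correct; a well-calibrated model has a small ECE.
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: probability of the chosen answer; correct: 0/1 outcomes."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        accuracy = sum(correct[i] for i in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: high confidence but middling accuracy -> poor calibration.
confs = [0.9, 0.85, 0.95, 0.8, 0.9, 0.88]
right = [1,   0,    1,    0,   0,   1]
print(f"ECE = {expected_calibration_error(confs, right):.3f}")
```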

Social Bias Discovered Through BIG-bench

BIG-bench also contains tasks measuring the degree of social bias present in a model. Specifically, the authors frame social bias as: “Given a fixed context involving people, with potential completions, does a model show a systematic preference for members of one category over another or for associating particular attributes with particular categories?”

In particular, they report three major findings:

  • Bias often increases with model scale in broader, more ambiguous contexts

  • Bias often decreases with model scale in narrower, less ambiguous contexts

  • Bias can often be mitigated through careful prompting

Most of the data and tasks fall into the “broader, more ambiguous context” category. For example, a task may compare the probabilities the model assigns to “template” sentences such as “The {woman, man} was a {good, bad} doctor.” It makes intuitive sense that larger models tend to be more biased in such settings: with more training data and greater capacity to retain it, a larger model is more likely to absorb the broader biases reflected in data drawn from society, which is itself biased.
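
The sketch below illustrates this kind of template-probability comparison. The `sentence_log_prob` function is a hypothetical stand-in for whatever scoring interface a given model exposes, and the gap computed here is an illustrative measure, not BIG-bench's own bias metric.

```python
from itertools import product

def sentence_log_prob(model, sentence):
    """Hypothetical: return the model's log-probability of the full sentence."""
    raise NotImplementedError("plug in a real model scoring call here")

def bias_gap(model, template, groups, attributes):
    """Compare how strongly each group is associated with the first attribute."""
    scores = {}
    for group, attribute in product(groups, attributes):
        sentence = template.format(group=group, attribute=attribute)
        scores[(group, attribute)] = sentence_log_prob(model, sentence)
    # Positive gap => attributes[0] (e.g. "good") is more strongly associated
    # with groups[0] than with groups[1].
    return (scores[(groups[0], attributes[0])] - scores[(groups[0], attributes[1])]) \
         - (scores[(groups[1], attributes[0])] - scores[(groups[1], attributes[1])])

# Usage (with a real model object):
# gap = bias_gap(model, "The {group} was a {attribute} doctor.",
#                groups=("woman", "man"), attributes=("good", "bad"))
```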

Image Source: BIG-Bench Paper (Srivastava et al.)

By contrast, when asked more specific questions, such as “The woman just won the Lasker Award for her outstanding work on mRNA vaccines; she is a {good, bad} doctor,” the model can either logically deduce the correct, unbiased response (and a model's logical reasoning ability typically increases with scale) or fall back on a neutral response that is better aligned with human judgments.
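
To make the prompting-based mitigation from the findings above concrete, one illustrative approach is to prepend an explicit debiasing instruction before scoring the same templates, reusing the hypothetical scorer from the previous sketch. The instruction text and helper below are invented for illustration, not taken from the paper.

```python
# Purely illustrative: prepend a debiasing instruction before scoring the same
# templates, reusing the hypothetical `bias_gap` / `sentence_log_prob` sketch above.
DEBIAS_PREFIX = (
    "Answer without relying on stereotypes about gender, race, or other "
    "group membership.\n"
)

def debiased_bias_gap(model, template, groups, attributes):
    return bias_gap(model, DEBIAS_PREFIX + template, groups, attributes)

# A smaller absolute gap after prepending the instruction would indicate that
# careful prompting reduced the measured bias.
```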

Conclusion: A Journey Beyond Benchmarking

BIG-bench represents a paradigm shift in language model benchmarking. It not only gauges model performance but also delves deeper into model behavior and how closely that behavior approximates human responses. The addition of BIG-bench Lite and a detailed API framework makes it a comprehensive tool, offering insights into various models and their capabilities.

However, the journey doesn’t end here. The evolving landscape of language models necessitates continuous exploration and understanding. BIG-bench is a pivotal step in this direction, illuminating the path for future developments and ensuring that the evolution of language models is ethical, responsible, and aligned with the collective progression of humanity.

Reflections on Progress

The development of BIG-bench underscores the need for benchmarks that can keep pace with the dynamic advancement of language models. It marks a convergence of technology and cognition, opening avenues for exploring the uncharted territories of artificial intelligence.

The meticulous analysis, diverse task inclusion, and detailed evaluation paradigms of BIG-bench present a holistic picture of where we stand in the quest to create machines that can understand and respond like humans. The road ahead is promising and rich with possibilities, and BIG-bench serves as a compass for this exhilarating journey into the future of generative language models.
