The Underdog Revolution: How Smaller Language Models Can Outperform LLMs
Zian (Andy) Wang
In the world of artificial intelligence (AI), bigger has often been seen as better. The advent of large language models (LLMs) like GPT-4 has sparked both awe and concern, as these giant AI models have demonstrated remarkable natural language understanding and generation capabilities.
But in the shadow of these behemoths, a quiet revolution is taking place. Recent research suggests that smaller language models, once thought to be mere stepping stones to their larger counterparts, are starting to match or even outperform LLMs in various applications. In this article, we'll explore this development, survey the techniques behind it, and discuss why it matters across industries. We'll close with an outlook on the future of smaller language models and their potential impact on society.
The pursuit of AI systems that can comprehend and produce human-like language has fueled the development of LLMs. LLMs have been shown to perform impressively in tasks such as translation, summarization, and question-answering, often surpassing the abilities of earlier, smaller models. However, these accomplishments come with significant drawbacks, including high energy consumption, large memory requirements, steep computational costs, and large water footprints. Another major concern is that the rate of GPU innovation lags behind the growth of model size, possibly leading to a point where scaling is no longer feasible. These factors have led researchers to explore the potential of smaller language models, which may be more efficient and versatile in certain applications.
Emerging Techniques & Research
Recent studies have demonstrated that smaller language models can be fine-tuned to achieve competitive or even superior performance compared to their larger counterparts in specific tasks. For example, Turc et al. (2019) found that distilling knowledge from larger pre-trained models into compact ones produced models that performed similarly while requiring a fraction of the computational resources.
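To make the idea of distillation concrete, here is a minimal sketch of a distillation loss in plain Python. It assumes the common temperature-softened formulation, in which the student is trained to match the teacher's output distribution; this is an illustration rather than the exact recipe of any cited paper, and the function names and example logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A student whose predictions already resemble the teacher's gets a small loss, while one that disagrees gets a large loss. In practice this term is usually combined with the ordinary loss on labeled data.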
Additionally, the rise of techniques like transfer learning has enabled smaller models to leverage pre-existing knowledge and adapt more effectively to specific tasks (Source). This has led to breakthroughs in applications like sentiment analysis, translation, and summarization, where smaller models have shown comparable or superior performance to LLMs.
In recent developments, smaller language models have been demonstrating remarkable performance, rivaling their larger counterparts. DeepMind's Chinchilla model, for example, outperforms GPT-3 by training fewer parameters on a larger dataset. Likewise, Meta's LLaMA models have achieved impressive results, and Stanford researchers developed Alpaca, a LLaMA-based model that, after fine-tuning on instruction-following responses generated by a GPT-3.5 model, behaved similarly to it in the researchers' preliminary evaluation. Stability AI's StableLM series also includes small models, the smallest with only 3 billion parameters. Numerous other GPT- and LLaMA-based models exist, with significantly fewer parameters and reported results comparable to GPT-3.5. A comprehensive list can be found in this Medium article.
Stanford's write-up on Alpaca 7B explains how, at a substantially reduced cost, Alpaca's behavior parallels that of OpenAI's text-davinci-003, a GPT-3.5 model. Two crucial factors identified are a robust pre-trained language model and high-quality instruction-following data. To acquire the latter, Stanford followed the Self-Instruct paper's procedure for training instruction-following models, using large language models (LLMs) to generate data automatically. This suggests that instruction-following fine-tuning may play a critical role in training smaller conversational models, potentially an even more important one than the model architecture itself.
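The trade-off between parameters and data can be made concrete with some rough arithmetic. The sketch below uses the publicly reported sizes of GPT-3 (175B parameters, about 300B training tokens) and Chinchilla (70B parameters, about 1.4T training tokens). It also uses the common rule of thumb of roughly 6 FLOPs per parameter per training token, which is an approximation and not a figure from this article.

```python
# Publicly reported model and dataset sizes (approximate).
MODELS = {
    "GPT-3": {"params": 175e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9, "tokens": 1.4e12},
}

def tokens_per_parameter(name):
    """Training tokens seen per model parameter."""
    model = MODELS[name]
    return model["tokens"] / model["params"]

def approx_training_flops(name):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    model = MODELS[name]
    return 6 * model["params"] * model["tokens"]

for name in MODELS:
    print(
        f"{name}: {tokens_per_parameter(name):.1f} tokens per parameter, "
        f"~{approx_training_flops(name):.2e} training FLOPs"
    )
```

Chinchilla sees roughly 20 tokens per parameter, against fewer than 2 for GPT-3. Under this approximation its training run is not cheaper. The savings come afterwards, since a model with fewer parameters costs less to fine-tune and to serve.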
While these advancements make LLMs more accessible, the relationship between size and performance remains unclear. OpenAI's proposed LLM scaling law demonstrates a performance increase with model size, yet we are witnessing counterexamples achieved through improved training techniques or alternative architectures. Promising strategies, such as the "mixture of experts" approach and exploiting sparsity in LLMs, could enhance efficiency. However, current hardware limitations and challenges in applying techniques like quantization and knowledge distillation to larger models pose obstacles. Despite these hurdles, the pursuit of smaller, more efficient models is essential, as both scaling and enhancing efficiency at smaller sizes could play vital roles in the future of AI development.
Two recent techniques proposed by Google, UL2R and Flan, have shown significant potential for improving language models' performance without the need for massive computational resources. UL2R takes its name from UL2 ("Unifying Language Learning Paradigms"), whose mixture-of-denoisers objective it reuses. It is an additional stage of continued pre-training that enhances performance across a range of tasks by introducing that objective. This has resulted in models that show emergent abilities on tasks such as Navigate and Snarks from BIG-Bench without increasing the model's scale (Source).
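A denoising objective hides parts of the input and asks the model to reconstruct them, and a mixture of denoisers varies how much is hidden and where. The sketch below shows only the data-preparation step, under stated assumptions: the spans are chosen by hand rather than sampled, and the sentinel naming (`<extra_id_0>`, `<extra_id_1>`) follows the T5 tokenizer convention for illustration. It is not Google's implementation.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, length) span with a sentinel token.

    Returns (inputs, targets): the inputs keep the visible tokens, and the
    targets list each sentinel followed by the tokens it hides.
    Spans are assumed not to overlap.
    """
    inputs, targets = [], []
    cursor = 0
    for index, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{index}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    return inputs, targets

tokens = "smaller models can match larger ones on many tasks".split()

# Short, scattered spans: classic span corruption.
short_inputs, short_targets = corrupt_spans(tokens, [(1, 1), (5, 2)])

# One long span at the end: close to prefix language modeling.
prefix_inputs, prefix_targets = corrupt_spans(tokens, [(4, 5)])

print(short_inputs, short_targets)
print(prefix_inputs, prefix_targets)
```

Mixing examples built with different span settings exposes a single model to both infilling-style and continuation-style prediction.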
Flan, on the other hand, is an approach that involves fine-tuning language models on over 1.8K tasks phrased as instructions. This method not only enhances performance but also improves the model's usability in response to user inputs without the need for prompt engineering. When combined, UL2R and Flan can result in a model like Flan-U-PaLM 540B, which significantly outperforms unadapted PaLM 540B models (Source). These techniques showcase the potential for smaller models to achieve impressive performance gains without the large-scale investment typically associated with LLMs.
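Phrasing tasks as instructions mostly comes down to wrapping existing labeled examples in natural-language templates. The sketch below illustrates that step. The templates, task names, and the `to_instruction_example` helper are hypothetical and are not taken from the Flan collection.

```python
# Hypothetical instruction templates, one per task.
TEMPLATES = {
    "sentiment": (
        "Is the sentiment of the following review positive or negative?"
        "\n\n{text}"
    ),
    "summarize": "Summarize the following passage in one sentence.\n\n{text}",
}

def to_instruction_example(task, text, answer):
    """Turn a raw (text, answer) pair into an instruction-style training pair."""
    prompt = TEMPLATES[task].format(text=text)
    return {"input": prompt, "target": answer}

example = to_instruction_example(
    "sentiment", "The battery died after a day.", "negative"
)
print(example["input"])
print(example["target"])
```

Because the task description is part of the input, a model fine-tuned on many such examples can be pointed at a new task simply by writing a new instruction.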
Smaller language models may not generalize as well as larger ones, since they have less capacity to “store” information. However, a paper from Yao Fu et al. shows that smaller language models can demonstrate superior reasoning abilities when trained and fine-tuned for a specific task. The authors trained the FlanT5 series of models on difficult mathematical word problems. These models could be considered tiny by modern standards: the largest variant has only 11B parameters and the smallest as few as 250M. During training, the authors “distilled” knowledge from the larger code-davinci-002 model by having it generate correct “chain of thought” solutions to the problems in the dataset.
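One simple way to collect correct chain-of-thought solutions from a teacher is to sample several and keep only those whose final answer matches the known answer. The sketch below assumes that filtering rule and a naive "last number in the text" answer extractor. Both are illustrative choices rather than the paper's exact procedure, and the sample solutions are invented.

```python
import re

def final_answer(solution):
    """Return the last number mentioned in a solution, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return float(numbers[-1]) if numbers else None

def keep_correct(solutions, gold_answer):
    """Keep teacher solutions whose final answer matches the gold answer."""
    kept = []
    for solution in solutions:
        answer = final_answer(solution)
        if answer is not None and abs(answer - gold_answer) < 1e-9:
            kept.append(solution)
    return kept

# Problem: Tom has 3 boxes of 4 apples and eats 2. How many are left?
teacher_samples = [
    "3 boxes of 4 apples is 3 * 4 = 12. After eating 2, 12 - 2 = 10. The answer is 10.",
    "3 + 4 = 7, minus 2 is 5. The answer is 5.",
    "There are 12 apples; eating 2 leaves 10.",
]
training_solutions = keep_correct(teacher_samples, gold_answer=10)
print(len(training_solutions))
```

The surviving solutions become fine-tuning targets for the smaller model, so it learns the intermediate reasoning steps and not just the final number.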
Efficient data utilization is a recurring theme in the realm of small language models. In the paper "It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners", Timo Schick and Hinrich Schütze reformulate inputs as cloze questions that contain a task description and train on a small number of labeled examples, with unlabeled data providing further gains. Strategies like these suggest that researchers can further unlock the potential of small language models in various applications.
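Cloze-style reformulation, the core idea of that paper, turns a classification task into a fill-in-the-blank question that a masked language model can already answer. The sketch below shows the two pieces involved: a pattern that builds the cloze question and a verbalizer that maps labels to words. The pattern wording, the verbalizer, and the probabilities are hypothetical, and a real system would obtain the probabilities from a masked language model.

```python
# Verbalizer: maps each label to a single word the model could predict.
VERBALIZER = {"positive": "great", "negative": "terrible"}

def to_cloze(review):
    """Pattern: rewrite a review as a fill-in-the-blank question."""
    return f"{review} All in all, it was [MASK]."

def classify(mask_token_probs):
    """Pick the label whose verbalizer word is most likely at the mask."""
    scores = {
        label: mask_token_probs.get(word, 0.0)
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

prompt = to_cloze("The pizza arrived cold and late.")
# Hypothetical masked-LM probabilities for the [MASK] position.
predicted = classify({"great": 0.04, "terrible": 0.61, "fine": 0.10})
print(prompt)
print(predicted)
```

Because the question resembles text seen during pre-training, even a small model starts with a useful prior and needs only a handful of labeled examples to adapt.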
Why Does it Matter?
Smaller language models have a number of advantages, starting with faster training and inference. These efficiency gains carry over to secondary aspects of their use, such as comparatively smaller carbon and water footprints. In today’s AI industry, the focus is shifting towards making AI more accessible and ensuring that it performs well on a variety of devices, including smaller, resource-constrained hardware such as cellphones. This trend highlights the growing importance of exploring the potential of smaller language models to strike a balance between performance and resource consumption.
Smaller models cater to the industry's demand for AI solutions that are efficient, versatile, and compatible with a broad range of devices. By achieving high performance without relying on resource-heavy infrastructure, smaller models pave the way for AI to become more deeply integrated into everyday technology, expanding its reach and impact.
A key development in this area is federated learning, a decentralized approach to AI model training that prioritizes data privacy and security. By allowing data to remain on local devices while learning from it, federated learning reduces the need for large-scale data centralization and enables AI applications to be more responsive, adaptable, and accessible across various industries. The compatibility of smaller language models with federated learning environments further emphasizes their importance in shaping the future of AI.
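The aggregation step at the heart of federated learning can be sketched in a few lines. The example below assumes the widely used federated averaging scheme, in which each device trains locally and sends back only its model weights, and the server averages them weighted by how many examples each device used. The flat weight lists and client sizes are toy values.

```python
def federated_average(client_updates):
    """Average client weights, weighted by each client's number of examples.

    client_updates is a list of (weights, num_examples) pairs, where weights
    is a flat list of floats of the same length for every client.
    """
    total_examples = sum(count for _, count in client_updates)
    dimension = len(client_updates[0][0])
    return [
        sum(weights[i] * count for weights, count in client_updates)
        / total_examples
        for i in range(dimension)
    ]

# Two hypothetical devices: only their weights leave the device, never the data.
updates = [
    ([1.0, 2.0], 10),  # trained on 10 local examples
    ([3.0, 4.0], 30),  # trained on 30 local examples
]
global_weights = federated_average(updates)
print(global_weights)
```

The communication cost of each round grows with the number of weights sent, which is one reason smaller models fit this setting well.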
But who knows? Maybe more efficient hardware devices will be developed before smaller language models can surpass the performance and outweigh the benefits of LLMs.