Why Bigger Isn’t Always Better for Language Models
Zian (Andy) Wang
In the world of artificial intelligence, size has often been a key barometer of power and capability. It's a narrative that's been widely upheld, particularly with the emergence of increasingly sophisticated AI models.
Recently, George Hotz, the founder of Comma.ai, claimed that GPT-4 contains over 1.7 trillion parameters in total, roughly ten times the 175 billion parameters of GPT-3.5. Soumith Chintala, co-founder of PyTorch at Meta, later backed the claim. Although GPT-4 is reportedly not a monolithic giant with over a trillion parameters but a collection of smaller models, each with around 220 billion parameters, this still raises a question in the world of LLMs: does a larger size always equate to improved performance, or could we be mistaken?
The first iteration of the GPT model family, introduced by OpenAI in 2018, had a modest 117 million parameters. Its successor, GPT-2, pushed past the billion-parameter mark with 1.5 billion parameters, a more than tenfold increase over its predecessor. The progression didn't halt there. GPT-3 further upped the game with a whopping 175 billion parameters, more than 100 times that of GPT-2, and GPT-4 has reportedly continued the trend with over 1.7 trillion parameters, a roughly tenfold increase from GPT-3.
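The generation-over-generation growth can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts quoted above; the GPT-4 figure is the unconfirmed, reported one, and the helper name `growth_factors` is introduced here purely for illustration.

```python
# Parameter counts as quoted in the article.
# The GPT-4 number is a reported, unconfirmed figure.
PARAMS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
    "GPT-4 (reported)": 1.7e12,
}


def growth_factors(params):
    """Return (previous, current, ratio) for each consecutive pair of models."""
    names = list(params)
    return [
        (prev, cur, params[cur] / params[prev])
        for prev, cur in zip(names, names[1:])
    ]


for prev, cur, ratio in growth_factors(PARAMS):
    print(f"{prev} -> {cur}: about {ratio:.1f}x")
```

The ratios come out at roughly 13x, 117x, and just under 10x, which matches the "tenfold", "more than 100 times", and "tenfold" descriptions in the text.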
It should be clarified that OpenAI has not confirmed or denied any of the details related to GPT-4’s architecture, but given the converging opinions of many AI experts, we’ll operate on the assumption that the putatively “leaked” stats, model architecture, and training details are at least directionally correct. Under that assumption, the pattern is clear: each successive generation brings slight architectural modifications and a significant increase in parameter count, seemingly improving performance at each step.
However, when we delve into the realm of conversational chatbots, defining "performance" becomes a trickier task. In the context of training large language models, performance might be straightforwardly gauged by metrics measuring the accuracy of the next token prediction. But when everyday users interact with ChatGPT, performance assumes a more nuanced meaning. It hinges on the chatbot's ability to respond to user requests with a degree of precision and efficiency that renders the interaction helpful and satisfying.
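To make the training-time notion of performance concrete, here is a minimal sketch of two common next-token metrics: top-1 accuracy and perplexity (the exponential of the average cross-entropy). The function name, the toy probability tables, and the tokens are all invented for illustration; a real evaluation would use a model's actual output distributions over a full vocabulary.

```python
import math


def next_token_metrics(predictions, targets):
    """Score next-token predictions.

    predictions: list of dicts mapping candidate token -> predicted probability
    targets:     list of the tokens that actually came next
    """
    correct = 0
    total_nll = 0.0
    for dist, target in zip(predictions, targets):
        # Top-1 accuracy: was the most probable token the right one?
        if max(dist, key=dist.get) == target:
            correct += 1
        # Negative log-likelihood of the true token (floored to avoid log(0)).
        total_nll += -math.log(max(dist.get(target, 0.0), 1e-12))
    n = len(targets)
    mean_nll = total_nll / n
    return {
        "accuracy": correct / n,
        "cross_entropy": mean_nll,
        "perplexity": math.exp(mean_nll),
    }


# Toy example: two prediction steps, both with the true next token "mat".
toy_predictions = [
    {"mat": 0.7, "dog": 0.3},  # confident and correct
    {"mat": 0.2, "sat": 0.8},  # confident and wrong
]
toy_targets = ["mat", "mat"]
print(next_token_metrics(toy_predictions, toy_targets))
```

Metrics like these are easy to compute, but they say little about whether a chatbot's answer was actually helpful, which is the distinction the paragraph above draws.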
From the Consumer’s Perspective
In June, ChatGPT's user count fell for the first time since its release, by a whopping 9.7%, which suggests that users may be seeking alternatives that are less costly and better suited to their needs.
Sure, GPT-4 may reign supreme over other chatbots in its general abilities, but it's also costly for the average user: the subscription costs $20 per month and allows only 25 messages every 3 hours.
From the consumer's perspective, there are many practically free alternatives that offer similar, if not better, performance to GPT-4. After the initial hype, GPT-4 seems more like a cool tech demo to which OpenAI occasionally adds new features than a convenient app to have in your digital tool belt.
For example, Quillbot, a website that uses GPT models to help you rephrase and rewrite sentences, is not only free but will also preserve the original structure of your writing, which helps it avoid being flagged by various AI content detectors. Additionally, ChatPDF allows you to upload any PDF to its website, where a large language model will analyze its content and answer any questions you have about the document. There are more amazing AI tools like these that are offered at a much lower price, or even for free in the case of the two mentioned above, and that can accomplish tasks faster and arguably better than GPT-4.
In a world where working smarter produces far more valuable returns than working harder, the ultimate goal of LLMs is to improve the quality of life, not satisfy performance metrics.
Censorship Drives People to Seek Alternatives
Another major reason for GPT-4's declining user numbers is OpenAI’s censorship of the chatbot. While it's unquestionable that some degree of control is necessary to prevent misuse and ensure ethical usage, it's also a double-edged sword. Censorship tends to temper the raw potential of GPT-4, preventing it from delivering responses that lie well within its capabilities.
As a result, consumers are often left dissatisfied, as they are denied access to a breadth of information, responses, and interactions that GPT-4 could otherwise provide.
For instance, the system is programmed to avoid certain controversial topics or to provide overly cautious responses in many scenarios, leading to a perceived reduction in the model's authenticity and utility. This approach can inadvertently stifle creativity, freedom of expression, and even the quality of technical outputs in some cases, as users are not presented with the full range of possibilities that the model can generate.
To avoid censorship, people seek models that can be run locally without the intervention of a third party, which solidifies the idea that bigger isn’t always better.
From a Technical Perspective
Unfortunately, a model like GPT-4 is not only closed-source but also impossible for an average user to run locally, even if its code were available, due to its sheer size. Luckily, since the release of GPT-4, many alternatives have appeared that are much smaller and offer somewhat comparable performance to the GPT family of models.
HuggingFace has a web page dedicated to ranking the performance of open-source LLMs, all of which are feasible for an average user to get running. On the smaller end of the spectrum, there’s Stanford’s Alpaca 13B, a model small enough to run comfortably on a modern personal laptop, with performance matching that of GPT-3.5. On the larger end is the best of the best in open-source LLMs, Falcon 40B-instruct: it has only a fraction of GPT-4's parameters, yet it ranks first among all open-source LLMs and carries an Apache 2.0 license, meaning that it can be adapted for commercial use.
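A rough back-of-the-envelope calculation shows why parameter count decides what can run locally. The sketch below estimates the memory needed just to hold a model's weights at several numeric precisions. The precision table and function name are assumptions introduced for illustration, and the estimate ignores activations, caches, and runtime overhead, so real requirements are higher.

```python
# Approximate storage cost per parameter at common precisions (an assumption
# for illustration; real quantization formats add some overhead).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Estimate the memory (in GB, 1 GB = 1e9 bytes) needed for weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Model sizes are the parameter counts quoted in the article.
for name, n in [("13B model", 13e9), ("40B model", 40e9), ("175B model", 175e9)]:
    sizes = ", ".join(
        f"{p}: {weight_memory_gb(n, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

Under these assumptions, a 13-billion-parameter model needs about 26 GB at 16-bit precision but only about 6.5 GB at 4-bit, which is within reach of a laptop, while a 175-billion-parameter model needs hundreds of gigabytes at 16-bit precision.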
Furthermore, Microsoft Research unveiled an open-source 13-billion-parameter model in early June, known as "Orca." According to the published paper, this model rivals or even outperforms GPT-4 on specific tasks, while its overall performance is on par with GPT-3.5. Intriguingly, Orca was trained to mimic and internalize the reasoning process of large foundation models like GPT-4. This is particularly interesting because, just a month earlier, UC Berkeley researchers had published a paper asserting that "model imitation is a false promise," arguing that imitation extends only to style, not intelligence. Orca's results call that assertion into question.
The mounting evidence suggests that smaller language models offer advantages that outweigh those of their larger counterparts. The high cost of training and updating models, with GPT-4's training cost exceeding $100 million, seems hard to justify considering its limitations. Many users want to fine-tune LLMs for their specific needs, but even a model of GPT-3.5's size presents significant challenges for individual customization and fine-tuning. The operational complexities and resource demands of larger language models can overwhelm the average machine-learning enthusiast.
From a Performance Perspective
Regarding performance, a recent paper titled “Inverse Scaling: When Bigger Isn’t Better” showed that bigger models are not necessarily better than smaller models, even by metric-based standards.
Specifically, the paper stated that “Larger LMs are more susceptible to memorization traps–situations in which reciting memorized text causes worse task performance”. This makes some sense: larger models are better at memorizing and retaining knowledge, even when that knowledge is incorrect.
Further complicating the performance landscape is the phenomenon known as "resisting correction," where larger models struggle to reproduce user inputs when those inputs end in unconventional ways, such as a sentence containing words with unconventional spelling or a slightly modified version of a famous quote.
The paper presents many more instances of this “inverse scaling” behavior in LLMs. For example, researchers found that larger language models have trouble with “redefinition,” where the user redefines a commonly used symbol or concept and the task tests the model’s ability to reason with the new definition.
A general conclusion can be drawn: larger language models are more likely to fall into the “wrong trap” and are harder to correct back onto the right path. In a sense, they are less flexible and more confined to what the training data provided, rather than relying on logical reasoning.
This pattern shows up in more than one study. Microsoft's paper on Orca reports that ChatGPT and Orca are better than GPT-4 at the “Web of Lies” task, which tests logical reasoning in boolean expressions and natural language.
Is the Magic of Large Language Models Fading?
This is not to say that developing larger language models will be useless in the future; rather, they may serve a different purpose. In the Microsoft paper that introduced Orca, the model was not trained from scratch but with the help of GPT-4. The Orca model was trained “on the outputs generated by large foundation models (LFMs)... including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT.” This process, known as imitation learning, allows for the creation of smaller, more efficient models that retain much of the capability of their larger counterparts and even surpass them in some cases.
In this context, larger models like GPT-4 can be considered valuable resources for training smaller, more efficient models. They act as the 'teachers' in the imitation learning process, providing a rich source of information from which smaller models can learn.
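The data-collection half of this teacher-student setup can be sketched in a few lines. This is not the Orca authors' pipeline, only a minimal illustration under stated assumptions: `fake_teacher` is a hypothetical stand-in for a call to a large model, the system instruction is invented, and the fine-tuning step that would consume these records is omitted.

```python
from typing import Callable, Dict, List

# Hypothetical instruction asking the teacher to expose its reasoning.
SYSTEM_INSTRUCTION = (
    "Explain your reasoning step by step before giving the final answer."
)


def build_imitation_dataset(
    queries: List[str],
    teacher: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Collect (system, query, response) records from a teacher model.

    A smaller student model would later be fine-tuned to reproduce the
    teacher's responses, including the explanation, for each query.
    """
    records = []
    for query in queries:
        records.append(
            {
                "system": SYSTEM_INSTRUCTION,
                "query": query,
                "response": teacher(SYSTEM_INSTRUCTION, query),
            }
        )
    return records


def fake_teacher(system: str, query: str) -> str:
    """Placeholder for a large teacher model; returns a canned string."""
    return f"[teacher explanation for: {query}]"


dataset = build_imitation_dataset(
    ["What is 2 + 2?", "Why is the sky blue?"], fake_teacher
)
for record in dataset:
    print(record["query"], "->", record["response"])
```

The point of the design is that the student learns from the teacher's explanations, not just its final answers, which is what the quoted passage means by "explanation traces" and "step-by-step thought processes".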
Moreover, larger models can serve as a benchmark for research, pushing the boundaries of what is technically possible and inspiring new methodologies and techniques. Larger models aren’t always “worse” than smaller ones, either: they are almost always better at retaining and retrieving information.
However, as we've seen, significant trade-offs are associated with larger models. They are more expensive to train and deploy, more challenging to control and fine-tune, and can exhibit counterintuitive performance characteristics. They may also be less accessible to individual users and small businesses, who may not have the resources to utilize them effectively.
In contrast, smaller models can offer a more balanced mix of performance, cost, and usability. They can be run on personal devices, are easier to control and customize, and often achieve comparable performance to larger models, especially when trained using imitation learning techniques. They also align better with the ethos of democratizing AI, making powerful AI capabilities accessible to a broader range of users.
Although larger language models may seem comparatively more capable and, in a way, “smarter” than smaller language models, the evidence suggests that bigger isn't always better. As AI continues to evolve, we may see a shift towards increased development of smaller, more efficient models that offer a better balance of performance, cost, and usability.
In this regard, smaller language models may well represent the future of language models, offering a compelling blend of power and practicality that larger models struggle to match.