AI Breakdown or: I Read the Entire 78-page Llama-2 Paper So You Don’t Have To
Jose Nicholas Francisco
First things first: We actually broke down the Llama-2 paper in the video above. In it, we turn seventy-eight pages of reading into fewer than fifteen minutes of watching. You can also check out this article that we published the day Llama-2 came out.
The link above should satisfy your appetite for under-the-hood knowledge of these shiny, new generative AI models. This article, therefore, acts as a fun supplement to the video. First, we’re going to briefly go over some key points from the Llama-2 paper for some context. Then, we’re going to share some fun facts about Llama that had to be cut from the video due to time/production constraints. Feel free to share these bite-sized fun facts at your next cocktail party or during that small window of time that you’re waiting for everyone to enter the Zoom meeting. You will be the coolest kid in the office.
Llama-2, the TL;DR
Alright, the video above goes over the architecture of Llama-2, a comparison of Llama-2 and Llama-1, and finally a comparison of Llama-2 against other, non-Meta AI models. Let’s go over these subjects one by one.
Llama-2 isn’t a single model, but rather a collection of four models. The main difference between these models is the number of parameters they contain. From smallest to largest, the Llama-2 models contain 7B, 13B, 34B, and 70B parameters. (Meta trained the 34B version but did not release it publicly.) Otherwise, nearly everything else about them—from their activation function to their normalization method—is identical; the notable exception is that the two largest variants use grouped-query attention, which we’ll get to below.
And speaking of their pretraining settings, just know the following four facts:
Llama-2, much like other large language models, is built on the classic Transformer architecture (specifically, the decoder-only, auto-regressive flavor).
To keep training stable across those 2,000,000,000,000 (yes, two trillion) tokens, Meta normalizes the activations inside each layer with a technique called RMSNorm—short for Root Mean Square Normalization.
To determine whether a given neuron should be active or not, Meta decided to use the SwiGLU activation function. (This is purely a design choice. They could’ve chosen any function they wanted.)
And finally, to ensure that Llama-2 understands that where a word sits in a sentence is as important as what the word is, it employs a mathematical method called RoPE, or Rotary Position Embedding. (RoPE isn’t new to Llama-2, by the way—Llama-1 used it too.)
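To make RMSNorm a little more concrete, here’s a toy NumPy sketch—not Meta’s implementation; the shapes and the `eps` constant are purely illustrative:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide each activation vector by its root mean square, then
    # rescale with a learned per-feature weight. Unlike LayerNorm,
    # there is no mean-subtraction step, which makes it cheaper.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

After normalization, every vector has a root mean square of roughly 1, no matter how large or small the raw activations were.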
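SwiGLU can be sketched under the same caveat (the weight matrices below are stand-ins for the learned projections inside a real feed-forward layer):

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up):
    # SwiGLU gates one linear projection of the input with the
    # SiLU of another projection, instead of applying a fixed
    # nonlinearity like ReLU to a single projection.
    return silu(x @ w_gate) * (x @ w_up)
```

The "gate" half decides, smoothly, how much of the "up" half gets through—which is exactly the "should this neuron be active?" intuition from above.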
Honestly, each of these pretraining settings could have a whole blog post dedicated to it. But for the purposes of this bite-sized article, knowing what they’re called and what they do is enough. Again, feel free to share these facts at your next cocktail party.
The only other thing to note is that there is a Chatbot version of Llama-2 that you can speak with right now. It’s called “Llama-2-Chat” and it’s available right here! What makes Llama-2-Chat different from, say, ChatGPT is a new finetuning method called Ghost Attention (GAtt) that Meta invented just for their new chatbot.
To understand Ghost Attention, all you need to know is one thing: A popular way humans interact with chatbots is to first say something along the lines of “You are an expert physicist” and then ask it about physics. Or maybe you’d tell the bot “You are Napoleon Bonaparte, and you will treat me as if I am one of your commanding officers.” Then you’d have a “conversation” with Napoleon.
Or even in the case of coding, you could say “You are an expert Python coder” and then ask the AI to code or debug with you. We actually did this, here.
Well, given this popular pattern of speaking with chatbots, Meta decided to introduce “Ghost Attention” into its finetuning repertoire. Basically, to ensure that Llama-2 indeed acts like Napoleon or an expert Python coder throughout the conversation, Meta synthetically concatenates the “act as” instruction to all of the user messages of the conversation.
Then, once the model has learned that these “Act as” instructions should carry on throughout the entire conversation, the Ghost Attention method calls for dropping this concatenation later down the line of finetuning.
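As a rough illustration of that first stage, here’s a hypothetical data-prep helper—this is not Meta’s code, and it skips the sampling and loss-masking details described in the paper:

```python
def gatt_concat(instruction, dialogue):
    # Stage one of the Ghost Attention idea: synthetically prepend the
    # "act as" instruction to every user turn, so the model learns the
    # instruction applies to the whole conversation, not just turn one.
    return [(instruction + " " + user, assistant)
            for user, assistant in dialogue]

dialogue = [
    ("What year is it?", "It is 1805, mon officier."),
    ("Plan our next campaign.", "We march at dawn."),
]
augmented = gatt_concat("Act as Napoleon Bonaparte.", dialogue)
```

In the later stage, the instruction would be dropped from these augmented turns again—by then, the model has internalized that it should persist.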
The result? Well, Meta has found that this new Ghost Attention technique helps control dialogue flow over multiple turns.
Now, given these new techniques and fancy design choices, let’s see how Llama-2 compares with other AI models.
How Llama-2 Compares
There are three major competitors to compare Llama-2 against: Llama-1, open-source models, and closed-source models. It’s worth noting that Llama-2 is open source itself. So there’s an argument to be made that Llama-2 is itself a representative of open-source efforts in the generative AI space. But let’s not get ahead of ourselves. First, let’s compare…
Llama-2 vs. Llama-1
As expected, Llama-2 is bigger and better than its predecessor. Not only in terms of architecture, but also with respect to benchmark performance.
Llama 2 comes equipped with more parameters, a longer context length, a larger training set, and Grouped Query Attention (GQA) for improved inference scalability.
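GQA is worth a quick sketch: instead of every attention head keeping its own key/value projections, small groups of query heads share one KV head, which shrinks the key/value cache at inference time. Here’s a minimal NumPy version—head counts and shapes are illustrative, and a real implementation batches this across layers and tokens:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_heads, seq, d)       -- one query set per head
    # k, v: (n_kv_heads, seq, d) -- shared among groups of query heads
    n_heads, seq, d = q.shape
    group = n_heads // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With 8 query heads sharing 2 KV heads, the KV cache is a quarter of the multi-head-attention size—the main reason GQA improves inference scalability.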
And the result of being bulkier than its predecessor? Llama-2 crushes Llama-1 in various benchmarks:
From common sense reasoning to the MMLU (read: the SAT of AI training), Llama-2 takes the cake and eats it too. The 7B parameter Llama-2 defeats the 7B parameter Llama-1 across all benchmarks. And the same goes for the 13B parameter versions of each.
And, of course, the 70B parameter Llama-2 earns the top score in every benchmark overall, defeating every version of Llama-1 and all the (smaller) versions of Llama-2.
But how does Llama-2 compare to other non-Meta open source models?
Llama-2 vs. Open Source Models
Again, as indicated by the boldface numbers at the very bottom row of the table, Llama-2-70B defeats all other models across all of these benchmarks. In fact, even Llama-2-7B defeats all other 7B open-source models (with one exception… can you find it? 👀)
Finally, let’s compare Llama-2 against closed source models, including the beloved GPT-4.
Llama-2 vs. Closed Source Models
Well… it looks like we’ve got some competition here.
Credit to Meta for openly admitting that Llama-2 is not the best model when compared to the closed-source world. In fact, there seem to be two clearly reigning champions in this table: GPT-4 and PaLM-2-L. Each of these models wins 1st place in exactly half of the benchmarks given.
Meanwhile, Llama-2 wins none of them.
But hey, that’s how the cookie crumbles. Maybe Llama-3 will give these other models a run for their money. Nevertheless, the fact that Meta openly showcases that their model is out-benchmarked really speaks to the credibility of their research team. We even Tweeted about it here.
And now that we’re all on the same page, here’s the moment you’ve all been waiting for:
Quick, Bite-sized Facts About Llama-2
Here are the dinner-party facts that we cut out from the video. Feel free to memorize them and dish out these soundbytes (I'm so sorry) whenever the opportunity arises.
Meta did *not* use any Meta user data (read: your Facebook/Instagram/Threads profiles) when training Llama-2. They mention this four times in the paper (Sections 2.1, 3.1, 4.1, and the Appendix).
Meta used another AI called HateBERT to measure how “toxic” each of Llama-2’s training documents was. About 99.8% of the documents scored 0.5 or less on the classifier’s 0-to-1 toxicity scale—with 1 being the most toxic. (See Figure 13 in the paper.)
The most common language to appear in Llama-2’s training set is English. The second-most common language? Code. (See Table 10 in the paper.)
During training, Meta calculated that 539 tons of CO2 equivalent were emitted. However, they also state that “100% of these emissions are directly offset by Meta’s sustainability program.”
When humans compared Llama-2-Chat to ChatGPT on helpfulness, Llama-2 won around ⅓ of the time, lost around ⅓ of the time, and tied around ⅓ of the time.
Llama stands for “Large Language Model Meta AI.” I’m still not sure what the M stands for.
To test the Ghost Attention method, Meta’s researchers made Llama-2-Chat impersonate Oscar Wilde. (See Figure 10 in the paper.)