To say the least, artificial intelligence is one of the most compelling sectors of the tech economy today. Generative AI, in particular, has exploded in popularity, starting with the mainstream adoption of text-to-image models like Stable Diffusion, Midjourney, and DALL-E. But it really does feel like another area of interest—generative text-to-text models such as OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Bard, and others—has taken the world by storm.

Although generative AI for text has been “a thing” for a while—with academic roots extending all the way back to rule-based sentence-generation systems engineered in the 1950s, and pioneering conversational interfaces like the 1960s-vintage ELIZA, all weathering AI’s boom-and-bust cycles along the way—deep learning-powered language models (both large and domain-specific) are now de rigueur. (For the non-Francophiles among us, that means very much in fashion; please consult the following screenshot of Llama2.ai, a front-end playground for testing Llama 2, developed in part by venture capital firm Andreessen Horowitz.)

Screenshot of Llama2.ai on July 18, 2023.

Meta (née Facebook) just unveiled the latest version of its open source large language model family, Llama 2. (See: the Announcement page, the Technical Overview page, the Research Paper, and accompanying Model Card on Meta.com and GitHub.)

As with the release of Llama 1, pre-trained versions of Llama 2 come in a variety of sizes: 7B, 13B, and 70B parameters. (Meta also trained a 34B-parameter Llama 2 model, but is not releasing it.) Unlike Llama 1, which shipped only as a general-purpose LLM, Llama 2 also comes in a chat-tuned variant, appropriately named Llama 2-Chat, which was trained in sizes of 7B, 13B, 34B, and 70B parameters. Based on the pre-trained base models mentioned above, Llama 2-Chat is fine-tuned for chat-style interactions through supervised fine-tuning and reinforcement learning with human feedback (RLHF), but more on that in a bit.

The Nuts and Bolts of Llama 2

Meta states that Llama 2 was trained on 2 trillion tokens of data from publicly-available sources—40 percent more than its first iteration—and has a context length of 4096 tokens, twice the context length of Llama 1. If you think of context length (also known as a context window) as roughly analogous to human working memory, a bigger context length lets the model take in and act on more task instructions and relevant information passed in the user prompt. For example, in a summarization task, a user could now prompt Llama 2 to analyze and distill a block of text roughly twice as big as what Llama 1 could satisfactorily summarize.
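To make the context-window idea concrete, here’s a rough back-of-the-envelope sketch of checking whether a prompt fits. Note the assumptions: the four-characters-per-token ratio is a common rule of thumb for English text, not the model’s actual tokenizer, and the helper names are our own. Real applications should count tokens with the tokenizer that ships with the model.

```python
# Rough sketch: will this prompt fit in a model's context window?
# Assumes ~4 characters per token, a crude heuristic for English text.

LLAMA_1_CONTEXT = 2048  # tokens
LLAMA_2_CONTEXT = 4096  # tokens

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_length: int, reserve_for_output: int = 256) -> bool:
    """True if the prompt leaves headroom for the model's completion."""
    return estimate_tokens(text) + reserve_for_output <= context_length

document = "word " * 2500  # ~12,500 characters, ~3,125 estimated tokens
print(fits_in_context(document, LLAMA_1_CONTEXT))  # False: too big for Llama 1
print(fits_in_context(document, LLAMA_2_CONTEXT))  # True: fits in Llama 2
```

Under this heuristic, a document that overflows Llama 1’s window fits comfortably in Llama 2’s, which is exactly the summarization scenario described above.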

Additionally, following the principles of AI scaling, it’s understood that models trained on more data are usually more capable than models trained on comparatively less data; this is reflected in Llama 2’s increased performance on common LLM benchmarks, as can be seen in the table below. 

Llama 2 model benchmarks, via Meta.

Llama 2-70B (the largest pre-trained Llama 2 model available) roughly matches or exceeds the performance of the largest Llama 1 model, which weighed in at around 65 billion parameters. More impressive, however, is the littlest llama: Llama 2-7B significantly outperforms other comparably sized open source models—like MPT 7B and Falcon 7B—on all reported benchmarks except OpenAI’s HumanEval, in which MPT 7B eked out a notable win over Llama 2-7B. Midsize model Llama 2-13B also punches well above its weight, matching or outperforming open source models with 2 to 3x its parameter count in roughly half of the benchmarks reported above.

Llama 2-Chat

Chat is the predominant interface for large language models, at least for the general public. Software developers and prompt engineers may feel comfortable with a more close-to-the-metal user experience—calling an API from the command line, or adjusting model hyperparameters in a playground/sandbox environment, for example—but anyone who has ever sent a text message or used an online chat service understands the principles of interacting with another entity (human or otherwise) through a chat box. 

Part of what makes proprietary AI chat services like ChatGPT, Bard, Claude, and others compelling to the everyday user is that the language models backing them have been fine-tuned on the cadence and slang of conversation. Meta captured a bit of that interactional lightning in a bottle with Llama 2-Chat. 
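Some of that chat tuning is visible at the prompt level: Meta’s reference code wraps each user message in special tags ([INST], <<SYS>>) before it reaches Llama 2-Chat. Below is a minimal, illustrative sketch of that single-turn template; the helper function name is our own invention, and production code should use the tokenizer and chat template distributed with the model.

```python
# Minimal sketch of Llama 2-Chat's single-turn prompt template, following
# the instruction and system tags used in Meta's reference code. Multi-turn
# conversations chain additional [INST] ... [/INST] blocks per exchange.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a user message (and optional system prompt) in Llama 2-Chat tags."""
    if system_prompt:
        user_message = f"{B_SYS}{system_prompt}{E_SYS}{user_message}"
    return f"{B_INST} {user_message.strip()} {E_INST}"

prompt = build_prompt(
    "Summarize the Llama 2 release in one sentence.",
    system_prompt="You are a concise, helpful assistant.",
)
print(prompt)
```

The chat front-ends mentioned above do this wrapping behind the scenes, which is why end users only ever see the chat box.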

Meta fine-tuned Llama 2-Chat with methods similar to those behind other chat-tuned language models: a combination of supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and initial and iterative reward modeling. Starting with outputs of the general-purpose pretrained Llama 2 model, Meta and collaborating organizations scored those outputs on broad measures including helpfulness (e.g., the extent to which Llama 2 successfully completed a given task, such as summarization) and safety (e.g., that Llama 2 wouldn’t output insensitive or hateful content, or reveal information that could cause harm), and incorporated this human feedback into the model.
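To give a flavor of the reward-modeling step: a reward model is typically trained with a pairwise ranking loss that pushes its score for the human-preferred response above its score for the rejected one. (Llama 2’s paper scales a margin term by how strongly annotators preferred one response; the toy sketch below treats it as a fixed constant, and the scores are made-up numbers.)

```python
# Toy sketch of the pairwise ranking loss used to train an RLHF reward
# model: -log(sigmoid(r_chosen - r_rejected - margin)). The loss is small
# when the preferred response is scored well above the rejected one.
import math

def ranking_loss(score_chosen: float, score_rejected: float, margin: float = 0.0) -> float:
    """Pairwise ranking loss over a (chosen, rejected) response pair."""
    logit = score_chosen - score_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-logit)))

# A reward model that ranks the preferred answer higher is penalized far
# less than one that gets the ordering backwards.
print(ranking_loss(2.0, -1.0))  # correct ordering -> low loss (~0.05)
print(ranking_loss(-1.0, 2.0))  # wrong ordering -> high loss (~3.05)
```

Minimizing this loss over many human-labeled preference pairs is what teaches the reward model to stand in for the annotators during RLHF.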

Illustration of fine-tuning methods for Llama 2, via Meta.

The end result of all this effort (which builds on extant and newly-commissioned human-labeled data) optimizes for helpfulness, safety, and completeness of model outputs in an open, chat-tuned model which—at least by Meta’s measure—approximately meets or outperforms competing LLMs (both open-source and closed-source) across several metrics. Below you can find a few snapshots of Llama 2-Chat’s performance excerpted from the research paper that describes the model, but know that the paper itself contains much, much more thorough discussion of the steps taken during data collection, pre-training, and fine-tuning to achieve these results. (i.e. It’s worth the read.)

Llama 2 performance, via Meta.

Llama 2 safety benchmarking against competing LLMs, via Meta.

Llama 2 helpfulness benchmarking against competing LLMs, via Meta.

Licensed to Innovate

Meta didn’t state an explicit strategic goal (apart from generally advancing the state of the art in certain types of foundational large language models) in its announcement or the accompanying research paper, but at least as of right now, Llama 2 and Llama 2-Chat (in almost all their associated weights and measures) set a new standard for open source large language models. And Llama 2 is licensed for innovation.

To be clear, Llama 2 and its variants are licensed for both research and commercial use. (With one asterisk: Section 2 of the license stipulates that any company with more than 700 million monthly active users across its products or services at the time of Llama 2’s release must explicitly request a license to use Llama 2 from Meta. In other words, the eye of Sauron is staring directly at Snapchat, Telegram, TikTok, et al. and, to severely mix our LOTR metaphors here, effectively tells those players, “You shall not pass.”)

Such an explicitly permissive license kind of changes the LLM game, to the point where business analyst Ben Thompson wrote in a Stratechery daily update that:

There is a very good case that we look back on this announcement as being on par with the launch of ChatGPT in terms of impact, and arguably a more important one when it comes to the actual utility of AI. Now a near state-of-the-art LLM can be built into anything and run anywhere, not simply constrained to one specific interface governed by one specific company.

Ben Thompson, Stratechery Daily Update (July 19, 2023)

To understand why Thompson and others have made this claim, it’s important to remember that v1.0 of Llama (then capitalized as LLaMA to punch up the pun) became more-or-less open source by accident. Which is to say: someone leaked LLaMA 1’s model weights and, suddenly, any enterprising developer with sufficient appetite for the risk of terms-of-service (TOS) violations could build off a language model originally intended only for noncommercial research use cases. In other words, because LLaMA 1 wasn’t licensed for commercial use, any entrepreneur using LLaMA 1 or its derivative models (named after other ungulates like Alpaca and Vicuna, among others) in commercial applications was violating the model’s license and, from an IP compliance standpoint, operating on somewhat shaky ground. Llama 2’s license changes that, enabling basically any other startup, developer, or enthusiast to augment, extend, distill, fine-tune, or otherwise build on Llama 2.

Llama 2’s license, again, not only permits commercial use; the model and its weights are also available to virtually anyone who agrees to the license and commits to using Llama 2 and its variants responsibly. Startups, researchers, hobbyists, and LLM enthusiasts can build Llama 2 into commercial applications without much concern for licensing ‘gotchas.’ Put differently, Meta’s newest open source large language model is a straight shot across the bow of closed-source LLM service providers like OpenAI, Anthropic, and Alphabet, among others.

With the schematics floating around freely, don’t be surprised if a fleet of Llama 2-driven projects and derivative models set sail soon.
