Code Llama: Meta’s Answer to Generative AI Code-Writing
Jason D. Rowley
Meta, the company formerly known as Facebook, recently announced its latest LLM project, Code Llama, which, unsurprisingly, is a large language model purpose-built to write code based on natural language prompts. It’s like Llama 2’s cousin, but with a computer science degree.
It’s the latest entrant to an increasingly crowded field of code-writing generative AI models that’s currently led by the likes of OpenAI’s GPT-4 and Code Interpreter, GitHub Copilot (developed in collaboration with OpenAI), Amazon CodeWhisperer, and Google DeepMind’s AlphaCode model. What sets Code Llama apart from the herd is its relatively permissive community license (the same license under which Meta released Llama 2 to the world), which positions Code Llama to be the foundation model that undergirds a new wave of developer tools and coding assistants. This is not to discount the technical advancements Code Llama puts forth, but licensing is a major differentiating factor between open models like Code Llama and closed models like GPT-4.
There’s a lot to cover, so let’s get to it. And in case you want to follow along at home, we’re basing most of this analysis on Meta’s general-audience PR announcement, the post on its AI blog, and select highlights from the research paper discussing Code Llama in depth.
Code Llama’s Capabilities
Code Llama builds upon Meta’s Llama 2 model, which was publicly released roughly one month earlier. Llama 2 was pre-trained on a mix of internet text, books, code, and other data, but only 80 billion of the 2 trillion tokens in its training corpus (about 4% of the total) were code. Demonstrating state-of-the-art performance on many natural language tasks, Llama 2 is arguably the most capable open-weight LLM available today. Since Code Llama builds upon Llama 2 with extensive training on code, it inherits much of the same language understanding capabilities as its more general-purpose progenitor.
Before continuing, it’s important to note that it’s difficult to disentangle Code Llama’s capabilities from the model variants that Meta released. This will become evident in a later section where we discuss the model variants on offer. For the purposes of this section, however, we’ll keep the discussion fairly high-level and leave the nitty-gritty details for later.
As described in the paper, Code Llama exhibits the following capabilities: code generation, code discussion, code completion and debugging, and support for multiple programming languages. The extent to which these capabilities manifest themselves is a function of Code Llama’s additional code-focused pretraining and fine-tuning. Meta’s coding model is trained on 500 billion tokens of code-heavy data, and its variants are trained on yet more data specialized for their intended use. Whereas Llama 2 is pretty good at generating natural language outputs but is a mediocre coder at best, think of Code Llama as its inverse: It can write Python pretty well, but a talented prose writer it is not.
Code Generation
Code Llama is adept at generating code from textual prompts provided by the user. For instance, when given a prompt like "Write me a function that outputs the fibonacci sequence," Code Llama might produce a function that calculates the Fibonacci sequence for a given number 'n'. As for how Code Llama “knows” how to code, remember that the model has been trained on a vast dataset of publicly available code, which includes code snippets and discussions about code.
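For a concrete sense of what such a response could look like, here is a minimal hand-written sketch of a Fibonacci function. It is illustrative only and is not actual Code Llama output; the name `fibonacci` and the convention that the sequence starts at index 0 are assumptions made for this example.

```python
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, with fibonacci(0) == 0 and fibonacci(1) == 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    previous, current = 0, 1
    # Advance the (previous, current) pair n times; `previous` then holds F(n).
    for _ in range(n):
        previous, current = current, previous + current
    return previous


# Print the first ten numbers of the sequence.
print([fibonacci(i) for i in range(10)])
```

An iterative version is used here because it runs in linear time; a model could just as plausibly answer with a recursive or generator-based variant.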
Code Discussion
Beyond just generating code, Code Llama can engage in discussions about code: it can offer insights and explanations, and walk through the logic behind a given snippet. This capability can be particularly useful for learners and developers seeking to understand the intricacies of a code segment or looking for best practices in coding.
Code Completion and Debugging
Code Llama is equipped to assist developers in completing their code. If a developer is stuck or unsure about the next line of code, Code Llama can suggest completions. Additionally, it can help in debugging by identifying potential issues in the code and suggesting fixes.
Code Llama's 7B and 13B base models support code infilling, meaning they can generate the missing middle of a file based on the content that surrounds it. This makes them ideal as code assistants that make for more engaging collaborative partners than the rubber duck a programmer may have on her desk. For instance, if a developer starts writing a function but leaves it incomplete, Code Llama can suggest the missing parts.
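To make the idea of infilling more concrete, the sketch below splits a file at the cursor and assembles a prefix/suffix prompt for the model to fill in. The helper name `build_infill_prompt`, the sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, and the exact spacing between them are assumptions for illustration; check the documentation of whichever tokenizer or inference library you use for the format it actually expects.

```python
def build_infill_prompt(source: str, cursor: int,
                        pre: str = "<PRE>", suf: str = "<SUF>", mid: str = "<MID>") -> str:
    """Assemble an infilling prompt from the text before and after the cursor.

    The sentinel strings and their layout are assumptions, not a confirmed format.
    """
    prefix, suffix = source[:cursor], source[cursor:]
    # The model is expected to generate the text that belongs between prefix and suffix.
    return f"{pre} {prefix} {suf}{suffix} {mid}"


# A function whose body is still empty; the cursor sits after the indentation.
source = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = source.index("    ") + 4
prompt = build_infill_prompt(source, cursor)
print(prompt)
```

The point of the design is that the model sees code on both sides of the gap, which is what distinguishes infilling from ordinary left-to-right completion.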
Support for Multiple Programming Languages
Another standout capability of Code Llama is its support for a wide array of programming languages, including Python, C++, Java, PHP, C#, TypeScript, and Bash. This broad support means that developers across different domains can benefit from its capabilities.
The Various Flavors of Code Llama
Much like how Meta released a few different variants of Llama 2 (namely, the base foundation model and another version that’s fine-tuned for chat), Code Llama comes in a few shapes and sizes:
Code Llama: Code Llama is Meta’s foundation model for code generation, and comes in three model sizes: 7B, 13B, and 34B parameters. Meta notes that the 7B and 13B variants are trained to accomplish a code-infilling objective, and that these model sizes are “appropriate to be used in an IDE to complete code in the middle of a file.”
Code Llama - Python: Also available in 7B, 13B, and 34B parameter sizes, Code Llama - Python is what it says on the can: a fine-tuned version of the base Code Llama model specialized for generating and discussing code written in the Python programming language. On top of the 500B tokens of code-heavy data used to train the base Code Llama model, Meta fine-tuned Code Llama - Python on an additional 100B tokens from a Python-heavy dataset.
Code Llama - Instruct: Like the other variants, Code Llama - Instruct is available in 7B, 13B, and 34B parameter sizes. As with the Python-specialized version, Code Llama - Instruct is differentiated from the base model by its fine-tuning data, which consists of approximately 5B tokens of instruction-following data. This is drawn from a couple of sources: the existing human-written instruction dataset used by Llama 2, and a self-instruction dataset, the creation of which is discussed in the paper. Basically, Meta used Llama 2 70B to generate interview-style coding questions. For each question, they generated unit tests by prompting Code Llama 7B, generated ten Python solutions by prompting Code Llama 7B again, and finally ran the unit tests on the ten solutions. A solution that passed its tests was added, along with its question and tests, to the self-instruct dataset, which consists of around 14,000 question-test-solution triplets, according to the research paper. Overall, pretty clever.
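The self-instruct loop described for Code Llama - Instruct can be sketched roughly as follows. This is a simplified illustration, not Meta's implementation: the function names are hypothetical, the two generator callables stand in for the prompted models, and the choice to keep the first passing solution per question is an assumption. Running generated code with `exec` is only acceptable in a toy example; a real pipeline would need sandboxing and timeouts.

```python
from typing import Callable, Dict, Iterable, List, Optional


def first_passing_solution(candidates: Iterable[str], unit_tests: str) -> Optional[str]:
    """Return the first candidate whose code passes the unit tests, or None."""
    for candidate in candidates:
        namespace: Dict[str, object] = {}
        try:
            exec(candidate, namespace)   # define the candidate solution
            exec(unit_tests, namespace)  # failing assertions raise AssertionError
        except Exception:
            continue  # broken or incorrect candidate: try the next one
        return candidate
    return None


def build_self_instruct_dataset(
    questions: Iterable[str],
    generate_tests: Callable[[str], str],
    generate_solutions: Callable[[str, int], List[str]],
    num_solutions: int = 10,
) -> List[Dict[str, str]]:
    """Collect question-test-solution triplets for questions with a passing solution."""
    dataset = []
    for question in questions:
        tests = generate_tests(question)
        candidates = generate_solutions(question, num_solutions)
        solution = first_passing_solution(candidates, tests)
        if solution is not None:
            dataset.append({"question": question, "tests": tests, "solution": solution})
    return dataset


# Hypothetical stand-ins for the models that would write tests and solutions.
def fake_generate_tests(question: str) -> str:
    return "assert square(3) == 9\nassert square(-2) == 4"


def fake_generate_solutions(question: str, n: int) -> List[str]:
    return ["def square(x):\n    return x + x",   # wrong on purpose
            "def square(x):\n    return x * x"][:n]


dataset = build_self_instruct_dataset(
    ["Write a function square(x) that returns x squared."],
    fake_generate_tests,
    fake_generate_solutions,
)
print(len(dataset), "triplet(s) collected")
```

Using generated unit tests as the filter means no human has to grade the solutions, which is what makes the approach cheap to scale.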
Notably, these models offer considerably larger context windows (up to 100,000 tokens) than many proprietary and open coding models. Think of an AI coding model’s context window as roughly akin to human working memory: the bigger the context window, the more stuff it can “remember” and make use of as it generates more code. At 100,000 tokens, Code Llama’s context window is roughly 25 times larger than base Llama 2’s, and roughly 3 times larger than GPT-4’s largest (32K-token) context window.
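The context-window ratios quoted above can be checked with a few lines of arithmetic. The 100,000-token figure comes from the text; the comparison sizes (4,096 tokens for base Llama 2 and 32,768 tokens for the 32K variant of GPT-4) are assumptions supplied here for illustration.

```python
CODE_LLAMA_CONTEXT = 100_000  # upper bound quoted in the article
LLAMA_2_CONTEXT = 4_096       # assumption: base Llama 2 context length
GPT_4_32K_CONTEXT = 32_768    # assumption: GPT-4's 32K-token variant


def context_ratio(larger: int, smaller: int) -> float:
    """Return how many times bigger one context window is than another."""
    return larger / smaller


print(f"vs. Llama 2: {context_ratio(CODE_LLAMA_CONTEXT, LLAMA_2_CONTEXT):.1f}x")
print(f"vs. GPT-4 32K: {context_ratio(CODE_LLAMA_CONTEXT, GPT_4_32K_CONTEXT):.1f}x")
```

Under those assumptions the ratios come out to about 24 and about 3, in line with the rounded figures in the text.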
Below, you can see a diagram of how Meta trained each variant.
Raising the Bar for Open Coding Models
Leading the herd of open language models built to generate code, Code Llama exhibits state-of-the-art (SOTA) capabilities across a number of benchmark tests. The table below, which was created and first published by Meta on the company’s AI blog, shows how Code Llama and its variants stack up against the open StarCoder models as well as closed proprietary models like PaLM-Coder, Codex, and GPT-4.
A few patterns emerge. First, it will come as no surprise that Code Llama’s larger models (as measured by parameter count) perform better than its smaller models. The increased number of parameters allows the model to learn a wider range of linguistic nuances, relationships, and contextual information from vast amounts of training data. As a result, these models can generalize better to unseen data, providing more accurate and coherent responses.
OpenAI’s proprietary GPT-4 model is still the reigning coding champion when it comes to the HumanEval benchmark, but Code Llama, specifically its Python-focused variant, is not too far behind. Considering that even Code Llama - Python 34B is likely more than an order of magnitude smaller than unofficial estimates of GPT-4’s parameter count, that isn’t too shabby. Code Llama - Python 34B handily outperforms both the prompted StarCoder and StarCoder Python models, with Code Llama - Python’s smaller models performing roughly on par with the two StarCoder variants and the PaLM-Coder model tested here.
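For readers unfamiliar with how HumanEval is scored: results on this benchmark are commonly reported as pass@k, the probability that at least one of k sampled solutions to a problem passes its unit tests. The article does not say which k the table uses, so the sketch below simply shows the standard unbiased estimator that was introduced alongside the benchmark. The function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions of which c passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    subset of k samples contains no passing solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every subset of size k must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of which pass the tests.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark score is then the mean of this quantity over all problems, and with k = 1 it reduces to the plain fraction of samples that pass.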
Permissively Licensed for Future Developments
Code Llama is distributed under the same license as Llama 2, which is fairly permissive as model licenses go. Code Llama can be freely used for research and commercial purposes, with one notable exception: companies whose products have more than 700 million monthly active users are required to contact Meta to work out a licensing agreement.
With Code Llama out in the wild and available for most developers to use and build upon, don’t be surprised if we see cousins of Code Llama spring up in the future.