Since the Transformer architecture debuted in the groundbreaking paper “Attention is All You Need” 7 years ago, the landscape of machine learning has been fundamentally transformed. This architecture, along with its self-attention mechanism, has crept into every corner of machine learning, from computer vision to reinforcement learning. Most modern Large Language Models are built on the Transformer architecture and its core principles. However, as new LLMs flood the field, the ways each model’s architecture has evolved and diverged from the original Transformer design are often poorly documented. Tracing the origins and reasons for each modification typically requires digging through papers.

This article aims to shed light on some of the most pivotal and influential LLM architectures, delving into the rationale behind their specific design choices. A clear understanding of the original Transformer architecture is assumed. To learn more, check out this article.


Arguably, the LLM revolution took off with the release of ChatGPT on November 30th, 2022. ChatGPT was built on the GPT-3.5 series, which shares the GPT-3 architecture, and was fine-tuned for conversation using Reinforcement Learning from Human Feedback (RLHF).

The GPT line of models has been more than game-changing: from GPT to GPT-4, nearly every iteration has ranked at or near the top of the NLP models of its time.

In the original GPT paper, the model set itself apart from its contemporaries by employing a decoder-only architecture. Other top-performing models of the time took different routes: BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder stack, while the original “Attention is All You Need” Transformer pairs an encoder with a decoder. The decoder-only design reduced both computation and model complexity. Almost every LLM after the success of GPT-3 adopted the decoder-only architecture.

Inputs and Outputs

Outputs from the GPT-3 model are generated autoregressively, similar to the original Transformer implementation. However, there is one minor difference. The input length of GPT-3 is fixed to 2048 tokens, and any input that’s shorter is padded with empty tokens until it reaches 2048 tokens.

At each prediction step, the model produces the token most likely to follow the end of the input sequence. This output token is then appended to the input sequence and re-entered into the model for the next token prediction. This process repeats until the desired length of the output sequence is reached or the model determines it has reached the natural end of the response.

The above schema applies to all modern Large Language Models.
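The loop described above can be sketched in a few lines. This is a minimal illustration, not GPT-3's actual sampling code: `generate`, `next_token_scores`, and `toy_model` are hypothetical names, greedy selection is assumed, and the "model" is a stand-in that always prefers the next integer.

```python
from typing import Callable, List


def generate(next_token_scores: Callable[[List[int]], List[float]],
             prompt: List[int], max_new_tokens: int, eos_id: int) -> List[int]:
    """Greedy autoregressive decoding: pick the top token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)              # feed the output back in as input
        if next_id == eos_id:               # the model signals a natural end
            break
    return tokens


def toy_model(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for an LLM over a 5-token vocabulary."""
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


print(generate(toy_model, prompt=[0], max_new_tokens=10, eos_id=4))
# [0, 1, 2, 3, 4]
```

Generation stops either when `max_new_tokens` is exhausted or when the end-of-sequence token is produced, matching the two stopping conditions in the text.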

Byte Pair Encoding Tokenization

Notice how GPT-3 deals with tokens, not words or characters. Language models prior to the GPT family used a wide variety of tokenization methods. One widely used method, space and punctuation tokenization, can loosely be explained as splitting input sequences into individual words and punctuation marks.

Tokenization techniques like space and punctuation typically generate an enormous vocabulary, encompassing every unique word and symbol seen in the model’s training data. Such methods not only increase computational complexity but also create the problem of handling words outside the model’s vocabulary during inference.

GPT-3’s adoption of Byte Pair Encoding (BPE) addresses these challenges by employing a more efficient and adaptable tokenization strategy. BPE strikes a balance between the extremes of character-level granularity and word-level generalization. Most modern LLMs rely on their own variation of BPE.

Here’s a simplified process of how BPE works:

  1. Initialization of Vocabulary: BPE starts with a vocabulary of individual characters. This vocabulary includes all characters appearing in the training corpus. Each unique character is treated as an initial token.

  2. Building the Vocabulary: The algorithm counts the frequency of each pair of adjacent tokens (initially characters) in the training data.

  3. Merging: It identifies the most frequent pair of consecutive tokens and merges them into a single new token. For example, if ‘h’ and ‘e’ are the most frequent pair, they are merged to form the token ‘he’.

  4. Determining the Number of Merge Operations: Steps 2 and 3 repeat for a fixed number of merge operations, a hyperparameter set based on the desired vocabulary size. In GPT-3, this meant a large number of merges, leading to a vocabulary that efficiently encodes common sequences in the training data.
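The training procedure above can be sketched as follows. This is a toy, character-level version under simplifying assumptions (a tiny word list, no byte-level handling, ties broken arbitrarily but deterministically); `learn_bpe` and `merge_pair` are illustrative names, not part of any GPT tokenizer.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(tokens: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # step 1: start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                      # step 2: count adjacent pairs
        for tokens, freq in corpus.items():
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # step 3: most frequent
        merges.append(best)
        merged_corpus = Counter()
        for tokens, freq in corpus.items():
            merged_corpus[merge_pair(tokens, best)] += freq
        corpus = merged_corpus
    return merges


merges = learn_bpe(["the", "the", "then", "hen", "he"], num_merges=2)
print(merges)  # [('h', 'e'), ('t', 'he')]
```

In this toy corpus the pair ('h', 'e') appears five times, so it is merged first; ('t', 'he') follows with three occurrences.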

Image by author

To tokenize a new piece of text during inference, the model generally takes the following steps:

  1. When a new text is processed, it is first broken down into the base tokens (characters).

  2. The BPE algorithm then applies the learned merges, starting from the most frequent and moving down the list.

  3. The text is segmented into the largest possible tokens found in the vocabulary. This means that frequent words or subwords are often encoded as single tokens, while less common sequences might be broken down into smaller parts.
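A minimal sketch of these inference steps, assuming a merge list has already been learned. The list here is hypothetical, and `bpe_encode` is an illustrative name. Merges are applied in the order they were learned, which is how "most frequent first" is interpreted here.

```python
from typing import List, Tuple


def bpe_encode(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split text into characters, then apply each learned merge in order."""
    tokens = list(text)                      # step 1: base tokens
    for a, b in merges:                      # step 2: replay merges in order
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens                            # step 3: final segmentation


merges = [("h", "e"), ("t", "he")]           # hypothetical learned merges
print(bpe_encode("then", merges))            # ['the', 'n']
print(bpe_encode("hat", merges))             # ['h', 'a', 't']
```

The frequent sequence "the" becomes a single token, while "hat", which matches no merge rule, stays split into characters.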

Note that in practice, instead of using every Unicode character as the base vocabulary, which would mean over 130,000 entries, GPT-3 uses a byte-level version whose base vocabulary contains only the 256 possible byte values. GPT-3 has a total vocabulary size of 50257.

Each token in the input sequence is one-hot encoded into a vector of size 50257, resulting in a giant, sparse 2048 by 50257 matrix.

Embedding and Positional Encoding

GPT-3 adopts an embedding and positional encoding technique similar to the original Transformer’s. Each one-hot encoded token is multiplied by a learned embedding matrix, transforming the input into a dense matrix of size 2048 by 12288.

Then, based on the absolute position of each token in the input sequence, a positional encoding matrix of the same size as the embedded sequence is produced. In the original Transformer, each position index is fed into sinusoidal functions of different frequencies, one per embedding dimension (12288 here); GPT-style models instead learn this matrix during training. Finally, the positional encoding matrix is added to the embedded matrix, which is then ready to be processed by the attention blocks.
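To make the shapes concrete, here is a small NumPy sketch with toy dimensions. The sizes, the random embedding matrix, and the function name `sinusoidal_positions` are assumptions for illustration. It shows that multiplying a one-hot matrix by the embedding matrix is just a row lookup, and it uses the sinusoidal formula from the original Transformer paper for the positional matrix.

```python
import numpy as np

vocab_size, context_len, d_model = 10, 4, 8   # toy sizes, not GPT-3's

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, d_model))   # stand-in for learned weights

token_ids = np.array([3, 1, 4, 1])
one_hot = np.eye(vocab_size)[token_ids]            # (context_len, vocab_size)
embedded = one_hot @ W_embed                       # (context_len, d_model)

# The matrix product is equivalent to looking up rows of the embedding matrix.
assert np.allclose(embedded, W_embed[token_ids])


def sinusoidal_positions(n_pos: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


x = embedded + sinusoidal_positions(context_len, d_model)
print(x.shape)  # (4, 8)
```

This equivalence is why implementations generally index into the embedding table directly rather than materializing the sparse one-hot matrix.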

Self-Attention Blocks

In the original Transformer paper, each self-attention block consists of four components, as illustrated in the diagram below:

Image from “Attention is All You Need”

The normalization referred to in the illustration is a layer normalization, not a batch normalization layer.

The GPT-3 model makes several modifications to the original transformer architecture.

  1. The layer normalization was moved to the input of each sub-block, instead of being applied after each sub-block’s output.

  2. An additional layer normalization was added at the end of the last self-attention block.

  3. GPT-3 employs a combination of dense and locally banded sparse attention patterns in its model, diverging from the approach of stacking identical self-attention blocks multiple times. The dense attention pattern adheres to the multi-head self-attention mechanism originally introduced in the “Attention is All You Need” paper. On the other hand, the “locally banded sparse attention patterns” are not explicitly detailed in the paper. However, it was said to be similar to the modified attention layers in the Sparse Transformer.

A Quick Note on Sparse Attention

Sparse attention in models like Sparse Transformers is a key innovation that addresses the limitations of standard full attention mechanisms, especially for very long sequences. This approach involves factorizing the attention matrix and strategically choosing subsets of positions to attend to, rather than the entire sequence. The subset selection is crucial for reducing computational complexity and memory usage, with the added advantage of being able to construct much deeper models.


The GPT-3 model cleverly reuses the embedding matrix learned at the input to decode the processed information. This involves multiplying the output from the self-attention blocks by the transpose of the embedding matrix (the matrix is not square, so it has no inverse). This transformation returns the matrix to its original dimensions of 2048 by 50257.

Subsequently, this output is passed through a softmax function applied to each row, that is, across the vocabulary dimension. Each resulting value represents the probability that its corresponding token comes next in the sequence. Generally, only the row for the last token is used: a token is selected from it, output, and appended to the input for another pass.
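The decoding step can be sketched with toy dimensions as follows. The sizes and random matrices are placeholders, `hidden` stands in for the output of the final block, and a transpose is used because the embedding matrix is not square. Greedy selection is assumed for simplicity.

```python
import numpy as np

vocab_size, context_len, d_model = 10, 4, 8   # toy sizes, not GPT-3's

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, d_model))   # shared with the input side
hidden = rng.normal(size=(context_len, d_model))   # stand-in for block output

# Project back to vocabulary space by reusing the embedding matrix.
logits = hidden @ W_embed.T                        # (context_len, vocab_size)


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


probs = softmax(logits)                 # each row sums to 1
next_token = int(probs[-1].argmax())    # only the last row predicts the next token
print(probs.shape, next_token)
```

Sharing the matrix this way means the output projection adds no parameters beyond the embedding table itself.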

Putting it All Together

For conciseness and consistency, we use the notations employed by most papers when describing the architecture of Large Language Models.

  • vocab_size: 50257

    • The input sequence is first tokenized by BPE and one-hot encoded into a sparse vector of length vocab_size. Every LLM’s vocab_size is the number of unique tokens understood by that model.

  • context_len: 2048

    • Every input to the GPT-3 model contains exactly 2048 tokens. Shorter sequences are padded with empty tokens to match the length. Each token is represented by the one-hot encoded vector with the entire sequence producing a 2048 by 50257 matrix.

  • d_model: 12288

    • The input matrix is multiplied by a learned 50257 x 12288 embedding matrix. The resulting matrix has dimensions 2048 by 12288.

  • d_head: 128

    • This is the dimension of the query, key, and value matrices, each with shape of (context_len, d_head). Three separate linear projections are learned to transform the embedded matrix into their respective query, key and value representations.

  • n_layers: 96, n_heads: 96

    • Each multi-head self-attention layer contains the following components in the following order:

      1. Layer normalization.

      2. Masked multi-head self-attention with n_heads heads.

      3. The input to step 1 is added to the output from 2 (residual connection).

      4. Layer normalization.

      5. A feed-forward MLP with a hidden layer of d_model * 4 neurons while retaining the input and output shape as (context_len, d_model).

      6. The input to step 4 is added to the output of 5 (residual connection).

    • The entire multi-head self-attention layer is stacked n_layers times with a final layer normalization placed at the end.

  • Decoding

    • The output produced by the multi-head self-attention layers is multiplied by the transpose of the embedding matrix. It is then processed through a softmax function. Following this, the highest value(s) (interpreted as probabilities) from each row are selected, representing the most likely next token or tokens.

  • Total parameters: 175 billion.
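As a sanity check, the listed hyperparameters roughly reproduce the 175 billion figure. This back-of-the-envelope estimate makes several assumptions: it counts only the attention and MLP weight matrices plus the embedding tables, ignores biases and layer-norm parameters, and treats the positional table as learned.

```python
vocab_size, context_len, d_model, n_layers = 50257, 2048, 12288, 96

attention = 4 * d_model * d_model       # query, key, value, and output projections
mlp = 2 * d_model * (4 * d_model)       # up-projection and down-projection
per_layer = attention + mlp             # 12 * d_model ** 2

embeddings = vocab_size * d_model + context_len * d_model  # token + position tables
total = n_layers * per_layer + embeddings

print(f"{total / 1e9:.1f}B")  # 174.6B
```

Nearly all of the parameters sit in the 96 stacked layers; the embedding tables contribute well under one percent.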


Meta’s LLaMA emerged as the next “big thing” following ChatGPT, and LLaMA 2 was subsequently announced after GPT-4. While the standalone performance of these models may seem almost laughable compared to the standards set by GPT-4, their ease of use and open-source nature have greatly benefited many independent researchers. To this day, numerous derivatives and fine-tuned versions of LLaMA continue to lead in the small to medium LLM space.

The largest model in the LLaMA 2 family, totaling 70 billion parameters, has the following default architecture parameters:

  • vocab_size: 32000

  • context_len: 4096

  • d_model: 8192

    • d_head is assumed to be d_model / n_heads, which in this case is 128

  • n_layers: 80

    • The hidden layer of the feed-forward MLP within the multi-head self-attention layers has a dimension of 8/3 * d_model instead of 4 * d_model

  • n_heads: 64

Compared to the GPT family of models, which served as the foundation for LLaMA, LLaMA 2 introduces several key modifications.

Grouped Query Attention

LLaMA 2 adopts the Grouped Query Attention mechanism in place of the vanilla self-attention to speed up inference times while maintaining the performance of standard self-attention.

Image from “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”

In GQA, attention heads are divided into groups, and each group shares one common key and value matrix. The method is inspired by the multi-query approach, where every attention head uses the same key and value matrix. GQA gets close to the best of both worlds: quality comparable to vanilla multi-head attention at a speed comparable to multi-query attention.

The number of groups for each multi-head attention layer is a user specified parameter.
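A minimal NumPy sketch of the grouping idea, using toy shapes and random inputs in place of learned projections. The function name and tensor layout are assumptions for illustration, not LLaMA 2's implementation. Each key/value head is repeated so that a whole group of query heads reads from the same keys and values.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def grouped_query_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q: (n_heads, T, d_head); k, v: (n_kv_heads, T, d_head). Causal attention."""
    n_heads, T, d_head = q.shape
    group_size = n_heads // k.shape[0]
    k = np.repeat(k, group_size, axis=0)   # share each K/V head across its group
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # hide future positions
    return softmax(scores) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))   # 8 query heads
k = rng.normal(size=(2, 5, 16))   # 2 key/value heads, so 4 query heads per group
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

Setting the number of key/value heads equal to `n_heads` recovers standard multi-head attention, and setting it to 1 recovers multi-query attention. Only the key/value tensors shrink, which is what reduces the cache size at inference time.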

SwiGLU Activation Functions

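LLaMA 2 opts for the SwiGLU activation function, introduced in “GLU Variants Improve Transformer” and adopted by PaLM, in place of the ReLU used by the original Transformer and the GELU used by the GPT family. The activation function is a combination of the swish and GLU activation functions, defined as:

$$\text{SwiGLU}(x) = \text{Swish}_\beta(xW) \otimes xV, \qquad \text{Swish}_\beta(x) = x \, \sigma(\beta x)$$

Here, $\sigma$ is the sigmoid function, $\otimes$ is element-wise multiplication, both W and V are trainable parameters, and $Swish_\beta$ is the standard swish activation function.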

SwiGLU with beta=1.5, by author

SwiGLU is more expensive to compute, as a feed-forward block built on it requires three matrix multiplications instead of two; this is why the hidden dimension is scaled down to 8/3 * d_model. However, it does boast a significant increase in performance, even when compared to ReLU in a compute-equivalent scenario where ReLU has access to a larger hidden dimension to work with.
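The sketch below shows a SwiGLU feed-forward block with toy dimensions and random weights. `W_out`, the third matrix that projects back to d_model, is a name introduced here for illustration, and biases are omitted. It also checks that a hidden size of 8/3 * d_model gives three matrices the same parameter count as the two matrices of a conventional 4 * d_model MLP.

```python
import numpy as np


def swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))


def swiglu_ffn(x, W, V, W_out, beta=1.0):
    """Gate one projection with the swish of another, then project back."""
    return (swish(x @ W, beta) * (x @ V)) @ W_out   # three matrix multiplications


d_model = 12                      # toy size, chosen to be divisible by 3
d_hidden = 8 * d_model // 3       # 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, d_hidden))
V = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))

x = rng.normal(size=(4, d_model))
y = swiglu_ffn(x, W, V, W_out)

swiglu_params = 3 * d_model * d_hidden        # W, V, W_out
standard_params = 2 * d_model * (4 * d_model) # up- and down-projection
print(y.shape, swiglu_params, standard_params)  # (4, 12) 1152 1152
```

The matching parameter counts are what makes the comparison in the text "compute-equivalent".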

Rotary Positional Encoding

LLaMA 2 implements the RoPE (Rotary Positional Encoding) module as opposed to the traditional absolute, sinusoidal positional encoding approach. Unlike absolute positional encodings, which add a unique vector to each token embedding to denote its position, RoPE encodes position by rotating feature pairs in a high-dimensional space. This rotary motion is akin to a phase shift in signal processing, where each feature pair is rotated by an angle proportional to its position, ensuring that the relative position is encoded in the resulting phase difference.

Before computing the dot product for attention scores, RoPE is applied to the query and key vectors. It rotates each pair of features within these vectors based on their respective position in the sequence, effectively mixing the positional information into the representation.

The advantage of RoPE over absolute positional encodings is that it inherently captures the relative positions of features, thus enabling the model to better generalize across different sequence lengths and maintain consistency in its understanding of relative positions. Compared to relative positional encodings, which often require additional complexity to track pairwise positional relationships, RoPE simplifies the process by using rotations, which are mathematically clean and computationally efficient operations.
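The relative-position property can be checked numerically. In this sketch the function name `rope`, the pairing of adjacent features, and the frequency base of 10000 are assumptions (the base follows the common convention). The query and key are random toy vectors.

```python
import numpy as np


def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) feature pair of x by position * theta_i.

    x: (T, d) with d even; positions: (T,) integer positions.
    """
    d = x.shape[1]
    theta = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = positions[:, None] * theta[None, :]    # (T, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin       # 2D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same offset of 3 between query and key, at very different absolute positions.
score_near = rope(q, np.array([5])) @ rope(k, np.array([2])).T
score_far = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(score_near, score_far))  # True
```

Because rotations preserve dot products up to the difference in angle, the attention score depends only on how far apart the two tokens are, not on where they sit in the sequence.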

Mistral 7B

Mistral 7B, a rather late guest to the LLM party, not only emerged as the state-of-the-art model in the 7-billion-parameter category but also surpassed the performance of many models several times its size.

Furthermore, its successor, Mixtral 8x7B, a mixture-of-experts model built on the Mistral 7B architecture, matches or outperforms GPT-3.5, the model behind the original ChatGPT, on most standard benchmarks while using significantly less computation. In the current landscape of open LLMs, almost all state-of-the-art models are based on the Mistral base model.

In contrast to the LLaMA family of models, Mistral 7B is a much smaller model.

  • d_model: 4096

  • n_layers: 32

  • d_head: 128

  • n_heads: 32

  • n_kv_heads: 8

    • Mistral adopts the Grouped Query Attention technique from the LLaMA family, with their 7B model having 8 groups of attention heads, each group of 4 heads sharing the same key and value matrices.

  • context_len: 8192

  • vocab_size: 32000

    • Both the context length and the vocab size can be further expanded through fine-tuning of the model

In addition to GQA, Mistral 7B also introduces several other additions aimed at significantly reducing the compute required in both training and inference while maintaining the performance of the model.

Sliding Window Attention (SWA)

The number of operations in vanilla self-attention grows quadratically with the sequence length, while memory usage grows linearly with the number of tokens. At inference time in particular, this translates into higher latency and lower throughput, because the cache of keys and values competes for limited memory.

To resolve this issue, GPT-3 adopted the sparse attention mechanism while Mistral went with a much simpler approach: sliding window attention.

In sliding window attention, instead of attending to every preceding token in the sequence, each token attends only to a fixed window of the W most recent tokens.

Image from Mistral 7B paper

SWA is capable of maintaining the performance of the Mistral 7B model due to the nature of the Transformer architecture. SWA leverages the ability of sequentially stacked self-attention layers to transmit information beyond its explicitly set window.

For instance, the middle diagram shows the token ‘sat’ at position i attending only to itself and the two words preceding it, illustrating SWA with a window size of 3. The hidden representation of the token is then computed using the value vectors of the words ‘The’, ‘cat’, and ‘sat’. Consequently, this single hidden state will contain information from all three words.

In the subsequent self-attention layer, the token at position i+W attends to position i and every token in between. Thus, it will then propagate the information from the token at position i, which already includes information about tokens from position i-W to i.

In a model with k attention layers, SWA can effectively transfer information across roughly k * W tokens, utilizing significantly fewer resources than conventional causal self-attention. In Mistral 7B, W is set to 4096. Theoretically, Mistral can achieve an attention span of W * n_layers = 4096 * 32 ≈ 131K tokens.
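A small sketch of the attention mask this produces, using the window size of 3 from the example. `sliding_window_mask` is an illustrative helper, and the window is assumed to include the current token. The last two lines reproduce the theoretical span arithmetic for Mistral 7B.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=3)
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
for row in rows:
    print(row)
# 10000
# 11000
# 11100
# 01110
# 00111

W, n_layers = 4096, 32
print(W * n_layers)  # 131072
```

Each row has at most `window` ones, so the cost per token stays constant as the sequence grows instead of scaling with the full sequence length.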

Rolling Buffer Cache

A “rolling buffer” is like a fixed-size window that moves over a stream of data, only keeping a portion of the data in memory at any one time. As new data comes in, the oldest data is discarded.

In addition to the memory and computation reductions from SWA, Mistral 7B uses a rolling buffer cache to cut cache memory usage a further 8-fold at a sequence length of 32K (32K / 4096 = 8).

The cache has a fixed size of W. The keys and values for the token at position i are stored in slot (i % W) of the cache. Once i reaches W, the modulo operation wraps around to the first slot and begins overwriting the oldest entries, keeping the cache strictly to the current window.

Image from Mistral 7B paper

In the above illustration, at each position i, the hidden state corresponding to the current token is colored in orange. The window size W is 4. We can see that the buffer gets filled up by the words “an” and “example”. When the next token, “of”, is processed, it overwrites the first slot in the buffer and discards any information stored about the word “This”, as it will not be attended to in the current window.
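The overwrite behavior in this example can be reproduced with a few lines of Python. `RollingBufferCache` is a hypothetical class written for illustration; plain strings stand in for the key and value tensors that a real cache would hold.

```python
class RollingBufferCache:
    """Fixed-size cache where position i is stored in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window

    def store(self, position: int, kv) -> None:
        # Once position >= window, this wraps around and overwrites old entries.
        self.slots[position % self.window] = kv


cache = RollingBufferCache(window=4)
for position, word in enumerate(["This", "is", "an", "example", "of"]):
    cache.store(position, word)

print(cache.slots)  # ['of', 'is', 'an', 'example']
```

The fifth token lands in slot 4 % 4 = 0, replacing “This”, so memory use never exceeds W entries regardless of sequence length.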

Pre-fill and Chunking

Before generating a sequence, the model “pre-fills” the cache with known information from the prompt, which means it calculates and stores the keys and values based on the given prompt. This is a preparation step so that when the model starts generating new tokens, it can refer back to this pre-computed information.

When dealing with very long sequences, storing all the keys and values in the cache would require a lot of memory despite having a rolling cache. To handle this efficiently, the sequence is broken down into smaller pieces, called “chunks”. Each chunk is processed separately.

Image from Mistral 7B paper

Using the above figure as an example, pre-fill and chunking follow the process below:

  1. Divide into Chunks: The model splits the long prompt into smaller, manageable parts called chunks. In the figure, the sequence is split into chunks like “The cat sat on”, “the mat and saw”, “the dog go to”.

  2. Process Each Chunk:

    1. When processing the third chunk, “the dog go to”, the model uses the pre-filled cache to refer back to the keys and values from the previous chunks (“The cat sat on” and “the mat and saw”).

    2. For the words in the current chunk “the dog go to”, the model calculates attention in two areas:

      1. Cache: It looks at the cached information from the previous chunks within a certain window (this is the sliding window), allowing it to consider recent context without going back to the very beginning.

      2. Current: Each word also attends to itself and any words before it in the current chunk (this is shown by the diagonal of ones). This is known as causal attention, meaning each word is predicted based on the words before it, maintaining the order of the sequence.

  3. Generate Text: Using the information from the cache and the causal attention within the current chunk, the model generates the next part of the text.
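The two attention regions for a chunk can be sketched as a single mask. This is an illustration under stated assumptions, not the paper's implementation: `chunk_attention_mask` is a hypothetical helper, chunks hold 4 tokens, the window is 4, and the window is taken to include the current token. Columns cover the cached positions still inside the window, followed by the current chunk.

```python
from typing import List, Tuple


def chunk_attention_mask(chunk_start: int, chunk_len: int,
                         window: int) -> Tuple[List[int], List[List[bool]]]:
    """Rows are query positions in the current chunk; columns are key positions."""
    first_key = max(0, chunk_start - window)          # oldest cached position kept
    key_positions = list(range(first_key, chunk_start + chunk_len))
    mask = [[(j <= i) and (i - j < window) for j in key_positions]
            for i in range(chunk_start, chunk_start + chunk_len)]
    return key_positions, mask


# Third chunk of a prompt split into 4-token chunks: positions 8 to 11.
keys, mask = chunk_attention_mask(chunk_start=8, chunk_len=4, window=4)
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
print(keys)  # [4, 5, 6, 7, 8, 9, 10, 11]
for row in rows:
    print(row)
# 01111000
# 00111100
# 00011110
# 00001111
```

The left half of each row is the sliding window over the cache, and the right half is the causal pattern within the current chunk. The first chunk (positions 0 to 3) is never touched, since it lies entirely outside the window.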

What Has Really Changed?

As mentioned previously, the current state-of-the-art open source Large Language Models are dominated by variants of Mistral and Mixtral and the occasional few LLaMA 2s. On the other hand, the pinnacle of proprietary models is still GPT-4.

The above architectures arguably cover more than 90% of modern LLMs. Looking back, the core mechanism of the model has not changed drastically. Almost all modifications aim to improve the efficiency of the model, including the choice of a decoder-only design!

For one, this demonstrates the ingenuity of the original Transformer architecture, which has survived and thrived through years of innovation and improvement in machine learning. At the same time, it underscores the importance of training methods, fine-tuning techniques, and, most crucially, the quality of data.

In the current landscape of LLMs, people are more interested in how to achieve more for less. By increasing the quality of data, training times can be cut down; by optimizing the existing state-of-the-art architecture, the same performance can be obtained at much lower cost.

Mamba: The Future of Sequence Modeling?

However, Transformer-based LLMs may not last long. The recent publication of “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” authored by Albert Gu and Tri Dao, represents a significant shift in the field of sequence modeling, particularly challenging the prevailing dominance of Transformer-based architectures. Mamba’s uniqueness lies in its structured state space model (SSM) framework, which distinguishes it from traditional Transformer models in several key aspects.

Image from Mamba paper

  1. Selective State Space Models: Mamba employs selective SSMs, a novel approach that allows the model to selectively propagate or forget information along the sequence length dimension. This selectivity enables Mamba to focus on relevant data while disregarding less pertinent information, thus addressing the inefficiencies of Transformers in handling long sequences.

  2. Hardware-Aware Design: Despite moving away from efficient convolutions, Mamba incorporates a hardware-aware parallel algorithm in its recurrent mode. This design not only ensures fast inference but also allows the model to scale linearly with sequence length, significantly improving upon the quadratic scaling of Transformers.

  3. Simplified Architecture: Mamba integrates these selective SSMs into an end-to-end neural network architecture without relying on attention or even MLP blocks. This simplification results in a lighter and faster model, capable of efficiently processing sequences up to a million tokens long.

  4. Versatility Across Modalities: Mamba has demonstrated superior performance across various modalities, including language, audio, and genomics, achieving state-of-the-art results. Its ability to handle both discrete and continuous data types effectively makes it a versatile tool for a wide range of applications.

What Promises Does the Future Hold for LLMs?

Mamba’s innovative approach addresses some of the fundamental limitations of current sequence models. Although it still focuses on the notion of “efficiency”, Mamba has completely revamped how we think about sequence modeling. It may be one of the first attention-free architectures since 2017 with potential comparable to Transformers.

On the other hand, the potential improvements that data alone can bring to attention-based Transformers are hard to overstate. As a researcher from OpenAI stated: “Data work is very much the unsung hero of large language models…careful data work can make a huge difference in performance. As an example, PaLM 62B was much better than LaMDA 137B despite using much less compute”.

This intersection of architectural ingenuity and data optimization is shaping a new frontier in AI, forging paths towards more sophisticated and efficient language understanding systems.
