Visualizing and Explaining Transformer Models From the Ground Up
Zian (Andy) Wang
Since the introduction of the Transformer architecture by Ashish Vaswani et al. in 2017, it has become the de facto standard for large-scale natural language processing tasks. The pioneering GPT model of 2018, the now impressive ChatGPT, and even text-to-image synthesis models such as Stable Diffusion are all based on, or inspired by, the Transformer.
Given its significance as a breakthrough in the field of NLP and machine learning as a whole, it is surprising that the original paper which introduced the Transformer offers only a limited explanation of its architecture and, most critically, how the metaphorical gears of a Transformer model turn and why they work so well. This article aims to fill that gap by offering a comprehensive and illuminating illustration of the Transformer through an intuitive, visual-first approach and “decode” all the fuss around its incredible comprehension of language.
Along with the Transformer architecture, the authors of the paper brought a revolutionary approach to sequence-based modeling: the self-attention mechanism. Previously, language modeling tasks were dominated by recurrence-based techniques such as RNNs, GRUs, and LSTMs, which struggled to retain information from earlier time steps in longer sequences, resulting in poor performance on tasks involving long-range dependencies. The Transformer architecture sidesteps these limitations by processing the entire input sequence simultaneously, so every position can draw directly on every other position no matter how far apart they are, and computation can be parallelized across the sequence. Self-attention itself extends the attention mechanisms introduced before the Transformer, which allow a model to focus on certain parts of the input, giving more “attention” to one part than another.
In the following section, we will delve into the fundamental methodology underlying the Transformer model and most sequence-to-sequence modeling approaches: the encoder and the decoder. This will serve as a springboard for dissecting the Transformer model architecture and gaining an in-depth understanding of its inner workings.
The Encoder-Decoder Concept
The Transformer model relies on the interactions between two separate, smaller models: the encoder and the decoder. The encoder receives the input, while the decoder outputs the prediction. Implementing an encoder and a decoder to process sequence-to-sequence data has been relatively standard practice since 2014, first applied to recurrence-based models, then later used by the Transformer.
Before the existence of the encoder-decoder architecture, predictions for sequence-based problems were based solely on accumulated knowledge over the entire input sequence “squeezed” into one sample's worth of representation. Although architectures such as LSTM and GRU attempted to mitigate the issue of long-range dependence, they did not completely resolve the underlying problem of RNNs, which was the inability to carry the information of long sequences all the way through to the prediction.
In an encoder-decoder schema, the encoder takes in the entirety of the input sequence. It transforms it into a vectorized representation that contains accumulated knowledge of the input sequence at every time step. Then the entire vectorized representation of the input sequence is fed into the decoder, which “decodes” the information collected by the encoder and attempts to make a valid prediction.
In the context of the Transformer model, one can think of the encoder as a researcher developing initial ideas and discoveries and the decoder as a programmer implementing a solution based on the researcher’s findings. The encoder, like the researcher, picks out the crucial aspects of the input, such as sentence structure, syntax, and semantics, and passes the insights learned from the inputs on to the decoder. Like a programmer, the decoder will “program” a practical solution based on those insights.
The decoder receives a separate sequence of inputs before decoding the information provided by the encoder. The decoder bases its predictions not only on the original input sequence but also on its previous outputs. It is auto-regressive in nature. Every time the decoder makes a prediction, it produces a single word and then concatenates it with words predicted by the model in previous time steps. This prediction loop continues until the maximum number of outputs is reached—usually a hyperparameter specified by the user—or the model deduces that it has reached the natural end of a sentence or phrase.
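The prediction loop described above can be sketched in a few lines of Python. This is only an illustration: `toy_model` is a hypothetical stand-in for a trained encoder-decoder (it replays a canned answer), and the token names `<sos>` and `<eos>` are assumptions rather than anything fixed by the paper.

```python
def greedy_decode(next_token, source, max_len=10, sos="<sos>", eos="<eos>"):
    """Generate one token at a time until <eos> or max_len is reached."""
    output = [sos]
    for _ in range(max_len):
        # The model sees the source sequence and everything it has produced so far.
        token = next_token(source, output)
        output.append(token)
        if token == eos:
            break
    return output[1:]  # drop the start token


def toy_model(source, generated):
    """Hypothetical stand-in for a trained model: replays a fixed answer."""
    canned = ["les", "fruits", "sont", "bons", "<eos>"]
    return canned[len(generated) - 1]


result = greedy_decode(toy_model, ["fruits", "are", "good"])
truncated = greedy_decode(toy_model, ["fruits", "are", "good"], max_len=2)
```

Here `result` ends when the end-of-sentence token appears, while `truncated` stops early because the user-specified limit is reached first.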
Before diving into the nuts and bolts of the Transformer model, it's necessary to have a broad understanding of how information flows through the various components of the model and in what ways these components are essential to understanding the fundamental structure of language, all the while (hopefully) producing a more-or-less correct output. The image below is extracted from the original paper describing the Transformer model architecture.
Let's see how these components fit together.
The encoder of a Transformer is responsible for turning an input sequence into a machine-readable representation. This representation captures the similarity between words and their relative position in the sequence. The input sequence is first passed through an input embedding and a positional encoding layer. These operations transform the input sequence into a form suitable for processing by the encoder layer.
The encoder layer is the encoder's core, where the bulk of the "magic" happens. In the original paper, it's suggested to stack the encoder layer six times. However, this can be adjusted depending on the situation. Each encoder layer consists of a single multi-head attention block followed by a feed-forward neural network, with a residual connection and layer normalization after each of the two.
The multi-head attention block is capable of discovering complex relationships between words and determining how each word contributes to the meaning of the input sequence. This allows the encoder to gain a deep understanding of the input sequence in a way that is similar to how humans analyze language. After the attention block, a feed-forward network further transforms the representation in preparation for the next encoder layer. Once the encoding process is complete, the accumulated knowledge gained by the encoder (the output of the last encoder layer) is passed on to the decoder, which uses it to generate the final output sequence.
The decoder receives the accumulated knowledge produced by the encoder, as depicted in our figure below.
Typically, for the very first "prediction cycle," the decoder is fed with a "start of sentence" token since there are no "previous" outputs. The decoder layer is similar to the encoder layer because it utilizes the same concept to analyze the combined information from the encoder and previous predictions. The decoder first derives insights solely based on its own previous predictions. Then, this information is combined with the encoder output to be further processed and analyzed. Finally, the prediction for the next time step is output as a probability distribution over the vocabulary, indicating how likely each word is to come next in the sequence of outputs.
The objective of word embedding is to convert the given input sequence into machine-readable representations. One way to do this would be to use one-hot encoding, where each word is represented by a large, sparse vector with a single non-zero value at the word's corresponding index in the vocabulary. However, this approach is neither effective nor elegant, as it results in huge vectors, more than 99% of which are zero values, which can negatively affect model performance due to the curse of dimensionality. Additionally, for such a huge vector, the information that it is able to convey (a unique identifier for a word) is astonishingly small.
Word embeddings apply a further transformation learned from large corpora to the sparse, one-hot encoded vectors, producing dense, relatively low-dimensional representations of words while considering contextual information. For instance, in the sentence “the cat is brown and furry while the refrigerator is a lifeless, silver-colored machine,” the word embedding for “cat” would be far away from the word embedding of “refrigerator” since these two words convey completely different meanings. On the other hand, the words “brown” and “silver” would have similar embeddings as both words are used to describe colors. Typically, cosine similarity is used to measure how close two word embedding vectors are.
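As a rough illustration of comparing embeddings with cosine similarity, the sketch below uses made-up three-dimensional vectors. The numbers are invented for this example and do not come from any trained model; real embeddings have hundreds of dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings, chosen by hand for illustration only.
emb = {
    "brown": np.array([0.9, 0.1, 0.0]),
    "silver": np.array([0.8, 0.2, 0.1]),
    "cat": np.array([0.0, 0.9, 0.3]),
    "refrigerator": np.array([0.1, -0.2, 0.9]),
}

color_sim = cosine_similarity(emb["brown"], emb["silver"])
object_sim = cosine_similarity(emb["cat"], emb["refrigerator"])
```

With these hand-picked vectors, `color_sim` comes out much higher than `object_sim`, mirroring the intuition that the two color words sit close together while "cat" and "refrigerator" do not.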
One can view the task of word embedding as a pre-training technique for language models. Without using word embeddings, the model would have to learn contextual information about each word during training. Word embedding accomplishes this beforehand, allowing more information to be fed into the model.
Various methods and pre-trained models have been proposed to generate densely packed word embeddings effectively. However, the Transformer uses a word embedding that is attached to the model and trained from scratch. Using a word embedding without pre-initialized parameters allows the model to learn the embedding representation with respect to the context of the input data and the overall model structure. The word embedding is the very first component in the Transformer; the one-hot-encoded input is multiplied with a weight matrix trained through the backpropagation of the entire Transformer model.
The weight matrix has a shape (vocabulary size, dimension of embedding). For the sake of consistency, we will use the terminology from the original paper on the Transformer and refer to the dimension of the word embedding as “d_model”. The value of d_model is set to 512 by the authors of the Transformer paper.
Word embeddings can be viewed as a lookup table that maps input, one-hot-encoded vectors to a lower dimensional space of shape (length of sequence, d_model). This dimensionality reduction is achieved through the multiplication of the input vectors and a weight matrix. These embeddings are able to capture dependencies between words and provide contextual information. A common illustration of this is to subtract the embedding vector of "man" from "king," add the embedding vector of "woman," and obtain a vector similar to the embedding vector of "queen," implying the inherent relation between these words.
Although word embedding may be able to inject useful insights into the input sequence, it is unable to provide any positional clues to the model of where each word is relative to the input sequence. This is where positional encoding comes into play.
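The "lookup table" view can be checked directly: multiplying a one-hot vector by the weight matrix selects a single row of that matrix. The sketch below uses toy sizes and a random matrix as a stand-in for the learned weights; the names `W_embed` and `token_ids` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4  # toy sizes; the paper uses d_model = 512

# Random stand-in for the embedding weights (learned by backpropagation in a real model).
W_embed = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 7])          # a 3-word input sequence as vocabulary indices
one_hot = np.eye(vocab_size)[token_ids]  # shape (3, vocab_size)

via_matmul = one_hot @ W_embed  # shape (3, d_model): the multiplication described above
via_lookup = W_embed[token_ids]  # same rows, obtained by plain indexing
```

Both routes give the same (length of sequence, d_model) matrix, which is why implementations usually index the table instead of materializing one-hot vectors.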
As a refresher, the Transformer architecture discards the use of recurrence-based networks, instead relying on self-attention mechanisms to process input sequences. While this allows for faster training and improved handling of long-range dependencies, it does not inherently provide any information about the relative positions of words in the input.
For instance, the sentences “I pet my dog” and “The dog pet me” convey distinct meanings depending on the position of “dog.” Despite this, the word embedding for “dog” in both cases is identical. Recurrence-based models process information in order, so the position of each word is already implied, but the Transformer requires additional information to differentiate the two “dogs” in these sentences.
To address this issue, positional encoding is employed, which adds a unique vector of length d_model to each word embedding vector. This positional encoding vector is determined by the position of the word within the input sequence. This allows the model to extract the relative positions of words in the input and incorporate this information in its processing.
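For reference, the sinusoidal positional encoding defined in the original paper can be written as follows, using the same symbols (pos, i, d_model) as the surrounding text:

```latex
\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]
```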
In the equation above, “pos” represents the position of the word in the input sequence while “i” indexes the sine-cosine pairs within the word embedding. Both positional encoding functions are applied for each “i”, producing two values (one for the even dimension 2i and one for the odd dimension 2i+1). Thus, to output vectors of length d_model, the value of “i” ranges from 0 to d_model/2 - 1. For example, in the sentence “How are you?” the word embedding for “you” would have a pos value of 3 while its “i” value ranges from 0 to 255 since d_model, the embedding dimension, is 512.
Sinusoidal position encoding carries several advantages:
It has been stated in the original paper that the Transformer is able to “extrapolate to sequence lengths longer than the ones encountered during training” with the positional encoding function.
The relative position between words can be extrapolated since, for words in positions close to each other, their positional encoding vector will be similar as well.
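A minimal NumPy sketch of the sinusoidal encoding, assuming positions are counted from 0 and that sine fills the even dimensions and cosine the odd ones, as in the original paper. The function name `positional_encoding` is chosen for this example.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position vectors."""
    pos = np.arange(seq_len)[:, None]     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return pe


pe = positional_encoding(seq_len=50, d_model=512)

# Dot products between position vectors: neighbours versus distant positions.
near = pe[10] @ pe[11]
far = pe[10] @ pe[40]
```

In this sketch `near` is larger than `far`, which reflects the second advantage listed above: positions close to each other receive similar encoding vectors. The resulting matrix is simply added to the word embedding matrix of the same shape.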
After the positional encoding component, its output, with shape (length of sequence, d_model), will be passed into the first encoder layer consisting of the self-attention block and the feed-forward neural network.
Note that the same preprocessing scheme (word embedding and positional encoding) is used for the input sequence to the decoder as well, which we will discuss later.
Intuition Behind Self-Attention
Before getting into the technical details of self-attention, it’s critical to have a general intuition of the whys and the hows behind the algorithm.
Attention, or global attention, is one of the most important contributing factors to successful natural language processing models. The basic idea behind attention is that the model can focus on certain input words more than others, depending on their relevance to the context. In other words, the model assigns varying degrees of "attention" to each input word, with more important words receiving more attention.
For example, consider the following sentences: "My dog has black, thick fur as well as an active personality. I also have a cat with brown fur. What is the breed of my dog?" Without attention, the model would assign equal importance to the information about the cat and the dog, which could lead to incorrect or misleading answers. However, with attention, a well-trained language model would assign less attention to the phrase "brown fur" because it is irrelevant to the question being asked. This ability to selectively focus on important words is a crucial component of language learning and helps improve the performance of natural language processing models.
The Transformer employs a self-attention mechanism called "scaled dot product attention." While global attention considers the importance of each word relative to the entire input sequence, self-attention deciphers dependencies between words within the sequence. For example, in the sentence "I went to the store and bought tons of fruits along with some furniture. They tasted amazing," a human reader would infer that "they" refers to the fruits, not the furniture. A model using global attention might assign higher attention values to "fruits," "furniture," and "amazing" without understanding the relationship between these words. In contrast, self-attention, which compares every word in the input sequence to every other word, would be able to discover the intended meaning of "they."
A helpful way to understand how self-attention assigns attention values or weights is to construct a correlation matrix by comparing each element of the input sequence with every other element in the sequence.
Self-attention is a mechanism that is analogous to information retrieval systems like Google search. In self-attention, there are three components: the query, the key, and the value. In an information retrieval system, the query is like the search query you enter into the search bar, the keys are like the titles of websites in the database, and the values are like the websites themselves. When you enter a search query, the system will compare it to the keys in the database and rank the values based on how similar the keys are to your query. While the actual Google search engine is much more complex than this, this simple example illustrates the roles of queries, keys, and values in self-attention.
The ideas of queries, keys, and values, as well as their interactions, are imitated by self-attention. Each word's vectorized representation is projected into three smaller-sized vectors that represent the word's key, value, and query.
The key vector of every word in the input sequence, including the word itself, is compared with each word's query to determine the best match. As a result, the input sequence generates the attention matrix that is seen above, with each value in the matrix representing an "attention weight" for a particular combination of words. For each word, its attention weights against every word in the input sequence (that is, the attention values in its entire row of the attention matrix) are used as weights to calculate a weighted sum of the corresponding value vectors. This is done for every word, and the resulting vectors become the rows of a new matrix. This essentially recreates the word embedding representation of the input sequence but with attention information.
Math Behind Self Attention
Recall that the output from positional encoding is in the shape of (length of sequence, d_model) where d_model can be interpreted as the embedding dimension. This matrix is the input to the encoder layer. For encoder layers that follow the first one, their input will be the output from the previous encoder layer.
The input matrix is linearly projected into three smaller matrices representing the queries, keys, and values through three separate weight matrices, referred to as WQ, WK, and WV, respectively. Each weight matrix has the shape (d_model, 64), so the resulting query, key, and value matrices have the shape (length of sequence, 64). The dimension 64 is not arbitrary: it is d_model divided by the number of attention heads (512 / 8), a choice that matters once multiple heads are combined. The self-attention computation itself works the same way for any choice of this dimension. These weight matrices are trained through backpropagation along with the entire model.
Each row in the query, key, and value matrices corresponds to one word’s query, key, and value vector. The query matrix is multiplied by the transpose of the key matrix to produce the “correlation matrix” shown above.
To understand why taking the dot product between the query and key matrices would produce similarity scores between words, remember that each row in the query matrix is the vectorized representation of a word acting as the query, while each row in the key matrix is the vectorized representation of a word acting as the key.
The dot product between two vectors in a two-dimensional space can be seen as a measure of the cosine similarity between the vectors, scaled by the product of their magnitudes. As an example, consider the sentence, "The man walks down the busy road carrying some books he just bought; where could he be coming from?" For the sake of visualization, let's assume that the dimensionality of the model (d_model) is equal to 2 so that the query and key vectors can be projected into a 2-dimensional space.
In this example, let's take the query vectors for the words "man" and "busy," along with the key vectors for the words "books" and "road." We would expect the query and key vectors for "man" and "books" to have a relatively large cosine similarity since the word "books" describes the "man." Similarly, we would expect the query and key vectors for "busy" and "road" to have a relatively large cosine similarity as well.
However, to a human analyzer, the relationship between "busy" and "road" may not be as relevant to the question being asked in the sentence as the relationship between "man" and "books." In other words, the phrase "busy road" might not be as useful in inferring where the man is coming from as the relationship between "man" and "books," which suggests that he might be coming from a library.
To illustrate this difference, the relative magnitude of the query and key vectors produced by the query and key weight matrices for "busy" and "road" would be smaller than for "man" and "books." Since the dot product measures not only the cosine similarity but also the product of the magnitudes of the vectors, the final dot product would give more weight to the relationship between "man" and "books" than to the relationship between "busy" and "road."
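A small worked example may help. The vectors below are invented for illustration: both pairs point in exactly the same direction (cosine similarity of 1), yet the dot product differs by a factor of 100 because of the magnitudes.

```latex
\[
q \cdot k = \lVert q \rVert \, \lVert k \rVert \cos\theta
\]
\[
q_{\text{man}} = (3, 4),\; k_{\text{books}} = (6, 8):
\quad q \cdot k = 18 + 32 = 50 = 5 \cdot 10 \cdot 1
\]
\[
q_{\text{busy}} = (0.3, 0.4),\; k_{\text{road}} = (0.6, 0.8):
\quad q \cdot k = 0.18 + 0.32 = 0.5 = 0.5 \cdot 1 \cdot 1
\]
```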
The attention matrix produced by the dot product of the query and key matrices has a shape of (length of sequence, length of sequence). Each value in the attention matrix is divided by the square root of the dimension of the key vectors (the square root of 64, which is 8, in this case). This scaling keeps the dot products from growing so large that the softmax saturates and its gradients become extremely small, which stabilizes training. The attention matrix is then passed through a softmax function, which normalizes its values to be between 0 and 1 and ensures that the values of each row in the matrix sum up to 1.
As mentioned earlier, a weighted sum is performed using the attention values and the value vectors. The normalization of the attention scores to sum up to 1 makes this weighted sum operation possible. Finally, the normalized attention matrix is dotted with the value matrix, producing a matrix of size (length of sequence, 64), which can be seen as a smaller, vectorized representation of the input sequence with attention information.
The first row of the output matrix is a weighted sum of the row vectors in the value matrix, with the weights being the attention values of the first word against every word in the input sequence, itself included.
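The whole computation, softmax(QKᵀ / √d_k)V, fits in a short NumPy sketch. The random matrices below are stand-ins for a real input and for trained weights, and the function names are chosen for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) attention matrix
    weights = softmax(scores, axis=-1)  # each row now sums to 1
    return weights @ V, weights         # weighted sums of the value vectors


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 512, 64

# Random stand-ins for the encoder input and the trained projection matrices.
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
```

`output` has shape (length of sequence, 64), and its first row equals `weights[0] @ V`, the weighted sum described above.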
Note that the output matrix has a size of (length of sequence, 64) and not (length of sequence, 512). The reason for this will become clear in the next section, where we discuss multi-head self-attention and the process of upscaling the "attention word embedding." It is important to remember that the output matrix should have the same size as the original word embedding since it will be used as the input to the next encoder layer, which, in the case of the first encoding layer, expects the word embedding as input.
Multi Head Self Attention
Multi-head self-attention is precisely what it sounds like: applying multiple “attention heads” to the same sequence. The same self-attention mechanism is applied eight times to the same input sequence in parallel, each time with its own weights. For each attention head, its query, key, and value weight matrices are initialized randomly in hopes that each head will capture different types of information from the input sequence.
Each attention head produces a matrix of shape (length of sequence, 64); they are then concatenated along their second dimension, creating a matrix of shape (length of sequence, 8*64). A linear projection is performed on this matrix to “combine” the knowledge of all the heads. The weight matrix used for the linear projection is trained through backpropagation along with the rest of the model.
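The concatenate-then-project step can be sketched as follows. The weights are random stand-ins for trained parameters, the heads are computed in a simple loop for readability (real implementations batch them), and the names are chosen for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    heads = []
    for w_q, w_k, w_v in zip(W_Q, W_K, W_V):  # one set of projections per head
        Q, K, V = X @ w_q, X @ w_k, X @ w_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(weights @ V)             # (seq_len, d_k)
    concat = np.concatenate(heads, axis=-1)   # (seq_len, h * d_k)
    return concat @ W_O                       # linear projection that mixes the heads


rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 512, 8
d_k = d_model // h  # 64

# Random stand-ins for trained weights: h projection matrices each for Q, K, V.
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) / np.sqrt(d_model)

X = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Because 8 * 64 = 512, the output has the same (length of sequence, d_model) shape as the input, so it can be fed to the next layer.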
To recap, and for those who like things presented numerically, the multi-head attention mechanism can be written as:
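In the notation of the original paper, where h is the number of heads (8 here), d_k is the key dimension (64 here), and W^O is the final projection matrix:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
\[
\mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right)
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}
\]
```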
In the original Transformer paper, the authors used eight attention heads. However, later research showed that this might not be necessary. In the paper “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned,” Elena Voita et al. proposed that a small number of “specialized” attention heads do most of the work, and that these heads fall into three types. Specifically, the roles of these specialized attention heads have been hypothesized as such:
Positional head: this attention head is responsible for discovering the relationship between the relative positioning of words. Typically the highest attention score for each word points to an adjacent token.
Syntactic head: this attention head is responsible for analyzing the syntactic relationship between words.
Rare words: this attention head attends to the words that appear least frequently in an input sequence, which could be more important than more commonly occurring words, such as “the” and “a”.
Add, Norm, and Feed Forward Networks
The output from the multi-head attention block is passed through a layer normalization component, which applies normalization to the inputs of the encoder layer. This normalization helps to improve the stability and speed of training for the transformer model by ensuring that the inputs have a consistent distribution. It also helps to reduce the effect of vanishing gradients, which can slow down the training.
The key difference between batch and layer normalization is the axis along which they normalize the inputs. In batch normalization, which is more common, each feature is normalized independently using statistics computed across all samples in the batch, while in layer normalization, each sample (here, each token position) is normalized independently using statistics computed across its own features.
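The two normalizations differ only in the axis over which the mean and variance are taken: batch normalization computes them per feature across samples, layer normalization per sample across features. A simplified sketch (omitting the learned scale and shift parameters that real layers include) makes this concrete:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Statistics per sample (row), computed over that sample's features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_norm(x, eps=1e-5):
    # Statistics per feature (column), computed over all samples in the batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # (samples, features)

ln = layer_norm(x)  # every row has mean ~0 and standard deviation ~1
bn = batch_norm(x)  # every column has mean ~0 and standard deviation ~1
```

Layer normalization does not depend on the other samples in the batch, which is convenient for variable-length sequences.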
The layer normalization component is paired with a residual connection: the input to the attention block is added directly to the block's output, and the sum is then normalized. This residual connection helps to improve the flow of gradients through the model during training, which can further improve the stability and speed of training.
After the output has been normalized, it is passed through a shallow three-layer feed-forward network, which processes the output and generates the final encoded representation of the input sequence. The feed-forward network consists of two linear transformations with a ReLU activation function in between, which helps to capture complex relationships in the data.
This final step completes the encoding layer of the Transformer model. In particular, the input and output of the feed-forward network both have d_model neurons (512 in the original paper), and the middle hidden layer has 2048 neurons. Another layer normalization and residual connection (Add & Norm) are employed after the shallow three-layer feed-forward network.
In each decoder layer, the output from the last encoder layer goes through another set of linear projections learned by backpropagation, similar to those performed in the self-attention block, producing the key and value matrices used by the decoder.
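A minimal sketch of the feed-forward network followed by Add & Norm, assuming the arrangement LayerNorm(x + Sublayer(x)) from the original paper. The weights are random stand-ins and the layer normalization omits its learned scale and shift for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # first linear transformation + ReLU
    return hidden @ W2 + b2                # second linear transformation


def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)    # residual connection, then normalization


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 512, 2048

# Random stand-ins for the attention output and the trained weights.
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

y = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
```

The network is applied to every position independently, and `y` keeps the (length of sequence, d_model) shape expected by the next layer.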
After that exhaustive explanation of the encoder side of things, let's dig deeper into decoders.
Decoder Training and Inference Scheme (a.k.a. Teacher Forcing)
The decoder processes its inputs in the same manner as the encoder: first through word embeddings and then through positional encoding. To briefly review, information flows through the decoder by initially feeding it a single "start of sentence" token. The decoder then generates the next output in the sequence, which is appended to the decoder's input from the previous time step to form the input for the next one.
This iterative process continues until the decoder produces an "end of sentence" token or the user-specified limit is reached. This approach allows the decoder to generate outputs autoregressively, one element at a time. Note that every decoder layer also receives the same encoder output, from which its key and value matrices are computed, at every prediction time step.
During training, we can take advantage of the fact that we have access to the full target sequence. Instead of training the decoder autoregressively, the input to a decoder at each time step is the expected output from the beginning of the sequence to the previous time step.
This allows the decoder to make predictions based on the known sequence rather than its own predictions, which can improve the overall performance of the model. This training technique is known as teacher forcing. Teacher forcing can also allow for parallelization of training as the output for each time step can be computed independently.
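In code, teacher forcing amounts to shifting the target sequence by one position. The sketch below uses an example sentence and assumed `<sos>` / `<eos>` token names to list what the decoder sees and what it is expected to predict at each time step.

```python
target = ["<sos>", "fruits", "are", "delicious", "<eos>"]

decoder_input = target[:-1]  # everything except the final token
labels = target[1:]          # the same sequence shifted left by one

# At step t the decoder sees the true tokens up to t and must predict labels[t].
pairs = [(decoder_input[: t + 1], labels[t]) for t in range(len(labels))]

for seen, expected in pairs:
    print(seen, "->", expected)
```

Because every pair is built from the known target rather than from the model's own output, all time steps can be computed in a single parallel pass.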
Decoder Masked Attention
The decoder layer is similar in structure to the encoder layer, but it contains two attention blocks instead of one, and each of the attention blocks works slightly differently than the encoder’s.
The first attention block is the masked multi-head self-attention. This attention mechanism works the same way as the encoder's self-attention except when computing the attention matrix. Since this attention component calculates self-attention, it does not receive the key and value matrices from the encoder; those are for the next attention block.
The first decoder self-attention block uses a mask to prevent the model from attending to future words in the sequence during training. To understand why this is necessary, consider the self-attention matrix. In the first row of the matrix, each cell (other than the first one) represents how the first word attends to future words in the sequence. Since the decoder is trained using teacher forcing, the words that the decoder attends to in the first row of the matrix are the correct words. However, during inference, the model only has access to the outputs it has already generated, so these future words will not be available.
In order to prevent the model from relying during training on future tokens that it will not have during inference, a mask is applied to the attention matrix to mask out the attention values that are ahead of the current time step for each row. This helps the model to focus on the words it has generated so far and improve its predictions.
Consider the sentence "fruits are delicious," as shown in the figure above. The attention matrix is used to calculate the relationship between each word in the sentence and the rest of the words in the sequence. In a realistic scenario, when the model is calculating the relationship between the first word "fruits" and the rest of the sentence, it should not have access to the correct output words "are" and "delicious".
It is only when the model reaches the word "delicious" that it makes sense to compute attention against the other words in the sequence, as they appear earlier in the sentence. We can then assume that those earlier words are “predicted” by the model (they’re not since we’re using teacher forcing) just like during inference.
To formulate masking in terms of matrices, masking essentially applies a triangular mask to the top right portion of the attention matrix, setting those values to -inf and thus preventing the model from using that information. Note that the query, key, and value matrices are still computed the same way as they were in the encoder self-attention; the masking only comes into play before the softmax operation, where the top-right triangular portion of the attention matrix is set to -inf. Once passed through the softmax operation, those values will be “squeezed” to 0, nullifying their influence.
During inference, there are no future tokens to hide, since the model bases its predictions solely on its own outputs from previous time steps; in practice the mask is usually kept so that earlier positions are computed exactly as they were during training. Everything else about multi-head attention and the add and layer normalization steps works the same way as in the encoder self-attention.
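The masking step can be sketched with NumPy on a three-word sequence such as "fruits are delicious". The scores are random stand-ins for real query-key dot products, and the helper names are chosen for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)  # exp(-inf) evaluates to 0, so masked cells vanish
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(seq_len):
    # True strictly above the main diagonal: the "future" positions for each row.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)


rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 3))  # stand-in for scaled query-key scores

masked = np.where(causal_mask(3), -np.inf, scores)  # hide the top-right triangle
weights = softmax(masked)
```

After the softmax, the first row of `weights` puts all of its attention on the first word, the second row spreads attention over the first two words, and only the last row attends to the whole sequence.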
Decoder Cross Attention
The decoder cross attention block is a crucial part of the transformer model. After the masked multi-head self-attention block and the add and layer normalization, the decoder uses another attention block to combine the output from the encoder with its own input. This is the first time that the encoder and decoder are dealing with sequences from each other's input. The decoder computes the query matrix from its own input, while the key and value matrices come from the encoder output.
Then, attention is performed as usual, but this time the focus is on finding relationships between the input sequence and the output generated so far by the decoder. This step allows the decoder to consider how the entire input sequence relates to what has already been outputted, and use this information to make more accurate predictions about the next word in the sequence.
The decoder cross attention mechanism can be thought of as a way of reconstructing the "word embedding" representation of the decoder input, but instead of using self-attention information, it uses information from the input sequence. Essentially, each row in the newly produced decoder "word embedding" is a weighted sum of the value vectors from the encoder, where the weights are determined by the relationship of each word in the input sequence to the given word from the decoder inputs.
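A single-head sketch of cross attention shows where each matrix comes from: queries from the decoder, keys and values from the encoder output. Weights and inputs are random stand-ins, and the names are chosen for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(decoder_x, encoder_out, W_Q, W_K, W_V):
    Q = decoder_x @ W_Q    # queries come from the decoder side
    K = encoder_out @ W_K  # keys come from the encoder output
    V = encoder_out @ W_V  # values come from the encoder output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (tgt_len, src_len)
    return weights @ V, weights


rng = np.random.default_rng(0)
src_len, tgt_len, d_model, d_k = 6, 3, 512, 64

# Random stand-ins for the encoder output, decoder state, and trained weights.
encoder_out = rng.normal(size=(src_len, d_model))
decoder_x = rng.normal(size=(tgt_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

out, weights = cross_attention(decoder_x, encoder_out, W_Q, W_K, W_V)
```

The attention matrix is no longer square: it has one row per decoder position and one column per input word, and each output row is a weighted sum of the encoder's value vectors.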
The feed forward network functions the same way as the ones in the encoder layers. In the original paper, the decoder layer is stacked six times as well.
Decoder Linear and Softmax
The last layer of the Transformer decoder produces a matrix of size (length of sequence, d_model). As mentioned previously, this output can be interpreted as an enhanced version of the original word embedding, incorporating insights from the 6 encoder and 6 decoder layers. This output is then fed through a learned linear transformation, which maps each row of the matrix to a vector with length equal to the vocabulary size; during inference, the row for the last position is the one used to predict the next word. This vector is then passed through a softmax function to convert the values into probabilities. Each index of the vector represents a unique word, and the index with the highest probability is the model's prediction for the next word in the sequence.
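The final step can be sketched with a toy four-word vocabulary. The decoder output and projection weights are random stand-ins, so the chosen word is meaningless; the point is the shape of the computation.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


vocab = ["<eos>", "fruits", "are", "delicious"]  # toy vocabulary for illustration
d_model = 8                                      # toy size; the paper uses 512

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(3, d_model))       # stand-in for the last decoder layer's output
W_vocab = rng.normal(size=(d_model, len(vocab)))  # stand-in for the learned linear layer

logits = decoder_out[-1] @ W_vocab  # the last position predicts the next word
probs = softmax(logits)             # probabilities over the vocabulary
next_word = vocab[int(np.argmax(probs))]
```

Picking the highest-probability index, as done here, is greedy decoding; it matches the description above, though other selection strategies exist.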
Congratulations! That was the entire Transformer architecture. It may seem daunting at first, but when broken down into individual components, it's simply a combination of weights and matrices working together to analyze language in a similar way to humans. Here are some additional bits and pieces of information about the general training scheme of the Transformer that weren't mentioned in the explanation of the architecture:
There are various methods for training a Transformer, and the teacher forcing approach is just the most commonly used one.
The typical loss function used is the cross-entropy loss, which compares the probability distribution output by the model to the ideal probability distribution. The ideal distribution has a value of 1.0 for the element representing the correct output, and all other elements are zeroed out.
The learned linear projection that transforms tokenized text into word embeddings is shared between the encoder and decoder.
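The cross-entropy loss mentioned above is short enough to write out. Because the ideal distribution is one-hot, the sum over the vocabulary collapses to the negative log-probability of the correct word. The probabilities below are made up for illustration.

```python
import numpy as np


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: -log of the correct word's probability."""
    return -float(np.log(probs[target_index]))


probs = np.array([0.1, 0.7, 0.2])  # hypothetical model output over a 3-word vocabulary

one_hot = np.array([0.0, 1.0, 0.0])  # ideal distribution when word 1 is correct
full_sum = -float(np.sum(one_hot * np.log(probs)))  # the general formula

loss_right = cross_entropy(probs, 1)  # the model favoured the correct word
loss_wrong = cross_entropy(probs, 0)  # the correct word was given low probability
```

`full_sum` and `loss_right` are the same number, and the loss grows as the probability assigned to the correct word shrinks.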
While this article has provided a detailed overview of the Transformer’s key components and how it works, there is still much more to explore and learn about this fascinating model. Its continued development and application in various domains hold the promise of exciting advancements in the future.
Image Credits: The second image in this article is a screen-capture from the original paper describing Transformer model architecture, "Attention is All You Need" by Vaswani et al. (2017). All other diagrams and images contained in this article were created by the author.