Capturing Attention: Decoding the Success of Transformer Models in Natural Language Processing
Zian (Andy) Wang
The Transformer model, introduced in the paper “Attention is All You Need,” has influenced virtually every subsequent language modeling architecture or technique. From models such as BERT, Transformer-XL, and RoBERTa to the recent ChatGPT, which has enthralled the internet as one of the most impressive conversational chatbots yet, it is clear that the transformer-based architecture has left an undeniable imprint on the field of language modeling and on machine learning in general.
Transformers are so powerful that a study by Buck Shlegeris et al. found that Transformer-based language models outperformed humans [1, see references below] in next-word prediction tasks, with humans achieving an accuracy of 38% compared to GPT-3’s 56% accuracy. In a Lex Fridman podcast interview, computer scientist Andrej Karpathy stated that “the Transformers [are] taking over AI” and that the architecture has proven “remarkably resilient,” going even further by calling it a “general purpose differentiable computer.”
Transformers and their variants have proven to be incredibly effective in understanding and deciphering the intricate structure of languages. Their exceptional ability for logical reasoning and text comprehension has generated curiosity about what makes them so powerful.
In the same podcast, Karpathy mentioned that the Transformer is “powerful in the forward pass because it can express […] general computation […] as something that looks like message passing.” This leads to the first critical factor contributing to the Transformer’s success: the residual stream.
The Residual Stream
Typically, most people’s understanding of neural networks is that they operate sequentially. Essentially, each layer in the network receives a tensor (for those unfamiliar, in our context, “tensors” are just processed matrices from previous layers/input) as an input, processes it, and outputs it for the subsequent layer to consume. In this way, each layer’s output tensor serves as the subsequent layer’s input, creating a dependency between layers. Instead, transformer-based models operate by extracting information from a common “residual stream” shared by all attention and MLP blocks.
Transformer-based models, such as the GPT family, comprise stacked residual blocks consisting of an attention layer followed by a multilayer perceptron (MLP) layer. Regardless of MLP or attention heads, every layer reads from the “residual stream” and “writes” its results back to it. The residual stream is an aggregation of the outputs from prior layers. Specifically, every layer reads from the residual stream through linear projection and writes to it through a linear projection followed by addition.
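To make the “read through a linear projection, write through a linear projection followed by addition” idea concrete, here is a minimal NumPy sketch. It is not a real Transformer: the sizes, the `make_block` helper, and the ReLU stand-in for attention/MLP computation are all assumptions made for illustration, and layer normalization is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 8, 5   # illustrative sizes, not from any real model

def make_block(rng, d_model, d_hidden):
    """A toy 'layer' (hypothetical helper): reads via w_in, writes via w_out."""
    w_in = rng.normal(scale=0.1, size=(d_model, d_hidden))   # read projection
    w_out = rng.normal(scale=0.1, size=(d_hidden, d_model))  # write projection

    def block(stream):
        read = stream @ w_in             # linear read from the residual stream
        hidden = np.maximum(read, 0.0)   # stand-in for the attention/MLP computation
        return hidden @ w_out            # what this layer will add to the stream

    return block

blocks = [make_block(rng, d_model, d_hidden) for _ in range(4)]

stream = rng.normal(size=(n_tokens, d_model))  # token embeddings start the stream
embeddings = stream.copy()
writes = []
for block in blocks:
    update = block(stream)
    writes.append(update)
    stream = stream + update   # writing is addition; nothing is overwritten

# The final stream is the embeddings plus the sum of every layer's write.
reconstructed = embeddings + sum(writes)
print(np.allclose(stream, reconstructed))
```

Because every write is an addition, the final stream is just the original embeddings plus the sum of all layer outputs, which is what lets later layers pick up information from any earlier layer rather than only the one directly before them.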
To provide a high-level understanding of how the residual stream contributes to Transformer-based models, imagine a team of data scientists working on a machine learning problem. They start by collecting and storing a large dataset in a database. They divide the work among themselves based on their specialties. The first few members of the team, like data engineers and data analysts, perform initial processing on the data. They might clean and organize it or identify patterns and trends.
They then pass the “preprocessed” data to the machine learning engineers, who use it to build and train predictive models. In a transformer-based model, we can think of each layer as a team member, with each layer (or series of layers) having its specialty, and they can learn to work together like a real team through training. For example, a set of attention blocks might work together like a “data engineering” team responsible for moving and manipulating data samples and removing useless information (and yes, in reality, attention blocks can do that!).
On the other hand, a series of MLP layers towards the end of the model can act as tiny neural network ensembles to transform the processed dataset into meaningful outputs. Furthermore, like how a real team requires collaboration to succeed, every layer in the model can communicate with any layer ahead of it! Now, what does this have to do with the residual stream?
The residual stream is the backbone that stitches the layers together, allowing them to function like the roles and jobs described above. It is the “database” in which the dataset is stored and the “messaging app” that every layer uses. It is a pool of aggregated knowledge from every “team member.” Any of those data engineering layers can extract a subset of the data from the database, do some fancy “feature engineering” to it, and put it back in the database. Similarly, any layer can ignore data that does not “fit” its specialty.
Additionally, each layer can send “data” to any future layers, and only those layers can receive this “data.” This allows a specific subset of information to be only processed by layers with appropriate functionalities. Communication can also take the form of layers “requesting” information from a set of previous layers. The presence of the residual stream poses an entirely different scheme of learning and significant flexibility that typical feed-forward neural networks do not possess. Without it, every layer of the network would be “isolated” data scientists without much of a specialty, since every layer has to deal with the entire dataset. Just imagine if every team member could only talk to the person sitting next to them!
One crucial implication of the residual stream, as implied by the comparison above, is that the model tends to be more localized, and each layer has the freedom to selectively process the information presented to it. Empirically, localization in Transformer-based models was observed well before the notion of the residual stream was popularized.
For instance, Ian Tenney et al.’s paper “BERT Rediscovers the Classical NLP Pipeline” [3] mentions that “syntactic information is more localizable, with weights related to syntactic tasks tending to be concentrated on a few layers” in an analysis of BERT models. Furthermore, in the AlphaFold2 paper “Highly accurate protein structure prediction with AlphaFold” [4], John Jumper et al. observe that “For very challenging proteins such as ORF8 of SARS-CoV-2 (T1064), the network searches and rearranges secondary structure elements for many layers before settling on a good structure.”
The AlphaFold2 paper implies that the model is able to extract subsets of what has already been analyzed by other layers, further process it, and add it back to the original data. Then, layers further down the network would take the same subset and continue to revise and improve the analysis done by previous layers. Remember from the comparison above that past layers can send information to future layers that are not necessarily directly adjacent. The information will only be read and understood by those “receiving layers.” The mechanism behind this sort of “communication” is quite simple.
The residual stream is typically a high-dimensional vector space that is many times “wider” than the dimension of each attention head. Here, the term “width” refers to the hidden size (equal to the embedding dimension) of the model, which ranges from hundreds to thousands (for example, 768 or 1024 in BERT and GPT models). In contrast, each attention head encodes information from the residual stream into a much smaller dimension (64 in the case of BERT and GPT). When attention heads “write” information to the residual stream, they can write to entirely different subspaces of the residual stream and not interact with each other. On the other hand, each individual attention head can also read from specific subspaces written by other attention heads, effectively achieving a “communication” effect.
Intuitively, it can be helpful to think of each layer as having its own “encoding language” that encodes its output as it writes to the residual stream. This results in different types of signals in the residual stream that can only be interpreted by layers that possess a “decoding language” that allows them to read and understand the information from previous layers attempting to communicate with them.
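The subspace picture can be illustrated with a small NumPy experiment. This is an idealized sketch: it assumes the stream has an exactly orthonormal basis that can be cut into clean 64-dimensional “slots” (the `slot` helper is hypothetical), whereas trained models only learn approximately separated subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 768, 64          # sizes quoted in the text for BERT/GPT-style models
n_slots = d_model // d_head        # 12 disjoint 64-dimensional subspaces fit in the stream

# Assumption: an orthonormal basis of the stream, cut into 64-dimensional "slots".
basis, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

def slot(i):
    """Columns spanning the i-th hypothetical subspace, shape (d_model, d_head)."""
    return basis[:, i * d_head:(i + 1) * d_head]

stream = np.zeros(d_model)
msg_a = rng.normal(size=d_head)
msg_b = rng.normal(size=d_head)
stream += slot(0) @ msg_a          # head A writes its message into slot 0
stream += slot(1) @ msg_b          # head B writes its message into slot 1

read_0 = slot(0).T @ stream        # a later head tuned to slot 0 recovers msg_a
read_1 = slot(1).T @ stream        # a later head tuned to slot 1 recovers msg_b
read_5 = slot(5).T @ stream        # a head tuned to an unused slot sees nothing

print(np.allclose(read_0, msg_a), np.allclose(read_1, msg_b), np.allclose(read_5, 0.0))
```

Both messages share the same 768-dimensional vector, yet each reader only “hears” the message written in the subspace it projects onto. That is the sense in which a layer’s “encoding language” is only understood by layers with the matching “decoding language.”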
Now you might ask, “The paper that introduced the Transformer is titled ‘Attention is All You Need’, not ‘Residual Connections are All You Need’, and ResNets existed long before Transformers; how do attention layers contribute to the success of Transformers?” This is where specialized attention heads come in. The residual stream might serve as the “backbone” of the ML team, but without intelligent and efficient “team members”, Transformers would not be any different from a typical MLP. Before talking about specialized attention heads, however, it is worth noting that research offers a different way of seeing the functionality of multi-head attention layers than the usual understanding.
Multi-Head Attention
In the original Transformer paper, “Attention is All You Need” [5], multi-head attention was described as a concatenation of the outputs of every attention head. Specifically, the output matrices from the attention heads are concatenated along the feature dimension, then multiplied by an output weight matrix of size (number of attention heads × head dimension, hidden size). However, notice that this weight matrix can be “decomposed” into blocks of rows, one block for each attention head. If we do the algebra, we see that multi-head attention is simply an additive process in which each attention head’s output is multiplied by its own “result matrix” and the results are summed together!
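The algebra can be written out in one line. The notation here is an assumption for illustration: $h_i$ is the output of head $i$ with shape (tokens, $d_{\text{head}}$), $H$ is the number of heads, and $W_O$ is the output weight matrix with shape ($H \cdot d_{\text{head}}$, $d_{\text{model}}$), split into one block of rows $W_O^{(i)}$ per head.

```latex
\mathrm{MultiHead}(X)
  = \begin{bmatrix} h_1 & h_2 & \cdots & h_H \end{bmatrix} W_O
  = \begin{bmatrix} h_1 & h_2 & \cdots & h_H \end{bmatrix}
    \begin{bmatrix} W_O^{(1)} \\ W_O^{(2)} \\ \vdots \\ W_O^{(H)} \end{bmatrix}
  = \sum_{i=1}^{H} h_i \, W_O^{(i)}
```

The last step is ordinary block matrix multiplication: each head’s columns in the concatenated matrix only ever meet that head’s rows in $W_O$.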
The summation reveals that each attention head works entirely independently. This means that the attention heads do not have to provide similar functionality; rather, each head in every attention layer can do completely different things. The original explanation for multi-head attention in the “Attention is All You Need” paper leaves it somewhat unclear whether each attention head is functionally independent. It may be more computationally efficient to concatenate and multiply once than to project each head separately and sum, but the summation definition aligns more closely with empirical evidence and findings.
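The equivalence between the “concatenate, then project” view and the “project each head, then sum” view is easy to check numerically. In the sketch below, the sizes are arbitrary and the head outputs are random stand-ins rather than real attention outputs; only the final projection step is being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_heads, d_head, d_model = 5, 4, 8, 32   # illustrative sizes

# Random stand-ins for each head's output (a real model would compute attention here).
head_outputs = [rng.normal(size=(n_tokens, d_head)) for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model))  # output projection

# View 1: concatenate along the feature axis, then project once.
concat_view = np.concatenate(head_outputs, axis=-1) @ w_o

# View 2: give each head its own block of rows from w_o and sum the results.
w_o_blocks = np.split(w_o, n_heads, axis=0)   # n_heads blocks of shape (d_head, d_model)
sum_view = sum(h @ w for h, w in zip(head_outputs, w_o_blocks))

print(np.allclose(concat_view, sum_view))
```

The two views agree up to floating-point error, so each head can be read as making its own independent additive contribution to the residual stream.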
A large amount of the expressiveness and complexity of understanding that Transformer-based models possess stems from the fact that each query, key, and value matrix from every attention head can “communicate” through the residual stream. Therefore, attention heads can “work together” to achieve operations more complex than simply analyzing the naive token representation. For instance, one attention head might replace phrases with vague meanings with their intended tokens found earlier in the text. Then a future attention head’s query matrix may be composed of these “replaced” phrases, which helps it better compute the attention matrix. Remember, these changes can only be seen by the layers that “need” them, while other layers can still access the original, unchanged data if needed, thanks to the large “bandwidth” of the residual stream.
Have you ever wondered how ChatGPT remembers everything you say with incredible accuracy? Well, the answer lies in the formation of specialized attention heads within Transformer-based models called induction heads [6]. Specifically, these heads are thought to be a key cause of a phenomenon called “in-context learning,” where the model loss decreases the more tokens it predicts (this is why ChatGPT can remember so far back into your conversations). Analogous to inductive reasoning, induction heads conclude that if phrase/word [A] is followed by [B] earlier in the text, the next time we encounter [A], it is likely to be followed by [B] as well. Moreover, induction heads are more potent than simply recognizing and repeating a pattern. In large language models, induction heads understand more abstract representations where they can generalize the pattern of [A] -> [B] to phrases that do not precisely match [A] but represent a similar idea in the word embedding. To see this rather complex “pattern matching,” refer here for a live demonstration.
Take the sentence “The cat jumped off the cliff” as an example. Induction heads would remember that “the cat” is typically associated with “jumped off the cliff.” The next time the model encounters the phrase “the cat”, those tokens’ attention values will point to the previous occurrence of “jumped off the cliff,” thus increasing the chance that the model will output “jumped off the cliff”. More generally, induction heads search for previous occurrences of the current token and copy what comes after it according to the current context.
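Stripped of all the neural machinery, the rule “find a previous occurrence of the current token and copy what came after it” can be written as a few lines of plain Python. The `induction_guess` function is a hypothetical helper for illustration: it captures only the exact-match behaviour, not the fuzzy, embedding-based matching that real induction heads are described as performing.

```python
def induction_guess(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions (excluding the last one).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

text = "the cat jumped off the cliff and then the cat".split()
print(induction_guess(text))  # prints "jumped"
```

Having seen “cat” followed by “jumped” earlier, the rule proposes “jumped” the next time “cat” appears.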
However, if we “take a look under the hood,” induction heads’ functionality is actually a collaboration between multiple attention heads spanning the network. An attention head earlier on in the network would tell the key matrix of induction heads to extract tokens one position back (remember, this kind of communication can be achieved through the residual stream). Next, when the induction head computes attention scores, the query matrix would look for similar tokens in the key matrix to attend to. But since the token representation in the key matrix is shifted one position backward, the computed attention matrix is actually paying attention to positions one after the actual position of the word provided by the key matrix. Essentially, this creates the effect of the model paying high attention to “jumped off the cliff” for the phrase “the cat” since in the key matrix, what’s supposed to be “jumped off the cliff” is replaced by “the cat” and its tokens.
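The two-head collaboration described above can be mimicked with hand-built matrices. This is a toy under strong assumptions: the embeddings are one-hot, the “previous-token head” is hard-coded as a one-position shift rather than learned, and the `sharpness` factor is a made-up scale that makes the softmax nearly one-hot.

```python
import numpy as np

tokens = "the cat jumped off the cliff and then the cat".split()
vocab = sorted(set(tokens))
ids = np.array([vocab.index(t) for t in tokens])
embed = np.eye(len(vocab))[ids]            # one-hot embeddings, shape (n_tokens, vocab)

# Head 1 (previous-token head): each position receives the embedding of the
# token one step before it. Position 0 has no predecessor and gets zeros.
prev_token = np.zeros_like(embed)
prev_token[1:] = embed[:-1]

# Head 2 (induction head): the query is the current token; the keys are the
# "previous token" features that head 1 wrote for every position.
query = embed[-1]                          # current token: "cat"
keys = prev_token
scores = keys @ query                      # 1 where the preceding token matches
sharpness = 10.0                           # hypothetical scale so the softmax is peaked
weights = np.exp(sharpness * scores)
weights /= weights.sum()

attended = int(np.argmax(weights))
print(attended, tokens[attended])          # prints "2 jumped"
```

Because the keys carry the identity of the previous token, matching the query “cat” against them lands the attention on position 2, the token right after the earlier “cat.” Copying that token forward gives the induction behaviour.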
Remember that the above explanation is a rather intuitive, non-technical interpretation of induction heads. To see how the math works, refer here. Interestingly, not all induction heads in Transformer-based models rely on attention heads earlier in the network, copying the representation of previous tokens to the induction heads’ key matrix. Attention heads in some networks, such as GPT-2, have been observed to extract the positional embedding of similar tokens and “tell” the induction heads to rotate those positional embeddings forward, getting the tokens that come after the current one.
The pattern-matching characteristic of induction heads can be generalized to remembering that a specific type of phrase is almost always adjacent to some other type. One obvious job that an induction head can accomplish is discovering grammatical rules, such as how an adverb typically modifies a verb or how an adjective typically describes a noun. On a larger scale, induction heads can maintain a consistent writing style and provide contextual information to tokens situated later in the text.
The Versatility of Transformer-based Models
Of course, induction heads and the residual stream are far from the only factors behind the extraordinary performance of Transformer-based models, but they play a huge role in the “intelligence” of Transformers. Modern-day large language models (LLMs) contain hundreds of attention heads, and each could serve an individual purpose in comprehending the input text at nearly, if not above, the human level. Researchers are far from fully interpreting LLMs, but one recurring theme observed throughout the attempts at breaking down Transformer-based models is the importance of the residual stream and the specialization of each component that it brings.
Ultimately, the residual stream opens the door for models to think and learn like humans. We are not black boxes that pile everything that we know into one stream of operations: we have the ability to break down the problem, analyze it from different perspectives, and synthesize knowledge from different areas to arrive at a solution. Transformer-based models can draw on the same kinds of tools and strategies that humans leverage; their architectural design is not just a massive step in NLP but rather a game-changing concept in machine learning. Transformer-based models have not only stood the test of time in the field of NLP; they are also one of the most versatile architectures, being actively used in image processing [7, 8, 9], tabular data [10, 11], recommendation systems [12], reinforcement learning [13], and generative learning [14, 15]. Transformers are not just a model design but a step closer to allowing AI systems to think like humans.
References
[1] Shlegeris, et al. (2022). Language models are better than humans at next-token prediction. https://arxiv.org/abs/2212.11281.
[2] Elhage, et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
[3] Tenney, I., Das, D., & Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline. Annual Meeting of the Association for Computational Linguistics.
[4] Jumper, J.M., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Zídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S.A., Ballard, A., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D.A., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A.W., Kavukcuoglu, K., Kohli, P., & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.
[5] Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. https://arxiv.org/abs/1706.03762.
[6] Olsson, et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929.
[8] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. https://arxiv.org/abs/2005.12872.
[9] Yang, F., Yang, H., Fu, J., Lu, H., & Guo, B. (2020). Learning Texture Transformer Network for Image Super-Resolution. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5790-5799.
[10] Huang, X., Khetan, A., Cvitkovic, M., & Karnin, Z.S. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. https://arxiv.org/abs/2012.06678.
[11] Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C.B., & Goldstein, T. (2021). SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. https://arxiv.org/abs/2106.01342.
[12] Pohan, H.I., Warnars, H.L., Soewito, B., & Gaol, F.L. (2022). Recommender System Using Transformer Model: A Systematic Literature Review. 2022 1st International Conference on Information System & Information Technology (ICISIT), 376-381.
[13] Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., & Mordatch, I. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling. https://arxiv.org/abs/2106.01345.
[14] Jiang, Y., Chang, S., & Wang, Z. (2021). TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up. Neural Information Processing Systems.
[15] Hudson, D.A., & Zitnick, C.L. (2021). Generative Adversarial Transformers. https://arxiv.org/abs/2103.01209.