From Turing To GPT-4: 11 Papers that Shaped AI's Language Journey
By Jason D. Rowley
If you’re just getting started in the world of artificial intelligence and applied machine learning, you’ve taken the plunge during what’s probably the most exciting period in the history of the field (at least, so far).
As we write this, in late July 2023, it seems like there’s some new paper, ever more capable model, or novel tech demonstration released every day. And it’s felt like this… for months. One might’ve surmised that mid-March marked some local maximum in AI releases—GPT-4, Midjourney V5, Stanford’s Alpaca 7B, and PyTorch 2.0 were all released in the same week—but the dizzyingly fast AI race continues apace.
All this being said, it’s important to keep in mind that it’s been a long, often uphill slog—with plenty of booms and busts along the way—to get to this point. Not even including the millennia-long evolution from mechanical computers like the Antikythera mechanism to digital computers like ENIAC, and centuries of mathematical discoveries that led to the development of formal computational logic in the 19th century, the field we call “artificial intelligence” is only a little over 70 years old.
The Briefest History of Language AI
Unlike generative AI image models such as DALL-E 2, Midjourney, and Stable Diffusion (which had a fleeting, if exceedingly bright, moment in the spotlight in the latter half of 2022), interest in large language models has been more of a long, slow burn before bursting into the public consciousness.
Here’s a quick timeline of how we got to the present day:
The Imitation Game (October 1950). Alan Turing published his paper, “Computing Machinery and Intelligence,” in Mind. In it, he describes the Imitation Game, commonly known today as “the Turing Test.” Here’s how it works: a human interrogator communicates with two hidden entities, one human and one machine, through text-based messages. The interrogator’s objective is to determine which entity is the machine by posing questions; if the machine can consistently imitate human responses and deceive the interrogator, it is said to pass the test, exhibiting a form of artificial intelligence.
Backpropagation (October 1986). Rumelhart, Hinton, and Williams published “Learning Representations by Back-propagating Errors,” a foundational text in modern machine learning. It describes the backpropagation technique for training a neural network. Basically, backpropagation works by comparing the network’s output to the correct answer, then going backwards through the network, adjusting its internal “knobs” (called weights) to reduce the difference between the predicted and correct answers, ultimately making the network better at its task.
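That loop of forward prediction, error measurement, and backwards weight adjustment can be sketched in a few lines. Here’s a toy example, assuming a one-hidden-layer network trained on XOR; the layer sizes, learning rate, and task are illustrative choices, not details from the paper:

```python
import numpy as np

# A toy sketch of backpropagation: a one-hidden-layer network learning XOR.
# Layer sizes, learning rate, and task are illustrative, not from the paper.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # correct answers

W1 = rng.normal(0, 1, (2, 8))  # input-to-hidden weights (the "knobs")
W2 = rng.normal(0, 1, (8, 1))  # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # Forward pass: compute the network's prediction.
    h = sigmoid(X @ W1)
    pred = sigmoid(h @ W2)
    losses.append(float(np.mean((pred - y) ** 2)))

    # Backward pass: propagate the error from the output back through the
    # network, computing how much each weight contributed to it.
    d_pred = 2 * (pred - y) * pred * (1 - pred)
    d_W2 = h.T @ d_pred
    d_h = (d_pred @ W2.T) * h * (1 - h)
    d_W1 = X.T @ d_h

    # Nudge each weight downhill to shrink the error.
    W2 -= 0.5 * d_W2
    W1 -= 0.5 * d_W1
```

Run long enough, the recorded losses shrink: the same two-step recipe, scaled up enormously, is still how today’s models are trained.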
Neural Language Models (February 2003). Based on research completed in 2001, Bengio, Ducharme, Vincent, and Jauvin published “A Neural Probabilistic Language Model” in the Journal of Machine Learning Research. In the paper, the authors introduce the idea of using neural networks to learn distributed word representations, paving the way for applied deep learning in natural language processing tasks.
Unified Deep Neural Architecture for NLP (July 2008). Collobert and Weston published their paper, “A Unified Architecture for Natural Language Processing: Deep Neural Networks With Multitask Learning,” which introduced a novel approach to NLP: using a single deep neural network to accomplish a variety of tasks. Another significant contribution of the paper was pre-trained word embeddings, which were learned from a large text corpus without supervision.
Word2Vec (December 2013). Mikolov et al. published “Distributed Representations of Words and Phrases and their Compositionality,” introducing the Word2Vec algorithm and changing how machine learning models learn word embeddings. Word2Vec represents words as dense vectors in a continuous space, capturing semantic and syntactic information based on the contexts in which words appear.
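Word2Vec proper trains a shallow neural network (the skip-gram and CBOW objectives). As a small, deterministic stand-in, the sketch below builds dense vectors by factorizing a word co-occurrence matrix with SVD; the corpus and dimensions are made up, but the core intuition is the same: words that show up in similar contexts end up with similar vectors.

```python
import numpy as np

# Illustrative stand-in for Word2Vec: dense word vectors from a factorized
# co-occurrence matrix. The corpus and vector size are toy choices.
corpus = ("the cat chased the mouse and the dog chased the mouse "
          "the king ruled the land and the queen ruled the land").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each word appears within one position of each other word.
counts = np.zeros((len(vocab), len(vocab)))
for pos, word in enumerate(corpus):
    for off in (-1, 1):
        if 0 <= pos + off < len(corpus):
            counts[idx[word], idx[corpus[pos + off]]] += 1

# Compress each word's context profile into a dense 4-dimensional vector.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
vectors = U[:, :4] * S[:4]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" and "dog" occur in identical contexts here, so their vectors align
# far more closely than, say, "cat" and "land".
sim_cat_dog = cosine(vectors[idx["cat"]], vectors[idx["dog"]])
sim_cat_land = cosine(vectors[idx["cat"]], vectors[idx["land"]])
```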
RNN Encoder-Decoder (September 2014). Cho et al. published their paper, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” which, as the title suggests, introduced the RNN Encoder-Decoder framework. This framework marked a departure from traditional statistical machine translation methods by employing an end-to-end neural network architecture to learn both the input and output sequences' representations.
seq2seq (December 2014). In their paper, “Sequence to Sequence Learning With Neural Networks,” Sutskever, Vinyals, and Le introduced sequence-to-sequence models. The seq2seq framework marked a significant milestone in NLP research: it enabled neural network models to handle variable-length input and output sequences, and it has since been applied to a wide range of tasks, including machine translation, text summarization, question answering, and conversational AI.
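The variable-length trick is easiest to see in code. Below is a shape-level sketch (random, untrained weights; sizes are illustrative): an encoder RNN folds any number of input vectors into one fixed-size state, and a decoder RNN unrolls from that state for any number of output steps.

```python
import numpy as np

# Shape-level sketch of the seq2seq idea. Weights are random and untrained;
# the point is only that input and output lengths are independent.
rng = np.random.default_rng(0)
d = 8  # hidden state size (illustrative)
W_xh = rng.normal(0, 0.1, (d, d))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (d, d))  # hidden-to-hidden (recurrent) weights

def encode(inputs):
    # Read a variable-length sequence into one fixed-size summary vector.
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

def decode(h, steps):
    # Unroll from the summary for any number of output steps.
    outputs = []
    for _ in range(steps):
        h = np.tanh(h @ W_hh)
        outputs.append(h)
    return outputs

summary = encode(rng.normal(size=(3, d)))  # a 3-token input...
out = decode(summary, 5)                   # ...decoded into 5 output steps
```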
Transformer models (December 2017). Vaswani et al. published their groundbreaking paper, “Attention Is All You Need,” introducing the Transformer architecture, which achieved state-of-the-art performance on a variety of NLP tasks, including machine translation and language understanding. The Transformer paved the way for large-scale pre-trained language models such as BERT, GPT, and T5, which have significantly advanced the field of NLP and led to major breakthroughs in a range of applications.
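At the heart of the Transformer is scaled dot-product attention, which the paper defines as softmax(QKᵀ/√d_k)V. Here is that formula directly in NumPy; the sequence length and dimension below are illustrative:

```python
import numpy as np

# Scaled dot-product attention, as defined in "Attention Is All You Need".
# Shapes (5 tokens, dimension 8) are illustrative.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to each other
    weights = softmax(scores)        # each row is a distribution summing to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))  # queries, one per token
K = rng.normal(size=(5, 8))  # keys
V = rng.normal(size=(5, 8))  # values
out, weights = attention(Q, K, V)
```

Each output row is a weighted mix of all the value vectors, which is what lets every token draw on context from the entire sequence at once, with no recurrence.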
Generative Pre-trained Transformer (June 2018). Radford et al. published “Improving Language Understanding by Generative Pre-Training,” ushering in the GPT era. Based on the Transformer architecture proposed by Vaswani et al., GPT combines unsupervised pre-training with supervised fine-tuning to improve its performance on a variety of NLP tasks. The introduction of GPT marked a major shift in NLP research, demonstrating the effectiveness of pre-trained language models across a range of tasks. Its success led to subsequent iterations, such as GPT-2 and GPT-3, which further pushed the boundaries of what large-scale pre-trained language models could achieve.
BERT (June 2019). Devlin et al. published “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” which introduced a revolutionary pre-trained language model (i.e., BERT) that significantly advanced the state of natural language processing and understanding. BERT’s power comes from its ability to learn rich, bidirectional representations during pre-training, which enables it to better capture the context and semantics of language. The result is significant improvements in performance and generalization across numerous NLP tasks, making BERT a highly influential and foundational model in the field.
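“Bidirectional” is concrete in BERT’s masked-language-model pre-training objective: hide some tokens and predict them using context from both sides. The paper masks roughly 15% of tokens at random; the toy sketch below masks one token by hand to show the setup.

```python
# Toy illustration of BERT's masked-language-model setup. BERT masks ~15%
# of tokens at random; here one token is masked by hand for clarity.
tokens = "the capital of france is paris".split()
masked = list(tokens)
masked[5] = "[MASK]"  # the model must infer "paris" from both directions

print(" ".join(masked))  # prints: the capital of france is [MASK]
```

A left-to-right model like GPT only sees “the capital of france is” when predicting; BERT also gets to use any words that follow the blank, which is what makes its representations bidirectional.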
GPT-3 (June 2020). Brown et al. published “Language Models are Few-Shot Learners,” introducing GPT-3 to the world, a monumental moment in the AI and NLP landscapes. GPT-3 is an autoregressive LLM with 175 billion parameters; its predecessor, GPT-2, had only 1.5 billion. Few-shot learning, if you are unfamiliar, is a technique that enables AI models to learn from a small number of examples, often fewer than 10, and generalize to new situations. In effect, GPT-3 picks up on the patterns and concepts contained in a user’s prompt and generates outputs that follow the form of the examples presented in its input.
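A few-shot prompt is nothing more than worked examples followed by a new input, all in plain text. The sketch below builds one (the translation task and examples are illustrative, not from the paper); the model picks up the pattern from the examples alone, with no weight updates.

```python
# Build an illustrative few-shot prompt: a handful of worked examples,
# then a new input for the model to complete in the same pattern.
examples = [
    ("cheese", "fromage"),
    ("apple", "pomme"),
    ("book", "livre"),
]
query = "house"

prompt = "Translate English to French.\n\n"
for en, fr in examples:
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"

# `prompt` would be sent to the model, which is expected to continue the
# pattern with the translation.
print(prompt)
```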
So here we are, tired and whiplashed from months of press releases and academic papers covering the most advanced and disruptive technology we’ve seen in years.
And we’ve only scratched the surface of what is to come. It’s important to remember that ChatGPT, the language model that rocketed LLMs to the forefront of public discourse about the future of technology and artificial intelligence, was released (at time of writing) only about 7 months ago, in November 2022. GPT-4, one of the most hotly anticipated models yet, was made generally available in March 2023. As another example: Meta’s first Llama model was released in February 2023, and just 5 months later, we’ve got a new, open-weight Llama 2 model that’s licensed for commercial use. It certainly does feel like the pace of innovation is accelerating, and the outcome is, at this point, anyone’s guess and, more pointedly, anyone’s game.