In the previous article on Building the LLM Stack, we covered various popular LLM architectures in depth, detailing their additions and changes compared to the original Transformer introduced in 2017. But LLMs can’t perform any tasks without training! An empty architecture is as useless as an empty text editor without code.

In this article, we are going to cover the training process of LLMs, specifically the pre-training phase.

Stages of LLM Training

In most modern Large Language Models, training is executed in two stages: the pre-training stage and the fine-tuning stage.

In the pre-training stage, models are trained on an enormous corpus of text, typically spanning many diverse areas and gathered from every corner of the internet. This training stage is intended to “teach” the model how human language actually “works”. There is no particular goal that the training is aimed at, other than gaining a comprehensive understanding of human language as a whole.

In this stage, the most important component is not how the model is trained but the data used for pre-training. As we will explore in this article, most models are trained using the same mechanism.

In the second stage of training, fine-tuning, the model is tailored to a specific task, whether that is a conversational chatbot, a question-answering assistant, or any other language-related use case.

Just as the training of Large Language Models is executed in two main stages, the journey of education mirrors this process closely. Imagine the pre-training phase of LLMs as the foundational years in school—elementary, middle, and high school. In these early years, students are exposed to a broad curriculum, covering a wide array of subjects from mathematics and science to literature and the arts. This broad spectrum education is crucial, as it lays the groundwork for understanding the world at large. Similarly, during the pre-training stage, LLMs are fed a vast and diverse range of text, learning the intricacies of human language without a specific end-goal in sight, much like a student learning about the world in its broadest strokes.

Transitioning to the fine-tuning stage of LLM training parallels entering college or university, where the choice of a major allows students to specialize in a field of their interest. This is like the fine-tuning of LLMs, where models, now equipped with a general understanding of language, are further trained on specific tasks—be it generating human-like chat responses, answering complex questions, or any other specialized language task. This stage allows the models to refine their capabilities, focusing their vast knowledge to excel in particular areas, just as a college education hones a student’s skills and knowledge in their chosen field.

In this article, our focus is on the pre-training stage of LLMs: how the training works and, most importantly, how the data is collected and processed. Furthermore, we will cover the evolution of pre-training datasets, from the original GPT model to modern LLMs, shedding light on the importance of data in Large Language Models.

The Pre-Training

There are generally two flavors of Large Language Model pre-training: Causal Language Modeling and Masked Language Modeling.

Causal Language Modeling was popularized by the original GPT paper, whose main contribution was introducing the Generative Pre-Training that became the nuts and bolts of modern LLMs. Causal Language Modeling involves training the model to predict the next token in a sequence of text. In other words, it can be described as “next-token prediction”, where the model attempts to generate the next token based on the previous context.

Masked language modeling vs. causal language modeling

On the other hand, Masked Language Modeling, introduced by BERT, aims to encourage the model to build an understanding of the text from both directions. Instead of predicting the immediate next token, Masked Language Modeling randomly masks tokens in a corpus of text, requiring the model to “fill in the blanks” based not only on the preceding context but also on the text that follows (thus the name "Bidirectional Encoder Representations from Transformers").
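
The masking step can be sketched in a few lines. This is a minimal illustration rather than BERT's exact recipe: the 15% default masking rate follows what the BERT paper reports, but the `[MASK]` string, the `mask_tokens` helper, and the word-level tokens are assumptions made for readability, and BERT's occasional substitution of random or unchanged tokens is left out.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; real tokenizers use a reserved token id


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token wherever a mask was applied and
    None elsewhere, so a loss would only be computed on masked positions.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)       # the model must recover it
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    masked, labels = mask_tokens(words, mask_prob=0.3, rng=random.Random(0))
    print(masked)
    print(labels)
```

Because the model sees the unmasked tokens on both sides of each blank, it can use context from either direction, which is the contrast with next-token prediction.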

Modern pre-training techniques typically utilize Causal Language Modeling. While most papers do not specifically mention this point, it is assumed to be the default method of training, especially for the decoder-only models that are the standard in modern LLMs.

Causal Language Modeling

In Causal Language Modeling, the entire dataset of text is first tokenized and then chunked into sequences. The choice of tokenization method is determined by the model selected, rather than the pre-training parameters. Currently, most Large Language Models (LLMs) utilize a form of Byte Pair Encoding (BPE) tokenization. For further details on this, refer to the previous article in the series.

The length of each sequence depends on the maximum context length of the model. A sequence can span more than one document, depending on how the original dataset is structured. These sequences are the inputs to the model.
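
A rough sketch of this chunking step is shown below. It assumes the documents are already tokenized into integer ids and joined with a separator token before being cut into fixed-length pieces; the `pack_documents` helper, the `EOS_ID` value, and the choice to drop the leftover tail are illustrative assumptions, not a description of any particular model's pipeline.

```python
EOS_ID = 0  # hypothetical end-of-text token id used to separate documents


def pack_documents(tokenized_docs, context_length, eos_id=EOS_ID):
    """Concatenate tokenized documents and cut them into fixed-length sequences.

    A sequence may start in one document and end in the next; the
    separator token marks where the boundary was.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Drop the incomplete tail so every sequence has the same length.
    usable = len(stream) - len(stream) % context_length
    return [stream[i:i + context_length] for i in range(0, usable, context_length)]


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
    for sequence in pack_documents(docs, context_length=4):
        print(sequence)
    # [5, 6, 7, 0]
    # [8, 9, 0, 10]
    # [11, 12, 13, 0]
```

The second sequence in the example crosses a document boundary, which is the situation described above.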

For labels, the input sequences are shifted one position to the left. This means the label at each position is the token that follows the corresponding input token in the original text.
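
The shift is easier to see on a concrete example. The sketch below uses whole words in place of token ids, and the `make_inputs_and_labels` helper is a hypothetical name; note that the inputs end up one token shorter than the sequence, because the last token has nothing after it to predict.

```python
def make_inputs_and_labels(sequence):
    """Split one sequence into model inputs and next-token labels.

    labels[i] is the token that follows inputs[i] in the original text.
    """
    inputs = sequence[:-1]  # everything except the last token
    labels = sequence[1:]   # the same tokens, shifted one position left
    return inputs, labels


if __name__ == "__main__":
    sequence = ["The", "cat", "sat", "on", "the", "mat"]
    inputs, labels = make_inputs_and_labels(sequence)
    for seen, target in zip(inputs, labels):
        print(f"{seen!r} -> {target!r}")
    # 'The' -> 'cat'
    # 'cat' -> 'sat'
    # 'sat' -> 'on'
    # 'on' -> 'the'
    # 'the' -> 'mat'
```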

The model is then trained on the dataset to predict the next token based on the previous context.
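
In practice, "trained to predict the next token" usually means minimizing the average negative log-likelihood (cross-entropy) that the model assigns to each correct next token. The sketch below assumes the model's output at each position is already available as a dictionary of token probabilities; the `next_token_loss` helper and the toy vocabulary are illustrative, and a real training loop would compute this on tensors and backpropagate through the model.

```python
import math


def next_token_loss(predicted_probs, labels):
    """Average negative log-likelihood of the correct next tokens.

    predicted_probs[i] maps each candidate token to the probability the
    model assigns it at position i; labels[i] is the true next token.
    """
    total = 0.0
    for probs, label in zip(predicted_probs, labels):
        total += -math.log(probs[label])
    return total / len(labels)


if __name__ == "__main__":
    vocab = ["cat", "sat", "on", "mat"]
    uniform = {token: 1 / len(vocab) for token in vocab}

    # A model that guesses uniformly pays log(vocabulary size) per token.
    print(next_token_loss([uniform, uniform], ["cat", "sat"]))  # about 1.386

    # A model that puts more mass on the right token gets a lower loss.
    confident = {"cat": 0.7, "sat": 0.1, "on": 0.1, "mat": 0.1}
    print(next_token_loss([confident], ["cat"]))
```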

In this phase of pre-training, the most crucial element is not how the “training” is done, since for the most part this process looks about the same from model to model. It is not difficult to understand how Large Language Models are trained; it is the dataset, and the steps taken to extract the fine-grained nuances from it, that take the most effort.

Pre-Training Datasets

Continuing from the foundation laid in understanding the pre-training process of LLMs, we now shift our focus to one of the most pivotal elements in the training of these models: the dataset.

The Common Crawl

One of the most well-known datasets used to pre-train Large Language Models, the Common Crawl, was originally designed to map the web by crawling and archiving web pages. At its core, Common Crawl is a raw representation of the internet, built up through years of web crawling and archiving. The dataset is updated monthly, ensuring a continuous influx of fresh data that reflects the evolving nature of the internet. Surprisingly, patent filings have been reported as one of the largest single sources of text in corpora derived from the Common Crawl.

The Common Crawl contains over 250 billion webpages spanning more than 17 years of internet history.

The adaptation of Common Crawl for training LLMs began to gain traction in the early 2020s. Researchers recognized its potential to provide the massive, varied corpus necessary for pre-training models to understand and generate human language.

Prior to the adoption of larger, general-purpose datasets, training typically involved smaller, specialized datasets tailored to specific tasks. Additionally, the models and technologies of the time made it nearly infeasible to parse and construct datasets of large sizes.

For example, the Penn Treebank dataset, which is composed of words tagged with their respective parts of speech, was a common benchmark for evaluating earlier recurrent models such as Long Short-Term Memory (LSTM) networks. As models grew in size, they became increasingly general-purpose, and the need for labeled datasets diminished quickly with the popularization of unsupervised pre-training techniques.

One of the first models to consider the Common Crawl dataset was GPT-2. As stated in OpenAI’s paper, “Most prior work trained language models on a single domain of text, such as news articles, Wikipedia, or fiction books. Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied domains and contexts as possible”. This was only made possible by the immense scaling of GPT-2 in terms of parameter count compared to other text-based models at the time. The largest version of the model contained more than 1 billion parameters, capable of being trained on a dataset the size of the Common Crawl without the model losing much of its information due to size limitations.

However, as noted again in the GPT-2 paper, the Common Crawl dataset by itself is an unfiltered, raw representation of webpages with much of its content unintelligible or unfit to be used for training. Previous works obtained much better results by training on only a subset of the Common Crawl dataset.


GPT-2 did not end up using the Common Crawl dataset. Instead, its authors created their own scrape called “WebText”, which contained only web pages that had been filtered and curated by humans. The WebText used to train GPT-2 contained slightly over 8 million documents and came to about 40 gigabytes in size, a fraction of the size of the Common Crawl.

WebText consisted of pages scraped from outbound links on Reddit that received at least 3 karma, which can serve as “a heuristic indicator for whether other users found the link interesting, educational, or just funny”.
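
As a rough illustration of this kind of heuristic filter, the sketch below keeps the outbound links whose posts reach a karma threshold and drops repeated URLs. The post dictionaries, their `url` and `karma` fields, and the `select_outbound_links` helper are hypothetical stand-ins, not the format OpenAI used.

```python
def select_outbound_links(posts, min_karma=3):
    """Keep URLs from posts that reached the karma threshold, without repeats."""
    seen = set()
    selected = []
    for post in posts:
        url = post["url"]
        # The GPT-2 paper describes the cutoff as at least 3 karma.
        if post["karma"] >= min_karma and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


if __name__ == "__main__":
    posts = [
        {"url": "https://example.com/a", "karma": 12},
        {"url": "https://example.com/b", "karma": 1},
        {"url": "https://example.com/a", "karma": 5},
        {"url": "https://example.com/c", "karma": 3},
    ]
    print(select_outbound_links(posts))
    # ['https://example.com/a', 'https://example.com/c']
```

The appeal of the heuristic is that human votes do the curation, so no quality model has to be trained before the scrape.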

Today, there is an open-source clone of WebText, OpenWebText, developed by independent researchers that mimics the data collection process of WebText. In their version, better URL filtering and global fuzzy de-duplication were implemented to further increase the dataset quality.
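
One common way to implement fuzzy de-duplication is MinHash, which estimates how much two documents overlap without comparing them word by word. The sketch below is a minimal standard-library version under several assumptions: the word 3-gram shingles, the 64 hash functions, and the SHA-1 based hashing are illustrative choices, and the locality-sensitive hashing step that makes this scale to a whole corpus is omitted.

```python
import hashlib


def shingles(text, n=3):
    """Break a document into overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
    doc_c = "large language models are trained on enormous amounts of web text"

    sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
    print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
    print("unrelated:     ", estimated_jaccard(sig_a, sig_c))
```

A de-duplication pass would then drop one document from each pair whose estimated similarity exceeds a chosen threshold.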

Selecting the dataset is not the only crucial part of pre-training; so is the preprocessing that ensures the quality of said dataset.

Dataset Preprocessing

By the time GPT-3 was introduced about a year later, the roughly 8 million documents from WebText were simply not sufficient to satisfy the ambitions of the model. For GPT-3, the Common Crawl was large enough to be trained on without the model ever seeing the same sequence of tokens twice. However, as mentioned previously, models trained on this dataset, even when it is lightly filtered, perform worse than those trained on smaller, curated datasets.

The preprocessing steps taken to pre-train GPT-3 fall into three categories: filtration, deduplication, and diversification. Because most papers do not explicitly explain their preprocessing, later Large Language Models can only be assumed to follow the same general steps that GPT-3 took.

In the case of GPT-3, the authors performed filtration and de-duplication on the Common Crawl dataset and diversified it with other smaller, but much cleaner and curated datasets.

Specifically, the filtration and de-duplication involved the following steps:

  1. A classifier was developed to identify and prioritize high-quality documents within Common Crawl. This was achieved by training a logistic regression classifier to differentiate between high-quality documents (using WebText, Wikipedia, and a web books corpus as examples of high-quality content) and lower-quality, unfiltered Common Crawl documents.

  2. Documents were scored by the classifier and selected for inclusion in the dataset based on a specific condition involving a Pareto distribution, allowing for the inclusion of mainly high-scoring documents but also some that were out of the usual distribution to ensure variety.

  3. De-duplication was accomplished using MinHashLSH (MinHash with Locality Sensitive Hashing).

  4. This process also included the removal of WebText documents from the Common Crawl dataset to prevent duplication of content already considered high-quality.
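
The Pareto-based selection in step 2 can be sketched as follows. The GPT-3 paper's appendix describes keeping a document when `np.random.pareto(alpha) > 1 - document_score` with `alpha = 9`; the version below re-expresses that rule with Python's standard library, so the `keep_document` helper and the assumption that scores lie between 0 and 1 are illustrative rather than the authors' code.

```python
import random


def keep_document(score, alpha=9.0, rng=None):
    """Decide whether to keep a document given its classifier score in [0, 1].

    High-scoring documents are almost always kept; low-scoring ones are
    kept occasionally, which preserves some out-of-distribution variety.
    """
    rng = rng or random.Random()
    # random.paretovariate returns values >= 1, while numpy.random.pareto
    # returns values >= 0, so subtract 1 to match the NumPy-style rule.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


if __name__ == "__main__":
    rng = random.Random(0)
    for score in (0.95, 0.5, 0.1):
        kept = sum(keep_document(score, rng=rng) for _ in range(10_000))
        print(f"score={score}: kept {kept / 10_000:.1%} of documents")
```

Compared with a hard cutoff, the random draw lets a small share of low-scoring pages through, which is the variety mentioned in step 2.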

As for diversification, the pre-training of GPT-3 included other curated, tailored datasets such as a newer version of WebText, Books1, Books2, and the English Wikipedia.

From GPT-3 paper

The exact contents of “Books1” and “Books2” are not publicly known, and they are assumed to be a subset of books from the public domain. Wikipedia, on the other hand, has the highest contribution per word in the training of GPT-3.
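
To see how a small, curated dataset can contribute more per word than a huge crawl, consider a training mixture in which each source is sampled according to a fixed weight rather than its size. The corpus names, token counts, weights, and training budget below are hypothetical round numbers chosen for illustration, not the figures from the GPT-3 paper.

```python
import random

# Hypothetical corpus sizes (in tokens) and sampling weights.
CORPORA = {
    "web_crawl": {"tokens": 400e9, "weight": 0.60},
    "curated_web": {"tokens": 20e9, "weight": 0.22},
    "books": {"tokens": 60e9, "weight": 0.15},
    "encyclopedia": {"tokens": 3e9, "weight": 0.03},
}


def passes_over_each_corpus(corpora, training_tokens):
    """How many times each corpus is seen, on average, during training."""
    return {
        name: info["weight"] * training_tokens / info["tokens"]
        for name, info in corpora.items()
    }


def sample_source(corpora, rng):
    """Pick the corpus that the next training sequence is drawn from."""
    names = list(corpora)
    weights = [corpora[name]["weight"] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    for name, passes in passes_over_each_corpus(CORPORA, training_tokens=300e9).items():
        print(f"{name:>12}: seen about {passes:.2f} times")
    print("next sequence comes from:", sample_source(CORPORA, random.Random(0)))
```

With these numbers the small encyclopedia corpus is read several times over while less than half of the web crawl is ever seen, so each of its words carries far more weight in training.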

The Proprietary Nature of LLMs

Unfortunately, due to the competitive space of LLMs and the nature of LLMs themselves, most modern LLMs do not disclose the datasets that were used to pre-train them. Even for one of the more “open” models, Mistral 7B, the authors stated that “Unfortunately we’re unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field”.

At the same time, this goes to show the importance of training data quality and how much of an impact it can have on the capabilities of models. From the limited information available to the public, we can suspect that many of these high-performing models, big or small, may be trained on more than just the public internet.

In the technical report of GPT-4, the authors claimed that the pre-training was performed on “both publicly available data (such as internet data) and data licensed from third-party providers” while in the blog post of Phi-2, another small but powerful model, it was said that the training involved “a mixture of Synthetic and Web datasets for NLP and coding”.

As we move forward, the landscape of LLM training is poised for transformation. The emphasis on proprietary datasets is likely to persist, given the competitive advantages they confer.

The actual training mechanisms of LLMs may show incremental innovations, but the real evolution is happening in the realm of dataset development. Advances in data anonymization, synthetic data generation, and ethical sourcing are paving the way for datasets that not only enhance model performance but also adhere to the highest standards of privacy and fairness.

However, there is a gap between the power of the general public and that of larger corporations, and this gap is only set to widen.

This gap reflects a broader issue within the field of AI and LLM pre-training, where access to, and control over, vast and diverse datasets becomes a gatekeeper of innovation and advancement. As a result, the democratization of AI research and development is at risk, with smaller entities and independent researchers finding themselves increasingly sidelined by the resource and data monopolies held by large corporations.

This widening gap emphasizes the critical role that accessible, high-quality datasets play in the evolution of LLM pre-training, but the evolution may not be in the hands of the public.
