NVIDIA’s Dominance and the RTX 40 Series: The Growth of AI Hardware
The success of Artificial Intelligence (AI) today is not just thanks to algorithmic and software advancements; hardware also plays a vital role. Neural networks, for example, date back to the 1950s. In the 1980s, Geoffrey Hinton and his colleagues popularized the backpropagation algorithm, and between 1989 and 1998, Yann LeCun helped create Convolutional Neural Networks (CNNs) to recognize handwritten digits.
Although AI has been around for some time, these algorithms weren't considered practical for real-world applications because the hardware of the era couldn't meet their computational demands. This shortfall contributed to the AI Winter. Fortunately, that period ended thanks to hardware that could handle AI's computational requirements: the Graphics Processing Unit (GPU).
The GPU was built to process large amounts of data simultaneously, which made it extremely useful for computer graphics. Industries that demanded high-quality graphics, such as gaming, computer animation, and Computer-Aided Design (CAD), were the heaviest users of GPUs.
So, how did GPUs become the dominant hardware for AI as well? The short answer is that both computer graphics and AI involve enormous amounts of parallel computation, heavily relying on linear algebra operations. However, many more details go into this answer, which we will explain later in this article.
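To make this connection concrete, here is a minimal sketch (assuming PyTorch and a CUDA-capable GPU, neither of which the article prescribes) showing that a neural-network-style workload is essentially a large matrix multiplication, and that moving it to the GPU is a one-line change:

```python
import torch

# A batch of activations multiplied by a layer's weight matrix: the kind of
# linear algebra that dominates both graphics and neural networks.
batch = torch.randn(4096, 1024)
weights = torch.randn(1024, 1024)

out_cpu = batch @ weights  # executed on the CPU

if torch.cuda.is_available():
    # The same operation, executed across thousands of GPU threads in parallel.
    out_gpu = batch.cuda() @ weights.cuda()
```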
In this article, we will examine the evolution of the GPU to see which advancements and innovations made it the number one hardware choice for AI. Our journey begins in the early days of the GPU, when it was specialized for graphics, and follows how it became a more general-purpose computational tool.
The Early Days of GPUs
In the early days of computer graphics, the Central Processing Unit (CPU) handled all graphical processing. Because of their largely sequential processing nature, CPUs were very inefficient at this work. At first, the inefficiency could be overlooked because computer graphics weren't very intensive, but as graphics became more demanding, the need for a dedicated processor arose, leading to the birth of the GPU.
Initially, hardware dedicated to graphics processing wasn't called a GPU; instead, it was referred to as graphics cards or accelerators. The video game industry was the primary focus for these processors as games evolved from simple 2D to 3D.
In the 1990s, companies like 3dfx, ATI, and NVIDIA were major players in developing graphics processors. 3dfx had their Voodoo series, ATI had RAGE, and NVIDIA had RIVA. Alongside these hardware developments came APIs that interacted with them, including DirectX and OpenGL.
In 1999, NVIDIA introduced the GeForce 256, which it marketed as "the world's first GPU." This popularized the term GPU and ushered in a new era of graphics processing. The GeForce 256's major innovation was integrating Transform and Lighting (T&L) onto a single chip. It was the first in the GeForce series, a product line that continues to this day.
Going Beyond Graphics
In the 1990s, GPUs were considered specialized hardware for graphics. By the early 2000s, other domains also wanted parallel computation, and efforts were made to use GPUs as general-purpose processors, an approach known as General-Purpose computing on GPUs (GPGPU). These applications included scientific computing and High-Performance Computing (HPC).
However, building solutions for these other domains was extremely difficult: non-graphical computations had to be rewritten as graphics problems using libraries like OpenGL. GPUs were applied to Machine Learning (ML) problems during this period, but the programming complexity kept them from going mainstream.
In 2006, NVIDIA introduced the Compute Unified Device Architecture (CUDA), a parallel computing platform and API. CUDA simplified the development of GPGPU solutions: developers no longer needed a background in computer graphics to reap the benefits of parallel computing. NVIDIA also made its GPU cores more general-purpose, referring to them as CUDA cores.
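To illustrate how much CUDA simplified GPGPU programming, here is a minimal sketch of a GPU kernel written without touching a graphics API. It uses Numba's CUDA support in Python as a stand-in for CUDA C/C++ (Numba and an NVIDIA GPU with CUDA drivers are assumptions on our part):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global index of the current GPU thread
    if i < out.size:
        out[i] = a[i] + b[i]  # each thread handles one element

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Launch the kernel; Numba transfers the arrays to and from the GPU for us.
vector_add[blocks, threads_per_block](a, b, out)
```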
NVIDIA eventually shifted its focus to scientific computing, a bet that would prove successful in the coming years. CUDA was applied to various domains, such as simulation, Computer-Aided Design (CAD), Computational Fluid Dynamics (CFD), and other fields that required acceleration.
The Deep Learning Boom!
With the release of CUDA, several researchers began using GPUs for deep learning. Early applications ranged from simple Multi-Layer Perceptrons (MLPs) to Convolutional Neural Networks (CNNs), mostly trained on datasets like MNIST and NORB. Despite this work, these efforts never gained broad recognition from the wider ML community.
Fast-forward to 2012. AlexNet, a CNN trained on 2 NVIDIA GeForce GTX 580 GPUs, won the ImageNet competition, beating the second-place entry by a large margin and setting a new state-of-the-art (SOTA).
AlexNet rekindled interest in the field by showcasing what deep learning could do when run on GPUs.
Making GPUs More Specialized for AI
After AlexNet and the work that followed, it was clear that accelerated computing was the future of AI. NVIDIA started making big bets on AI, including developing GPUs specifically targeted at AI workloads.
Tensor Cores
NVIDIA introduced dedicated cores for AI, known as Tensor Cores, in its Volta microarchitecture. These cores have built-in features that accelerate the training of AI models. One of these features is mixed-precision computing, which performs most calculations in lower-precision formats such as FP16 while accumulating results at higher precision.
Official video by NVIDIA on Tensor Cores
Another feature of Tensor Cores is the fused multiply-add: multiplication and addition are combined into a single operation, since this combination is by far the most common in neural networks.
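To see how this is used in practice, here is a minimal sketch of mixed-precision training in PyTorch; autocast routes eligible matrix multiplications to Tensor Cores on GPUs that have them (the model, data, and hyperparameters below are placeholders of our own, not anything from the article):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients so small FP16 values don't underflow

inputs = torch.randn(512, 1024, device="cuda")
targets = torch.randn(512, 1024, device="cuda")

# Inside autocast, matmuls run in FP16 on Tensor Cores while results
# are accumulated at higher precision.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```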
Data Center GPUs
Training on consumer hardware like GeForce GPUs was no longer feasible as AI models grew. So, NVIDIA introduced dedicated GPUs for such high workloads, referring to them as NVIDIA Data Center GPUs. Here's a list of some of their most popular Data Center GPUs:
T4: The T4 GPU is based on the Turing microarchitecture and is mostly suitable for inference. It is famously used as the free-tier GPU in Google Colab.
P100: The P100 is based on the Pascal microarchitecture. The original Transformer model was trained using 8 of these GPUs.
V100: The V100 is based on the Volta microarchitecture and was one of the first GPUs to have a Tensor Core.
A100: The A100 is the GPU powering the current AI race; models like ChatGPT, LLaMA, and Stable Diffusion were trained on it. It is based on the Ampere microarchitecture.
AI Supercomputers
Despite their power, even individual data center GPUs could not handle the computation needed to train large models. For instance, the 65B variant of LLaMA was trained on 2,048 A100 GPUs. This demand led to the creation of AI supercomputers that contain many GPUs connected via PCIe or NVLink.
Connecting multiple GPUs requires very high communication speeds. NVIDIA supercomputers use InfiniBand for this communication, a technology so vital that NVIDIA acquired Mellanox, the company that produces it.
In 2016, NVIDIA launched the DGX-1, which it called the world's first AI supercomputer. It packed 8 data center GPUs, configurable with Pascal or Volta GPUs, alongside a combined 128 GB of High Bandwidth Memory (HBM). NVIDIA has continued releasing DGX systems built on its latest GPUs; the DGX A100, for example, connects 8 A100 GPUs.
NVIDIA built even more powerful supercomputers based on the DGX, one of which is Selene, which contains 4,480 A100 GPUs.
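Training across many GPUs like this is coordinated in software. Below is a minimal sketch using PyTorch's DistributedDataParallel, where gradient exchange runs over NCCL and rides on NVLink or InfiniBand when available (the tiny model and the launch setup are our own illustrative assumptions; you would start it with torchrun, one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK for each process it spawns (one per GPU).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(64, 1024, device="cuda")
loss = model(inputs).pow(2).mean()
loss.backward()       # the all-reduce over NCCL happens during backward
optimizer.step()

dist.destroy_process_group()
```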
Comparison Between GPUs and Other AI Hardware
Over the years, several new processors have emerged to accelerate AI training and inference. Here are a few of those processors.
Tensor Processing Units (TPU): TPUs are Application-Specific Integrated Circuits (ASICs) designed by Google to handle its AI workloads. Thanks to their specialized design, they can outperform GPUs on certain AI benchmarks.
Neural Processing Units (NPUs): NPUs are dedicated hardware components designed to accelerate AI tasks. They are specifically optimized for neural network operations, such as convolution, pooling, and normalization. Companies like Huawei, Samsung, and Apple have integrated NPUs into their smartphones and other edge devices to improve AI performance and power efficiency.
ASICs/FPGAs: Application-Specific Integrated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs) are two other types of hardware accelerators used for AI. ASICs are custom-designed chips optimized for a specific task, such as accelerating AI algorithms, while FPGAs are reconfigurable chips that can be programmed for specific AI tasks after manufacturing.
Language Processing Unit (LPU): The LPU, created by Groq, is accelerator hardware designed specifically for inference, where it offers a significant performance boost over GPUs.
The Present and Future of AI Hardware
We are probably approaching the end of Moore’s Law, but GPUs are currently experiencing a law of their own: Huang’s Law. Named after NVIDIA’s CEO, Jensen Huang, it states that the performance of GPUs will more than double every two years. Each new GPU NVIDIA introduces is far more powerful than its predecessor.
The RTX 40 Series
On October 12th, 2022, NVIDIA unveiled the GeForce RTX 4090, the first GPU in the GeForce RTX 40 series. It is currently the most powerful consumer graphics card on the market, with 16,384 CUDA cores, 512 Tensor Cores, and 24 GB of VRAM.
With the rise of large AI models, one might question the RTX 4090's ability to keep up. However, thanks to model optimization techniques such as quantization, it is possible to run large models like Llama 2 7B and 13B on the RTX 4090.
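As a rough illustration of that claim, here is a minimal sketch that loads a 7B-parameter model in 4-bit precision so its weights fit within the RTX 4090's 24 GB of VRAM. It assumes the Hugging Face transformers and bitsandbytes libraries; the model ID shown is illustrative (and gated behind a license agreement):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; requires accepting Meta's license

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 on Tensor Cores
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                     # place layers on the available GPU
)

prompt = tokenizer("GPUs became central to AI because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=32)[0]))
```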
Fine-tuning these models is also possible on the RTX 4090, thanks to Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA. For even larger models, combining two or more RTX 4090s provides the additional memory and compute needed to fit them for inference and fine-tuning.
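Building on the quantized model from the previous sketch, LoRA fine-tuning with the peft library adds a small set of trainable adapter weights on top of the frozen base model. The target module names below are typical for Llama-style models but are our assumption, not something the article specifies:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# `model` is the 4-bit model loaded in the previous sketch; only the LoRA
# adapter weights require gradients, which is what makes this fit on a 4090.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```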
The Hopper Architecture
H100 Introduction by NVIDIA CEO
The most powerful data center GPU currently shipping is the H100, featuring 80 billion transistors, 80 GB of HBM, and a Transformer Engine that accelerates the training of transformer models. The H100 can perform computations in FP8 format and significantly outperforms the A100. It is also available in a DGX form factor, the DGX H100.
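As an illustration of FP8 computation, here is a minimal sketch using NVIDIA's Transformer Engine library, whose layers can run their matrix multiplications in FP8 on Hopper-class Tensor Cores (the library, access to an H100-class GPU, and the layer sizes are assumptions made for this example):

```python
import torch
import transformer_engine.pytorch as te

# A drop-in replacement for torch.nn.Linear that supports FP8 execution.
layer = te.Linear(1024, 1024, bias=True).cuda()
inputs = torch.randn(512, 1024, device="cuda")

with te.fp8_autocast(enabled=True):  # matmuls run in FP8 where supported
    out = layer(inputs)

out.sum().backward()
```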
The Future is Blackwell
At GTC 2024, NVIDIA introduced Blackwell, its next-generation GPU microarchitecture. They unveiled the B200, the successor to the H100. Unlike NVIDIA’s previous architectures, the B200 consists of two dies merged to form a single chip, boasting 208 billion transistors and 192 GB of HBM. B200’s performance is 5x greater than the H100 during inference and 2.5x greater during training.
Blackwell represents the future of NVIDIA's datacenter GPUs and has set a new precedent for what is to come in the AI hardware space. The Blackwell architecture is expected to be released in late 2024.
The Growing Competition
NVIDIA has dominated the AI hardware space for quite some time, but it remains to be seen how long this will last. One of NVIDIA's rivals in the semiconductor space, AMD, is catching up through its Instinct lineup of server GPUs. Even its consumer lineup, Radeon, competes with the RTX 40 series at running lightweight models like Llama.
The proprietary nature of CUDA is also becoming a liability as open alternatives emerge, such as AMD’s ROCm, whose HIP programming layer can target both AMD and NVIDIA GPUs. Intel also has its own CUDA alternative, oneAPI, which, unlike CUDA, is hardware-agnostic.
Conclusion
AI hardware has significantly evolved since the days of AlexNet. Once centered around graphics acceleration, the GPU industry is now shifting its focus toward AI, dedicating massive resources to research, development, and infrastructure across the stack to accelerate ML workloads.
The AI hardware space has become a pillar of the AI revolution, as evidenced by the success of NVIDIA's stock and the number of companies rushing to create the next big solution built around this infrastructure.
AI is now a part of our daily lives more than ever. With this comes great expectations for what AI can do. Consumers don’t just want AI to be smart; they want it to be fast, easily accessible, and affordable. Many of these advancements will depend on the hardware we use to run these AI models, and we can expect more innovations as competition increases in the AI hardware space.
FAQ
What's the difference between GPUs and CPUs?
GPUs and CPUs are both processors, but GPUs are designed for parallel processing and are better suited for tasks that can be broken down into many smaller, independent parts, while CPUs are optimized for single-threaded performance and are better for tasks that require complex logic and decision-making.
What is Parallel Computing?
Parallel computing is a method of executing multiple tasks simultaneously to solve a larger problem. It involves breaking a large computational task into smaller, independent parts that can be processed at the same time on different processors or cores.
What does it mean for computation to be embarrassingly parallel?
A computation is called "embarrassingly parallel" if it can be divided into independent parts with little to no effort required to coordinate them. Much of the computation inside neural networks, such as applying the same operation to every element of a matrix, is embarrassingly parallel, which is why neural networks run so efficiently on GPUs.
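As a small illustration, here is a minimal sketch of an embarrassingly parallel job using only Python's standard library: each chunk of work is independent, so it can be handed to separate workers with no coordination between them (the chunk size and worker pool here are arbitrary choices):

```python
from concurrent.futures import ProcessPoolExecutor

def square_chunk(chunk):
    # No chunk depends on any other, so the work parallelizes trivially.
    return [x * x for x in chunk]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

    with ProcessPoolExecutor() as pool:
        results = list(pool.map(square_chunk, chunks))  # chunks processed on separate CPU cores
```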
What is CUDA?
NVIDIA created CUDA, a parallel computing platform that enables its GPUs to perform general-purpose computing tasks in addition to graphics rendering.
What’s the difference between CUDA cores and Tensor cores?
CUDA cores are NVIDIA's general-purpose processing cores for various computing tasks. Tensor cores, on the other hand, are specialized cores designed specifically for handling AI workloads through mixed-precision computing.