Article·AI Trends & News·Jan 17, 2025

NVIDIA, RTXs, H100, and more: The Evolution of GPU

This article takes you on a journey through time, exploring the progression of AI hardware from its theoretical beginnings to the cutting-edge technologies that power today’s AI boom.

By Zian (Andy) Wang, AI Content Fellow

Last Updated: Jan 17, 2025

Artificial intelligence (AI) has undergone a remarkable evolution since its inception, and although most of the buzz is around the ingenious algorithms and models that researchers have come up with, hardware played a pivotal role as well. This article takes you on a journey through time, exploring the progression of AI hardware from its theoretical beginnings to the cutting-edge technologies that power today’s AI boom.

Perceptrons and the Mark I Perceptron

The concept of neural networks emerged in the 1940s with the pioneering work of Warren McCulloch and Walter Pitts. In their seminal 1943 paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity", they proposed a model of artificial neurons capable of computing logical functions. This theoretical framework laid the groundwork for Frank Rosenblatt’s subsequent work on the perceptron.

Frank Rosenblatt, an American psychologist and computer scientist, developed the first perceptron model at the Cornell Aeronautical Laboratory in the late 1950s. His research culminated in the Mark I Perceptron, an early pattern-recognition machine funded by the U.S. Office of Naval Research. Rosenblatt envisioned a system that could mimic the brain’s learning process by adjusting the weights of its connections based on input data. The Mark I Perceptron was one of the first pieces of hardware built specifically for AI computing.

Despite these promising advances, the perceptron model had significant limitations. A single-layer perceptron can only solve linearly separable problems, a limitation exemplified by the classic XOR problem (we now know that hidden layers combined with non-linear activation functions were the missing key).
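The linear-separability limit is easy to reproduce. Below is a minimal sketch of a single-layer perceptron trained with a Rosenblatt-style update rule. The function name, learning rate, and epoch count are illustrative assumptions and not a description of how the Mark I was built.

```python
# Minimal sketch of a single-layer perceptron with a step activation.
# Names and hyperparameters are illustrative assumptions.

def train_perceptron(samples, epochs=50, lr=1):
    """Train on ((x1, x2), target) pairs; return weights and last-epoch mistakes."""
    w1, w2, bias = 0, 0, 0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            # Step activation: fire only if the weighted sum is positive.
            output = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            delta = target - output
            if delta != 0:
                # Nudge the weights toward the correct answer.
                errors += 1
                w1 += lr * delta * x1
                w2 += lr * delta * x2
                bias += lr * delta
    return (w1, w2, bias), errors

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, and_errors = train_perceptron(AND)
_, xor_errors = train_perceptron(XOR)
print("AND mistakes in final epoch:", and_errors)
print("XOR mistakes in final epoch:", xor_errors)
```

AND is linearly separable, so the updates settle on weights that classify every input correctly. No single line separates the XOR outputs, so at least one input is misclassified in every epoch no matter how long training runs.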

The Logic Theorist and the JOHNNIAC Computer

Neural models were not the only line of early AI research. Researchers also explored symbolic programs that could emulate human reasoning, although they were constrained by the computing power available at the time.

In 1956, Allen Newell, Herbert A. Simon, and Cliff Shaw developed the Logic Theorist, often regarded as the first AI program. The Logic Theorist was designed to prove mathematical theorems using symbolic logic.

It ran on the RAND Corporation’s JOHNNIAC computer, an early machine modeled after the Princeton Institute for Advanced Study’s IAS computer. The program successfully proved 38 of the first 52 theorems in Principia Mathematica, marking a significant achievement in artificial intelligence. However, its impact was limited due to the high computational costs of symbolic reasoning and the lack of generalization capabilities.

CPU Optimizations and Implementations

Fast-forward to the late 2000s and early 2010s: although interest in AI had not fully returned to its earlier peak, research activity was growing again, partly due to advancements in computing hardware.

Before Graphics Processing Units (GPUs) became popular in machine learning, researchers relied heavily on Central Processing Units (CPUs) to train their models.

In 2011, Vanhoucke et al. from Google demonstrated how a carefully optimized fixed-point arithmetic implementation could significantly improve neural network performance on x86 CPUs. By replacing floating-point arithmetic with fixed-point arithmetic, they achieved a threefold speedup over a state-of-the-art floating-point system on CPU workloads. This optimization leveraged the lower computational overhead of fixed-point arithmetic, allowing for faster neural network computation.
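To make the idea concrete, here is a minimal sketch of a dot product computed with 8-bit integers and a single rescale at the end. The symmetric scaling scheme and the helper names `quantize` and `fixed_point_dot` are assumptions chosen for illustration; they are not the exact method from the paper.

```python
# Minimal sketch of fixed-point (integer) arithmetic for a dot product.
# The quantization scheme is an illustrative assumption.

def quantize(values, num_bits=8):
    """Map floats to signed integers that share one scale factor."""
    max_int = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = max_abs / max_int
    return [round(v / scale) for v in values], scale

def fixed_point_dot(a, b):
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    # The inner loop uses integer multiplies and adds only.
    acc = sum(x * y for x, y in zip(qa, qb))
    # One floating-point rescale recovers the real-valued result.
    return acc * scale_a * scale_b

weights = [0.42, -1.30, 0.07, 0.88]
inputs = [1.50, 0.25, -0.60, 0.10]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = fixed_point_dot(weights, inputs)
print(f"float: {exact:.4f}  fixed-point: {approx:.4f}")
```

The integer result differs slightly from the floating-point one because each value is rounded to one of 255 levels. The trade is a small accuracy loss in exchange for cheaper arithmetic in the inner loop.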

Other techniques used to improve CPU performance included SIMD vector instructions, which apply one operation to several floating-point values at once, and cache optimizations.

It should be noted that the computing advantage of GPUs was recognized well before 2011. In fact, Vanhoucke et al.’s benchmark showed that GPUs outpaced every CPU optimization in terms of speed. But this was before the widespread adoption and personal ownership of GPUs for machine learning, so advanced CPU optimizations still provided tremendous value.

The Popularization of GPU

One of the first publications to demonstrate the computational advantage of GPUs came in 2005, when D. Steinkraus et al. published “Using GPUs for machine learning algorithms” and reported a threefold speedup over CPU training on a two-layer fully connected network. A year later, Chellapilla et al. showed that GPU acceleration could be applied to Convolutional Neural Networks as well.

However, it wasn’t until the late 2000s and early 2010s that the popularity of GPUs exploded due to the advent of general-purpose GPUs.

Prior to general-purpose GPUs, graphics cards were dedicated chips specializing only in rendering subroutines. Throughout the 1990s, major developments in graphics hardware, such as 3dfx’s Voodoo 1 and 2, were mostly dedicated to improving 3D rendering performance for gaming. Towards the turn of the new century, a then-small graphics card company called NVIDIA released the GeForce 256 DDR, coining the term “GPU” while boasting “a single chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second”.

In the early 2000s, researchers discovered that GPUs were suited not only to rendering graphics but also to general computing. In 2003, two research groups, Bolz et al. and Krüger et al., independently showed that solutions to general linear algebra problems could be computed faster on GPUs than on CPUs. This was then followed by speedups specifically geared towards machine learning in the publications by Steinkraus et al. and Chellapilla et al.

Despite the advantages that GPUs provided, these efforts required researchers to formulate computational problems in terms of graphics primitives, primarily by programming in OpenGL’s GLSL or Direct3D’s HLSL. These languages were cumbersome for this work because they were designed for rendering, not general computation.

Introduction of CUDA

NVIDIA recognized the need for a simpler programming model for general-purpose computing on GPUs and launched CUDA (Compute Unified Device Architecture) in 2006. CUDA was a game-changer in the parallel computing world because it let developers harness the power of GPUs for a broad range of tasks, not just graphics rendering.

Early CUDA Hardware and Architecture

CUDA’s launch coincided with NVIDIA’s Tesla architecture, which introduced the unified shader model, enabling all GPU cores to execute general-purpose computations efficiently. The first GPU to feature CUDA was the NVIDIA GeForce 8800 GTX, which had 128 unified shaders, delivering unprecedented parallel computing power. This GPU, paired with the CUDA platform, enabled developers to harness its parallelism for tasks beyond rendering.

The Tesla architecture brought several key features:

Unified Shader Architecture: Allowed all cores to be used interchangeably for vertex or pixel processing.
SIMT Execution Model: Introduced the Single Instruction, Multiple Thread (SIMT) model, where threads executed the same instruction simultaneously.
Scalable Parallel Execution: Provided a hierarchical execution model using threads, blocks, and grids.
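The thread, block, and grid hierarchy can be sketched without a GPU. The snippet below emulates a one-dimensional kernel launch sequentially in Python. The `launch` helper and the `vector_add` kernel are hypothetical names used only for illustration, and a real GPU would run these threads in parallel rather than in nested loops.

```python
# Minimal sketch: emulate CUDA's grid/block/thread indexing sequentially.
# The loops only illustrate how each thread finds its own element.

def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, like a 1D kernel launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    # Same formula as blockIdx.x * blockDim.x + threadIdx.x in CUDA C.
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard threads that fall past the end of the data
        out[i] = a[i] + b[i]

n = 10
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
launch(vector_add, grid_dim, block_dim, a, b, out)
print(out)
```

Every thread runs the same function and differs only in the index it computes, which is the essence of the SIMT model. The bounds check matters because the grid is rounded up to whole blocks, so the last block usually contains a few idle threads.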

Early adopters of CUDA quickly realized the potential of GPUs for machine learning. As mentioned previously, GPUs are highly performant at linear algebra, and fast matrix multiplication accounts for the bulk of the computation in machine learning.

In 2009, Raina et al. published "Large-Scale Deep Unsupervised Learning Using Graphics Processors," showcasing a 70x speedup in training Restricted Boltzmann Machines (RBM) compared to traditional CPUs using NVIDIA GTX 280 GPUs.

A year later, Ciresan et al. demonstrated the power of GPU-accelerated deep multilayer perceptrons (MLPs) in "Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition." They achieved a 60x speedup using CUDA on NVIDIA GTX 295 GPUs, setting a new benchmark in handwritten digit recognition. Ciresan et al. credited the speedup that GPUs provided, writing that “All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning”.

The Tesla GPU series, coupled with CUDA, laid the groundwork for general-purpose GPU computing, providing a consistent programming model and memory hierarchy.

The Fermi Architecture

In 2010, NVIDIA released the Fermi architecture, bringing significant improvements to GPU computing and further cementing the use of GPUs in machine learning research.

Specifically, the Fermi architecture provided several key improvements over its predecessor:

ECC Memory Support: Enabled error detection and correction, crucial for high-performance computing (HPC) tasks.
L1/L2 Cache Hierarchy: Introduced a two-level cache hierarchy to reduce memory latency significantly.
Concurrent Kernel Execution: Allowed multiple kernels to execute simultaneously, improving resource utilization.
Parallel DataCache: Enhanced shared memory and L1 cache, providing better performance for data-intensive tasks.

These features made Fermi-based GPUs highly attractive for machine learning, high-performance computing, and scientific computing.

In 2011, Ciresan et al. again showcased the power of GPU computing with the paper "High-Performance Neural Networks for Visual Object Classification," utilizing Fermi-based GPUs (NVIDIA GTX 480) to achieve state-of-the-art results in image classification.

HPC and Data Center Adoption

The Fermi architecture’s reliability and performance quickly attracted the attention of high-performance computing (HPC) researchers and data center operators. Its ECC memory support, large memory bandwidth, and parallel computing capabilities made it suitable for large-scale computations.

The Lawrence Livermore National Laboratory became one of the earliest adopters of Fermi GPUs for HPC research. They utilized NVIDIA Tesla C2050 GPUs for scientific computing and simulations.

At this point, cloud services such as Amazon Web Services (AWS) and Microsoft Azure introduced GPU instances with NVIDIA Tesla GPUs for customers seeking accelerated computing.

The Kepler Architecture

Following the groundbreaking success of the Fermi architecture, NVIDIA continued to push the boundaries of GPU computing with the introduction of the Kepler architecture in 2012. Kepler built upon Fermi’s foundation, making significant improvements that captured the attention of machine learning researchers and industry professionals alike.

The Kepler architecture featured several innovations that made GPUs more efficient and better suited for high-performance computing tasks.

Dynamic Parallelism: Enabled GPU threads to spawn additional threads on their own, reducing CPU involvement and streamlining parallel workloads.
Hyper-Q Technology: Allowed multiple CPU cores to issue work to a single GPU concurrently, significantly improving GPU utilization.
Energy Efficiency: Kepler GPUs boasted a substantial increase in energy efficiency, delivering high computational power without excessive power consumption.

The flagship Kepler GPU, the NVIDIA Tesla K20, quickly became a favorite among machine learning researchers. The architecture’s improvements translated into remarkable real-world performance.

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton revolutionized computer vision with AlexNet, a deep convolutional neural network trained using two NVIDIA GTX 580 GPUs, which were based on the earlier Fermi architecture.

AlexNet achieved a top-5 error rate of 15.3% in the ImageNet competition, leaving competitors in the dust and demonstrating the immense power of GPU-accelerated deep learning. Their model had over 60 million parameters and the acceleration of GPUs made training feasible. AlexNet became one of the most influential papers in computer vision of all time, accumulating over 120,000 citations to date.

By the Kepler era, GPUs were firmly established as essential tools in machine learning research, but NVIDIA wasn’t done yet.

The Pascal Architecture and the Deep Learning Boom

In 2016, NVIDIA introduced the Pascal architecture, which further cemented GPUs’ place at the forefront of machine learning research. With a clear focus on deep learning, the Pascal architecture was a significant leap forward.

The NVIDIA Tesla P100 was the flagship Pascal GPU designed for deep learning. It introduced the following improvements:

NVLink: Provided faster data transfer between GPUs than traditional PCIe connections, a game-changer for multi-GPU training.
High Bandwidth Memory (HBM2): Offered dramatically higher memory bandwidth, essential for processing large datasets quickly.
16-bit Floating Point (FP16): Support for half-precision operations doubled throughput for certain tasks, enabling faster training, often with little loss in accuracy.
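The FP16 trade-off in the list above can be checked with Python's standard `struct` module, which supports IEEE 754 half precision through the `e` format code. This is a minimal sketch with helper names (`to_fp16`, `to_fp32`) chosen for illustration. It shows storage size and rounding error only and says nothing about GPU throughput.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 0.1
fp16_error = abs(to_fp16(value) - value)
fp32_error = abs(to_fp32(value) - value)

fp16_bytes = struct.calcsize("<e")  # bytes needed to store one FP16 number
fp32_bytes = struct.calcsize("<f")  # bytes needed to store one FP32 number

print(f"FP16: {fp16_bytes} bytes per value, rounding error {fp16_error:.2e}")
print(f"FP32: {fp32_bytes} bytes per value, rounding error {fp32_error:.2e}")
```

Halving the bytes per value means twice as many numbers move through the same memory bandwidth, which is where much of the speedup comes from. The cost is a coarser grid of representable values, so some workloads need extra care to stay accurate.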

The Tesla P100 GPU quickly became the go-to choice for machine learning researchers and data centers. In fact, the Tesla P100 is still one of the better low-cost options for cloud computing in terms of the speed and memory capacity it offers. For example, the Kaggle platform provides its users with the Tesla P100 GPU to speed up computations.

The Pascal architecture enabled researchers to train deeper networks and handle larger datasets, fueling a new wave of breakthroughs in computer vision and natural language processing.

Volta Architecture and Tensor Cores: The First Dedicated AI GPU

In 2017, NVIDIA unveiled the Volta architecture, a seismic shift in GPU computing. It introduced the first GPUs dedicated to AI (rather than general computation alone) for HPC data centers, with the release page bearing the bold title “WELCOME TO THE ERA OF AI”.

With the V100, NVIDIA introduced tensor cores alongside the existing CUDA cores. Tensor cores are specialized hardware units designed to accelerate mixed-precision tensor operations, which are crucial for deep learning.

NVIDIA also introduced NVLink 2.0, the second generation of the interconnect that lets multiple GPUs communicate and work together, enabling faster multi-GPU training.

The flagship Tesla V100 GPU delivered up to 125 TFLOPS, or 125 trillion floating-point operations per second, of deep learning performance, marking a major step in AI hardware evolution.

The addition of tensor cores enabled mixed-precision training with FP16 computations while maintaining FP32 accuracy, allowing unprecedented training speeds.
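Why accumulate in FP32 when the inputs are FP16? The following minimal sketch simulates both choices in software using the standard `struct` module. The function names and the test data are assumptions made for illustration, and the loops model only the rounding behavior, not how tensor cores are implemented.

```python
import struct

def to_fp16(x):
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_fp16_accumulate(a, b):
    """FP16 inputs and an FP16 running sum."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        total = to_fp16(total + product)
    return total

def dot_mixed_precision(a, b):
    """FP16 inputs with an FP32 running sum (the mixed-precision pattern)."""
    total = 0.0
    for x, y in zip(a, b):
        total = to_fp32(total + to_fp16(x) * to_fp16(y))
    return total

a = [0.01] * 10_000
b = [1.0] * 10_000  # the exact dot product is 100

fp16_result = dot_fp16_accumulate(a, b)
mixed_result = dot_mixed_precision(a, b)
print("FP16 accumulator:", fp16_result)
print("FP32 accumulator:", mixed_result)
```

With an FP16 accumulator, the running sum eventually grows so large that each new 0.01 is smaller than half the gap between neighboring representable values, and the total stops increasing far short of 100. Keeping the sum in FP32 avoids that stall while the inputs stay in the cheaper format.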

The V100 also brought a large leap in memory capacity. As one of the first GPUs built with machine learning in mind, it packed up to 32 gigabytes of VRAM at a time when most consumer-grade GPUs had around 12. The extra memory allowed larger models to fit on a single GPU, saving the money and time otherwise spent parallelizing across multiple GPUs.

Introduction of Tensor Processing Units (TPUs)

Parallel to the developments in GPU technology, Google embarked on a different path by developing custom hardware tailored specifically for machine learning tasks. After deep learning exploded in popularity, and following its acquisition of DeepMind and its hiring of leading researchers such as Geoffrey Hinton, Alex Krizhevsky, and Ilya Sutskever, Google faced an urgent need for more computing power than NVIDIA GPUs could supply at the time.

Unlike GPUs, which are general-purpose parallel processors, TPUs are application-specific integrated circuits (ASICs) designed specifically for tensor computations, making them highly efficient for certain types of deep learning tasks.

The conceptual foundation for TPUs was built upon the idea of systolic arrays, first proposed by H. T. Kung and Charles E. Leiserson in the late 1970s. These arrays, designed to rhythmically compute and pass data between neighboring cells, were well suited to the matrix operations central to machine learning algorithms.

The simplicity, regularity, and efficiency of systolic arrays made them ideal for VLSI (Very Large Scale Integration) technology, enabling Google to implement them in TPUs to optimize operations like matrix multiplication—a staple in neural networks.
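A small simulation helps show how a systolic array multiplies matrices. The sketch below models an output-stationary design in which each processing element (PE) keeps one running sum while operands arrive in a skewed, clock-by-clock order. The layout and the `systolic_matmul` name are assumptions for illustration and do not describe Google's actual TPU circuitry.

```python
# Minimal sketch of an output-stationary systolic array computing C = A x B.
# PE (i, j) accumulates C[i][j]. Inputs are skewed so that A[i][k] and
# B[k][j] meet at PE (i, j) on clock tick t = i + j + k.

def systolic_matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]
    total_ticks = n + m + p - 2  # ticks until the last operands meet
    for t in range(total_ticks):
        for i in range(n):
            for j in range(p):
                k = t - i - j  # which operand pair reaches this PE now
                if 0 <= k < m:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc, total_ticks

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, ticks = systolic_matmul(A, B)
print(C, "computed in", ticks, "clock ticks")
```

On each tick, every PE performs at most one multiply-accumulate using values handed over by its neighbors, so no PE needs to fetch operands from main memory. That regular, local data movement is what makes the design efficient to build in silicon.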

Once operational, TPUs quickly demonstrated their value, significantly accelerating tasks across a variety of Google’s core services, from search engine algorithms to powering innovations like Google Photos. Perhaps most famously, TPUs were a key component behind the success of AlphaGo, the DeepMind AI that defeated the world Go champion, Lee Sedol. The efficiency and speed of TPUs allowed AlphaGo to “think” faster and analyze more moves ahead, making it a formidable opponent.

Google’s TPUs remained an internal secret until May 2016, when Sundar Pichai made the formal announcement at the Google I/O conference, revealing that TPUs had been enhancing Google’s data centers for over a year. This revelation underscored Google’s commitment to integrating AI into its ecosystem and marked a significant step in public cloud infrastructure, offering external developers access to this powerful technology.

The open-source library TensorFlow developed by Google is integrated with TPU acceleration capabilities out of the box, allowing users to use the power of TPUs in a few simple lines of code. Furthermore, the Kaggle platform provides its registered users with 30 hours of free TPU usage every week to facilitate model training on their platform.

Fast forward to today, Google’s TPUs are on their 4th and 5th iterations, providing blazing speed for large-scale training and inference. Specifically, each TPUv4 chip can deliver up to 275 TFLOPS, and the chips are grouped into TPU “pods” of 4096 TPUv4s, providing 10 times the bandwidth per chip compared to clustered GPU strategies.
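For a sense of scale, the per-chip and per-pod figures above can be combined with simple arithmetic. This is a back-of-the-envelope sketch that assumes every chip runs at its quoted peak, which real workloads rarely sustain.

```python
# Theoretical peak of one TPUv4 pod, using the figures quoted above.
TFLOPS_PER_CHIP = 275
CHIPS_PER_POD = 4096

pod_tflops = TFLOPS_PER_CHIP * CHIPS_PER_POD
pod_exaflops = pod_tflops / 1_000_000  # 1 exaFLOPS = 1,000,000 TFLOPS

print(f"Peak per pod: {pod_tflops:,} TFLOPS (about {pod_exaflops:.2f} exaFLOPS)")
```

The result is a little over one exaFLOPS of theoretical peak per pod.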

Modern GPUs and the Age of Personal Compute

In the late 2010s and early 2020s, amid the COVID pandemic and an explosion of machine learning advancements, NVIDIA hardware came to dominate the GPU space, with dozens of GPU releases in a span of just 5-6 years. Here’s a rapid-fire list of notable GPUs released during that time.

NVIDIA Titan V (2017)

The Titan V was NVIDIA’s first Volta-based GPU aimed at desktop PCs, boasting impressive compute performance and 12GB of HBM2 memory. While not explicitly designed for gaming, the Titan V offered a glimpse into the future of GPU computing, making it an attractive option for researchers and enthusiasts alike.

NVIDIA Titan RTX (2018)

Building upon the success of the Titan V, the Titan RTX introduced real-time ray tracing capabilities to the Titan lineup. With its 24GB of GDDR6 memory and advanced RT cores, the Titan RTX catered to professionals working in fields such as scientific visualization, rendering, and AI development.

To this day, the Titan RTX remains one of the best consumer GPU options in terms of performance and memory capacity, with 576 tensor cores, 130 TFLOPS of performance, and 24 GB of memory, making it great for both professionals and gamers.

NVIDIA A100 (2020)

The A100 marked a significant leap in NVIDIA’s data center offerings, featuring the Ampere architecture and introducing multi-instance GPU (MIG) support. With 40GB of HBM2 memory and up to 312 TFLOPS of performance, the A100 became a go-to choice for large-scale AI training and HPC workloads.

NVIDIA H100 (2022)

At its launch, the H100 was NVIDIA’s most powerful data center GPU. Featuring the Hopper architecture and a Transformer Engine optimized for AI workloads, the H100 offers 80GB of HBM3 memory and up to 989 TFLOPS of performance.

While its performance is formidable, the H100 comes at a premium price and is primarily targeted at data centers and large-scale research facilities, where its large memory suits large-scale training and Large Language Model inference.

The NVIDIA H100 remains one of the top choices for large companies training models at scale.

NVIDIA L4 (2023)

On the other end of the spectrum, the NVIDIA L4 is a specialized GPU designed for cloud gaming and media streaming applications.

With its low power consumption and optimized architecture, the L4 aims to deliver a seamless streaming experience while minimizing the hardware footprint in data centers.

The L4 also comes with 24 gigabytes of memory, considerably more than GPUs of comparable compute, most of which offer only 16.

Consumer GPUs for Gaming and Compute

While NVIDIA’s data center offerings cater to the most demanding workloads, the company has also consistently pushed the boundaries of consumer graphics performance, catering to gamers and personal computational uses alike.

NVIDIA RTX 20 Series (2018-2019)

The RTX 20 series marked NVIDIA’s introduction of real-time ray tracing capabilities to consumer graphics cards. Led by the flagship RTX 2080 Ti, this series delivered a significant performance uplift over the previous generation, albeit with increased power consumption. While ray tracing performance was initially limited, the RTX 20 series paved the way for more immersive and realistic graphics in games.

NVIDIA RTX 30 Series (2020-2021)

Building upon the foundations laid by the RTX 20 series, the RTX 30 series brought substantial performance improvements, better ray tracing capabilities, and a more mature version of NVIDIA’s Deep Learning Super Sampling (DLSS) technology, which had debuted with the RTX 20 series. From the entry-level RTX 3060 to the flagship RTX 3090, this series offered a compelling blend of performance and features for gamers and content creators alike.

NVIDIA RTX 40 Series (2022-2023)

NVIDIA’s latest consumer graphics lineup, the RTX 40 series, powered by the Ada Lovelace architecture, has raised the bar once again. With impressive performance gains, improved ray tracing and DLSS capabilities, and cutting-edge features like Frame Generation, the RTX 40 series caters to the most demanding gamers and content creators. However, this power comes at a cost, with high power consumption and premium pricing, particularly for the higher-end models.

At the high end of the RTX 40 series, the RTX 4090 offers 24 GB of VRAM and over 80 TFLOPS of performance. This general-purpose GPU became a top choice not only for gaming but also for local deep learning projects.

Conclusion

Of course, it is impossible to cover the entirety of AI hardware’s history, but hopefully this article provided a glimpse of how GPUs and AI hardware came to be. Although the field is dominated by NVIDIA hardware, there are many other competitors in the hardware space, each serving its own specialized purpose, including but not limited to Apple silicon with its MPS acceleration framework, Microsoft’s custom Maia 100 and Cobalt 100 chips, and AWS’s Trainium chips.

As the landscape of AI hardware continues to evolve, these innovations promise to drive further breakthroughs, making advanced computing more accessible and impactful across various sectors.
