This past year has seen a Cambrian explosion of sorts in the Artificial Intelligence space because of constant breakthroughs. LLMs, multimodal models, copilot interfaces, and many more are pushing the envelope regarding what this new wave of AI is capable of and how accessible it’s becoming to us. These breakthroughs have transformed the landscape of AI applications and broader machine learning research. At the core, all of these advancements are possible because of the rigorous training and inference performed by the underlying hardware that supports these models’ computations. 

A major catalyst in the evolution of AI/ML has been the adoption of Graphics Processing Units (GPUs) for training models. Initially designed for rendering graphics in video games, GPUs possess parallel processing capabilities that can be harnessed to accelerate AI computations. As the demand for AI/ML training surged, GPUs became a game-changer, enabling researchers and data scientists to experiment with larger datasets and more complex architectures because of their superior computing capabilities.

The positive correlation between progress in computing hardware and advancements in AI/ML is quite evident. Adding more GPUs to a workload reduces training time, increases inference throughput, and paves the way for developing more sophisticated language models. There’s almost a new Moore-esque law with these chips and large models, wherein the performance of AI systems grows exponentially with the proliferation of GPUs, TPUs (tensor processing units), and other new hardware advancements. Traditionally, hardware was primarily optimized for general-purpose tasks, focusing on data management, latency, and various basic computing operations. To most, it was merely about compiling code and executing the resulting instructions in the CPU's arithmetic logic unit (ALU). However, the advent of AI/ML advancements and the valuable utility of the GPU has changed this perception.

With AI/ML becoming integral to various industries and applications, understanding the hardware infrastructure that underpins these systems becomes crucial. In a hypothetical world, it would be great to simply stockpile GPUs and push training speed to the limit (barring the massive amount of energy used) to mass-produce original large models. However, GPUs are the current bottleneck of the broader AI/ML market. There is a computing shortage: demand for GPU compute is hockey-sticking and outpacing the physical ability of providers (chiefly Nvidia) to meet it. To further understand the implications of this model training race, let's dive into the role of hardware in building AI/ML systems.

The Backbone of Training and Inference

AI/ML hardware architecture is critical in enabling AI/ML systems today. A robust hardware architecture is necessary to train on large datasets efficiently and to make fast inferences. In its simplest form, machine learning reduces to fundamental linear algebra and calculus. Training a model involves multiplying data matrices and adjusting model weights by calculating gradients that reduce the model’s prediction error, or loss. It’s high school math on steroids - just a little bit of chain rule and matrix multiplication.
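To make the "chain rule and matrix multiplication" point concrete, here is a minimal sketch of training a single linear layer with gradient descent. It uses plain Python lists so that every multiply and add is visible. The function names, the tiny dataset, and the learning rate are illustrative assumptions, not taken from any particular framework.

```python
# One linear layer trained by gradient descent, written with plain lists
# so that each arithmetic operation is explicit.

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def train_step(weights, inputs, target, learning_rate):
    """Run a forward pass, then move the weights against the gradient."""
    prediction = matvec(weights, inputs)               # forward pass
    error = [p - t for p, t in zip(prediction, target)]
    loss = sum(e * e for e in error)                   # squared-error loss
    # Chain rule: d(loss)/d(weights[i][j]) = 2 * error[i] * inputs[j]
    new_weights = [
        [w - learning_rate * 2 * e * x for w, x in zip(row, inputs)]
        for row, e in zip(weights, error)
    ]
    return new_weights, loss


# Hypothetical toy problem: learn weights that map `inputs` to `target`.
weights = [[0.0, 0.0], [0.0, 0.0]]
inputs = [1.0, 2.0]
target = [3.0, -1.0]

losses = []
for _ in range(20):
    weights, loss = train_step(weights, inputs, target, learning_rate=0.05)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

Real models repeat this same pattern across millions or billions of weights and many batches of data, which is where the hardware demands come from.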

Deep learning models such as neural networks also apply nonlinear activation functions (such as ReLU, sigmoid, and hyperbolic tangent) to introduce non-linearity into the network, enabling it to learn and approximate complex, non-linear relationships between inputs and outputs. Without non-linear activation functions, the network would behave like a linear model, regardless of its depth, making it incapable of solving more intricate tasks.
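The claim that depth alone does not help can be checked with a small sketch. Two stacked linear layers produce the same output as a single layer whose weight matrix is their product. Inserting a ReLU between them breaks that equivalence. The weight values below are arbitrary numbers chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Rectified linear unit: negative values become zero."""
    return [max(0.0, v) for v in vector]


# Arbitrary example weights for a two-layer network.
w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

stacked_linear = matvec(w2, matvec(w1, x))   # two layers, no activation
collapsed = matvec(matmul(w2, w1), x)        # one equivalent layer
with_relu = matvec(w2, relu(matvec(w1, x)))  # non-linearity in between

print(stacked_linear, collapsed, with_relu)
```

The first two results are identical, while the ReLU version differs, because the activation zeroes out the negative intermediate value before the second layer sees it.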

Performing inference boils down to passing vectors of inputs through the model's layers so that they can be multiplied by the model's weights to generate a concrete output - more matrix multiplication. These computations might seem simple on paper; after all, it’s arithmetic everyone has learned in high school. However, at scale, the computational cost increases dramatically. With the standard algorithm, multiplying two N×N matrices is an O(N^3) operation, meaning it runs in cubic time: doubling N multiplies the work by roughly eight. For practical AI/ML systems that ingest a vast amount of data and are invoked many times concurrently, this cost compounds. This means a lot of arithmetic operations for an ALU to handle. Hardware for AI/ML needs to be optimized for training and using these systems.
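The cubic growth is easy to observe by counting operations. The sketch below implements the textbook triple-loop matrix multiplication and tallies one multiply-add per inner iteration. The helper names are illustrative, and faster algorithms exist, but this is the version the O(N^3) figure refers to.

```python
def matmul_with_count(a, b):
    """Textbook matrix multiplication that also counts multiply-adds."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    multiply_adds = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return result, multiply_adds


def identity(n):
    """Build an n x n identity matrix to use as simple test input."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Count the work for a few square sizes.
counts = {n: matmul_with_count(identity(n), identity(n))[1] for n in (2, 4, 8, 16)}
for n, count in counts.items():
    print(f"N={n:>2}: {count} multiply-adds")
```

Each doubling of N multiplies the count by eight, which is why even modest growth in layer width translates into a large jump in arithmetic.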

Specialized hardware can accelerate these computations. For example, Apple's M1 and M2 chips each contain a 16-core Neural Engine designed for machine learning workloads, with the M2's performing up to 15.8 trillion operations per second. GPUs like Nvidia's A100 can reach roughly 312 teraflops of dense FP16 Tensor Core throughput, or about double that with structured sparsity. (Note: “FLOPS” is an acronym for “floating point operations per second”, but more on this later.) Google's Tensor Processing Units (TPUs) are custom application-specific integrated circuit (ASIC) chips built for tensor operations. (See Note 1 at the end for a definition of “tensor operations”.) This raw computational power enables much faster iteration through batches of training data compared to general-purpose CPUs.

As mentioned before, training a neural network entails a forward pass through the network to make predictions, followed by backpropagation of errors to update weights. Specialized hardware accelerates both phases. Only the forward pass is performed for inference, so the specialized hardware can achieve very high throughput on new data. Dedicated inference chips like Google's TPUs and Nvidia's Jetson boards are tuned to minimize inference latency. Optimizations like batching requests, low-precision numeric formats, and model compression at the hardware level also help shoulder the computational burden of these models.

New Optimizations

To understand how computing hardware achieves these substantial gains, consider the architecture of a GPU chip. A GPU contains thousands of small, efficient cores built for parallel execution. In contrast, CPUs have fewer but larger and more complex cores focused on general sequential code. Originally designed for graphics processing, GPUs were primarily used to render complex images and deliver immersive visual experiences in video games and other multimedia applications. Their parallel processing capabilities made them well suited for handling the large datasets required to generate real-time graphics.

However, as the demand for AI/ML capabilities grew, the potential of GPUs beyond graphics became much more apparent. These high-performance processors could efficiently handle the computationally intensive operations involved in training deep neural networks. By harnessing the power of parallelism, GPUs can process massive amounts of data simultaneously, dramatically reducing training times and enabling the exploration of more complex models. This transition from graphics-centric applications to AI/ML workhorses marks a significant turning point. GPUs are a cornerstone of modern AI/ML infrastructure, empowering researchers and practitioners to tackle complex challenges and achieve groundbreaking results in deeper and more practical AI/ML subdomains such as natural language processing, computer vision, and deep reinforcement learning.

Taking a step further than GPUs, Google's TPUs are ASIC chips built specifically for tensor operations (a tensor is a multi-dimensional array that generalizes vectors and matrices), using a systolic array architecture. This consists of a grid of simple compute units passing values in a pipelined fashion. Each unit operates only on local data and passes partial results along, which is ideal for fast matrix multiplication. Combined with advanced interconnects, systolic arrays achieve high compute density and power efficiency.
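To give a feel for how a systolic array computes a matrix product, the following sketch simulates one cycle by cycle. It assumes an output-stationary layout, chosen here only for simplicity: each unit keeps its own running sum, values from the first matrix flow left to right, and values from the second flow top to bottom. This is a teaching model and is not meant to describe the exact dataflow of any specific chip.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a x b.

    Unit (i, j) only ever talks to its left and upper neighbours.
    Returns the product and the number of clock cycles used.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0.0] * cols for _ in range(rows)]    # running sum held in each unit
    a_reg = [[0.0] * cols for _ in range(rows)]  # value travelling rightwards
    b_reg = [[0.0] * cols for _ in range(rows)]  # value travelling downwards
    cycles = rows + cols + inner - 2
    for t in range(cycles):
        # Visit units from the far corner back so each one reads its
        # neighbours' values from the previous cycle, not the current one.
        for i in reversed(range(rows)):
            for j in reversed(range(cols)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:  # left edge: row i is fed in, delayed by i cycles
                    a_in = a[i][t - i] if 0 <= t - i < inner else 0.0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:  # top edge: column j is fed in, delayed by j cycles
                    b_in = b[t - j][j] if 0 <= t - j < inner else 0.0
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "in", cycles, "cycles")
```

The staggered feeding at the edges is what makes matching operands meet in the right unit at the right time. In hardware all units update in parallel each cycle, so the cycle count grows with the sum of the dimensions rather than their product.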

New hardware primitives are also being introduced into the fold. For example, analog in-memory computing performs multiplication directly in a memory array, avoiding data movement. Phase-change memory and memristors can store weights and enable fast in-situ updates. Optical systems use photonics to transfer large amounts of data and enable new types of parallelism.

On the software side, platforms and frameworks like CUDA by Nvidia, Ray by Anyscale, and Caffe2 help deploy and scale models across different hardware backends. They include optimizations like kernel fusion, which combines several operations into larger chunks that execute faster. Performance tuning on new hardware involves balancing parallelism, data locality, power usage, and memory hierarchies. Specialized ML hardware like GPUs, TPUs, and custom silicon provides critical acceleration, parallelism, and pipelining capabilities to efficiently execute the math-intensive operations involved in neural network training and inference.
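Kernel fusion, mentioned above, can be illustrated without any GPU at all. In the sketch below, the unfused version makes three separate passes over the data and writes a full intermediate list after each one, which stands in for three kernel launches and their round trips to memory. The fused version does the same arithmetic in a single pass. The function names and the scale, shift, ReLU sequence are illustrative choices, not any framework's actual API.

```python
def unfused(values, scale, bias):
    """Three separate passes, each writing a full intermediate result."""
    scaled = [v * scale for v in values]         # pass 1: multiply
    shifted = [s + bias for s in scaled]         # pass 2: add
    return [max(0.0, s) for s in shifted]        # pass 3: ReLU


def fused(values, scale, bias):
    """One pass: each element is read once and written once."""
    return [max(0.0, v * scale + bias) for v in values]


values = [-2.0, -0.5, 0.0, 1.5]
unfused_result = unfused(values, scale=2.0, bias=1.0)
fused_result = fused(values, scale=2.0, bias=1.0)
print(unfused_result, fused_result)
```

Both versions return the same numbers. The benefit of fusing is that intermediate values never have to be written out and read back, and on real accelerators that memory traffic is often more expensive than the arithmetic itself.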

The Significance of Infrastructure Choices

As discussed, the choice of hardware infrastructure plays a pivotal role in determining the speed and efficiency of model training and deployment. The fundamental principle is that more powerful and capable hardware can handle more complex computations, resulting in faster convergence during training and lower latency during inference. One key metric used to measure chip performance and speed is FLOPS (floating-point operations per second). FLOPS quantifies the capacity of a hardware device to execute arithmetic operations on floating-point numbers within a given time frame. A chip's FLOPS rating provides a numerical representation of its computational horsepower. When choosing hardware, it's crucial to assess whether the chip's FLOPS rating aligns with the model's complexity and overhead. This ensures the hardware infrastructure is robust enough to handle training and inference reliably.
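A back-of-the-envelope calculation shows how a FLOPS rating translates into time. The sketch below estimates how long one large matrix multiplication would take on two hypothetical chips. The chip names, their peak ratings, and the 40% utilization figure are all assumptions made up for illustration, since real workloads rarely sustain a chip's peak rate.

```python
def matmul_flops(n):
    """Approximate floating-point operations needed to multiply two
    n x n matrices with the standard algorithm (n**3 multiplies and
    about n**3 additions)."""
    return 2 * n ** 3


def seconds_to_run(total_flops, peak_flops_per_second, utilization=1.0):
    """Estimate runtime given a peak rate and the fraction of it achieved."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return total_flops / (peak_flops_per_second * utilization)


# Hypothetical peak ratings in floating-point operations per second.
HYPOTHETICAL_CHIPS = {
    "small accelerator": 10e12,   # 10 teraflops
    "large accelerator": 300e12,  # 300 teraflops
}

work = matmul_flops(8192)
estimates = {
    name: seconds_to_run(work, peak, utilization=0.4)
    for name, peak in HYPOTHETICAL_CHIPS.items()
}
for name, seconds in estimates.items():
    print(f"{name}: {seconds * 1000:.1f} ms for one 8192 x 8192 multiply")
```

The same arithmetic scales up to whole training runs: estimate the total operations the model needs, divide by the rate the hardware realistically sustains, and compare the result against budget and deadlines.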

Latency Bottlenecks

Several bottlenecks can contribute to latency issues, including data processing, model architecture, and the type of hardware used. As stated before, Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) are commonly used hardware choices for ML tasks because of their parallel processing capabilities. GPUs excel at handling complex calculations and are well suited to a broad range of models. On the other hand, TPUs are specifically designed for neural network workloads and can provide substantial performance improvements for large-scale models. Choosing the right GPU size, or deciding between GPUs and TPUs, depends on the specific requirements of the model in question. A larger GPU typically offers more FLOPS, enabling faster training but at a higher cost. For inference, GPUs and TPUs offer varying levels of efficiency based on their architectures and memory capacities. From an infrastructure spending perspective, it’s important to understand what hardware is actually sufficient for training and inference. It is rarely prudent to overspend on compute resources that your model doesn’t truly need.

Off-the-Shelf Models vs. Custom Training

When considering hardware choices, another crucial decision is whether to use off-the-shelf pre-trained models or undertake custom training and deployment. Off-the-shelf models are pre-trained on vast datasets and can save time and computational resources during deployment, but they may not be well suited to a specific task or subdomain. Fine-tuning these models to suit specific tasks can itself require significant computational power. Custom training and deployment offer the advantage of tailoring models precisely to the task at hand. This is particularly relevant when working with domain-specific data or complex problem domains. However, such customization demands more computational resources and a well-suited hardware setup.

Balancing hardware performance, efficiency, and cost is a key challenge in ML model development. GPUs and TPUs are known for their high memory bandwidth, which is crucial for handling large datasets efficiently. Naturally, this enhanced performance comes at a cost. GPUs, especially those with higher memory capacities, are more expensive. TPUs, while powerful, require adapting the model to the TPU architecture. There are many choices and tradeoffs to analyze when dealing with machine learning models, many of which are rooted in selecting the specific hardware infrastructure. Hardware is far more complex than its usual treatment as an invisible abstraction in the ML development process would suggest.

The GPU Shortage

In an ideal world, these advanced chips would be available in unlimited supply to the companies seeking them, allowing the generative AI boom to continue its rapid pace of development. However, this is now stymied by a formidable roadblock - an acute shortage of Graphics Processing Units (GPUs). The demand for computing has hockey-sticked, and the shortage of GPUs is bottlenecking innovation, especially for smaller startups and researchers. The substantial size, deep financial resources, and strong market foothold of major tech companies typically grant them easier access to GPUs. This leaves startups and researchers, who lack similar connections and spending power, scrambling to acquire these resources.

Notably, Nvidia, which controls about 80% of the GPU market, has struggled to meet this burgeoning demand. The shortage may have been further exacerbated by unforeseen disruptions to supply chains caused by the global COVID-19 pandemic. Nvidia executives claim that the supply/demand issue is largely the result of a lag in the chip manufacturing supply chain. This is intuitive, since a GPU product is made up of many components beyond the chip itself: the add-in board, packaging, and so on. A microcosm of this shortage was experienced during the cryptocurrency boom, which led to shortages of popular GPUs. That shortage is now returning at a larger magnitude.

One critical manifestation of this shortage can be observed in the dynamics of major cloud service providers. Giants like Amazon Web Services (AWS), Google Cloud, and Microsoft Azure are grappling with oversubscription for Nvidia's newest and most powerful GPU offering, the H100, which is 2.3 times faster at training and 3.5 times faster at inference than its predecessor, the A100. These cloud providers lease out GPU access to AI developers, enabling them to perform resource-intensive tasks like training large-scale generative models. However, due to the shortage, access to these GPUs is severely constrained, leading to waitlists that stretch for months. Consequently, the swift pace of development and experimentation is experiencing a lot of friction.

The shortage has also triggered a price surge in the GPU market. An individual H100 with memory and a high-speed interface originally retailed for around $33,000. Second-hand units now cost between $40,000 and $51,000. This cost escalation poses significant challenges, particularly for AI startups with limited financial resources. The financial strain has the potential to stifle innovation, as these resource-constrained, lean startups might struggle to afford the computing power necessary to train and deploy state-of-the-art generative models. This threatens the growth trajectory of startups and has broader implications for the overall AI ecosystem. The scarcity of GPU resources affects not only cost but also development velocity. Startups and smaller companies may find themselves lagging behind better-funded incumbents. They risk losing the "arms race" in generative AI right now.

Faced with this challenge, generative AI builders are actively seeking alternatives to navigate the GPU shortage. Some are exploring the option of shifting their model training and inference to alternative chips (AMD’s open-source ROCm software stack is making great strides, and the company's MI250 and upcoming MI300-series chips appear to be promising alternatives). While these alternatives might not match the processing power of Nvidia GPUs, they provide a more available resource for training AI models. Moreover, developers are delving into optimization techniques like quantization and knowledge distillation. These strategies aim to reduce the size of AI models, enabling deployment on less potent hardware without sacrificing performance to an unacceptable extent.
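As a rough illustration of quantization, the sketch below maps floating-point weights to 8-bit integers using a single scale factor, then converts them back to see how much precision is lost. This is a minimal symmetric scheme assumed here for clarity. Production toolchains use more elaborate variants (per-channel scales, calibration data, and so on), and the weight values are invented.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 or 1.0  # fall back to 1.0 if every weight is zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Invented example weights.
weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"scale={scale:.4f}", f"worst error={worst_error:.2e}")
```

Each weight now occupies one byte instead of four (for 32-bit floats), at the price of a rounding error no larger than about half the scale factor. Whether that error is acceptable depends on the model and has to be checked empirically.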

Ultimately, resolving the GPU shortage is paramount to fully unleashing the potential of the generative AI revolution. Industry leader Nvidia anticipates improvements in supply during 2023, providing a glimmer of hope. However, the real challenge lies in ensuring that the supply can keep pace with the rapid evolution of AI capabilities. The AI industry's voracious GPU appetite won't be sated overnight. But as supply chains stabilize and new solutions emerge, the pressures should gradually subside. The GPU gold rush is on for now, and access confers a competitive edge to those who can secure it.

Many folks in the tech sector are pinning their hopes on the transformative promise of the generative AI boom. There are just as many folks, if not more, outside the tech industry who have similarly expansive visions for a future underpinned by intelligent, creative software. Although some of these paths diverge in places and often seem altogether too lofty to be plausible, one thing is certain: realizing any of AI’s software potential is predicated on a consistent and growing supply of the hardware that makes it all happen. After all, the cloud is, at the lowest level, grounded in atoms, not bits.
