Deep learning researchers are inundated with new research results. For anyone working adjacent to deep learning, it is a battle to stay ahead of the news cycle and to connect those advances to real-world applications.

This post will be the first in a short series of articles meant as resources for covering the basics of deep learning. These articles will be written with non-researchers in mind, and no question is too basic to ask in the discussion threads these posts generate. 

AI-generated image created with DALL-E 2 using the prompt “an oil pastel drawing of three serious-looking cartoon brains running a race on a racetrack”

Here, we’ll provide general background for understanding deep learning, covering key terminology and a high-level history of the field since its founding. There’s a lot to cover.

So without further ado, let’s dive right in!

What’s in a Name?

Deep learning practitioners throw around a number of connected terms in ways that can be confusing to people outside their space. For example, AI, deep learning, neural nets, and machine learning are closely related and are often used semi-interchangeably. However, there are some key differences.

Source: MDPI

Artificial Intelligence (AI) is when computers can do things that usually require human intelligence, like seeing, hearing, understanding language, and making decisions. It includes subfields like machine learning, rule-based systems, and evolutionary algorithms.

Machine Learning (ML) is when computers can learn how to do something by themselves, instead of being told what to do by humans. Take the example of a chess-playing bot. If the bot learned how to play the game by only examining examples of successful games, then that would be a use of ML. On the other hand, if the bot played by following a set of rules like ‘if your opponent moves their rook to challenge the queen, then move the queen to a safe position’, then this rules-based system, while complicated, is not an example of ML.

Neural Nets (sometimes known as Neural Networks) are a specific type of ML model inspired by the structure and function of the human brain. Neural nets take a connectionist approach to AI based on a parallel with neuroscience: just as the brain computes with interconnected networks of processing units (neurons), ML systems can be built from interconnected artificial neurons. We assign weights to these interconnections, and by modifying those weights, a neural net is able to learn.

Deep Learning (DL) is a subset of neural nets that involves the use of multiple layers of interconnected neurons, also known as nodes. The “deep” descriptor refers to those multiple layers of nodes within the neural net. Input data flows sequentially through the neural net, one layer at a time, so that in the basic case the output of layer 1 becomes the input to layer 2. More complicated models use different connection patterns that allow the network to learn more efficiently or to process different types of data.
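
To make this concrete, here’s a minimal sketch of what a small “deep” network might look like in PyTorch. The layer sizes are arbitrary placeholders, chosen only to show the output of one layer becoming the input to the next.

```python
import torch
import torch.nn as nn

# A minimal "deep" network: three fully-connected layers stacked so that
# the output of one layer becomes the input to the next.
model = nn.Sequential(
    nn.Linear(128, 64),  # layer 1: takes a 128-dimensional input
    nn.ReLU(),           # non-linearity between layers
    nn.Linear(64, 32),   # layer 2: consumes layer 1's output
    nn.ReLU(),
    nn.Linear(32, 2),    # layer 3: produces scores for 2 classes (e.g. cat vs. dog)
)

x = torch.randn(1, 128)  # one example with 128 input features
print(model(x))          # data flows through the layers sequentially
```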

Neural networks traditionally relied on domain-specific knowledge to extract features from raw inputs. In contrast, deep learning networks have more layers and different types of connections. In part because of this, they can learn features directly from the raw inputs. Image source: Deepgram

Deep Learning models are particularly good at tasks that involve complex, unstructured data such as images, audio, and text. Because of their large size, deep learning models may overfit on simple datasets, meaning that they memorize the dataset rather than generalize from it. Thus, deep models generally require suitably complex data.

Training a Neural Net

Neural nets, including deep learning models, typically have a learning mode (training) and a running mode (inference).

Training creates a neural network capable of performing a certain task; inference is when the NN is used on “production” data (image source)

During Inference, we run new data through a trained neural net to make predictions or decisions, as users of deep neural networks do in production. Here, we worry about performance and accuracy: are results returned in a timely manner, and are they the expected results? If not, research and engineering collaborate to find the root cause and fix it. Sometimes this fix involves retraining the network, as in the case where a customer’s specific data is different enough from the data the more general model was originally trained on. Also, neural nets are often viewed as black boxes, in that it is difficult to explain how their outputs are arrived at. This lack of explainability can be an issue in certain domains like decision support systems, where networks aid human experts in making critical decisions.
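
Setting explainability aside, here’s a rough sketch of what running inference looks like in PyTorch (with an untrained stand-in model in place of a real trained one): we switch the model into evaluation mode and turn off gradient tracking, since no learning happens at this stage.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 2)    # stand-in for a trained model loaded from disk

model.eval()                 # switch to inference mode (affects dropout, batch norm, etc.)
with torch.no_grad():        # no gradients needed: we are not learning here
    new_example = torch.randn(1, 128)    # stand-in for real production data
    scores = model(new_example)
    prediction = scores.argmax(dim=1)    # pick the highest-scoring class ('Dog' or 'Cat')
    print(prediction)
```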

During Training, the neural net’s weights are adjusted to create a model that can perform a specific task, like classifying images with dogs in them. We can either learn from examples that have labels (‘Dog’, ‘Cat’), or from unlabeled examples where we don’t know the labels but might know something else about the data - for example, we might have a folder that contains images of cats and dogs without knowing what is in each individual image.

Labeled examples are expensive to generate, as they typically require a human labeler to look at images or listen to audio and tag them. We can instead try to learn from patterns in the data even when labels aren’t present; this type of learning is called unsupervised learning. For example, if we know there are two kinds of images (cats and dogs), we could sort the images into two bins such that the bins are as different from each other as possible. This works because we are assuming that dog images are more similar to one another than they are to cat images. Deepgram diarization uses elements of unsupervised learning by clustering embeddings (derived features) from audio data to determine who is speaking, as sketched below.
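
Here’s a minimal sketch of that clustering idea using scikit-learn’s KMeans, with random vectors standing in for real embeddings; this is purely illustrative and not Deepgram’s actual diarization pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for embeddings extracted from short windows of audio:
# 100 windows, each summarized as a 16-dimensional vector.
embeddings = np.random.randn(100, 16)

# Ask for two clusters, e.g. "speaker A" vs. "speaker B"
# (or "cat images" vs. "dog images" in the image example above).
kmeans = KMeans(n_clusters=2, n_init=10).fit(embeddings)
print(kmeans.labels_[:10])   # cluster assignment for the first 10 windows
```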

For speaker diarization, unsupervised learning is used to cluster embeddings, each of which represents a unique speaker (image source)

More commonly, we learn from examples (supervised learning) using data (e.g. images) along with labels (‘Dog’, ‘Cat’). We then break up our dataset into a training set and a validation set. 

Next, we use an algorithm called error backpropagation (backprop for short) that takes the model outputs (‘Dog’, ‘Cat’), compares them to the training set labels, and adjusts the network’s weights to nudge it toward producing better outputs the next time around. Every time we make it through all the images in the training set, we call that a training epoch. Periodically, we also run the partially trained model on the validation dataset. These validation outputs tell us how well the model generalizes to data it hasn’t been trained on, and they let us know when we should stop the training loop.
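
Below is a minimal sketch of what this loop might look like in PyTorch. The data, model, and hyperparameters are all placeholders; real training code would load actual images and labels and track validation metrics more carefully.

```python
import torch
import torch.nn as nn

# Placeholder data: 256 "images" flattened to 128 features, with 0/1 labels (cat/dog).
inputs = torch.randn(256, 128)
labels = torch.randint(0, 2, (256,))

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()                          # compares outputs to labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # applies the weight updates

for epoch in range(5):                # each pass over the whole training set is an epoch
    outputs = model(inputs)           # forward pass: compute predictions
    loss = loss_fn(outputs, labels)   # how wrong were we?
    optimizer.zero_grad()
    loss.backward()                   # backprop: how did each weight contribute to the error?
    optimizer.step()                  # nudge the weights to do better next time
    print(f"epoch {epoch}: loss {loss.item():.3f}")
    # periodically, run the model on a held-out validation set here
```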

Before Deep Learning

Notable breakthroughs in the history of Neural Nets (image source)

At the heart of deep learning are small processing units called nodes or neurons. The name neuron is a call-out to the field’s history of drawing on biological neuron models. These neurons take in input from other neurons, perform a computation, and then pass output on to other neurons. By connecting neurons together in networks, complex computations become possible. The ideas at the root of deep learning can be traced back to the 1940s and 50s, when the first artificial neuron model (the McCulloch-Pitts neuron) was proposed. This advance happened around the same time as some of the early foundational research on the electrical activity of biological neurons (the Hodgkin-Huxley model, developed from experiments on the squid giant axon).

The Rosenblatt perceptron (1957) is one of the earliest neural net models for learning to classify data using adjustable weights. (See the perceptron diagram below.) In 1974, Paul Werbos described error backpropagation, the foundational method mentioned earlier that is used to train multi-layer neural nets to learn from their prediction errors. The algorithm didn’t initially see widespread use and was rediscovered at least twice through the 1980s. That doesn’t mean we could have had ChatGPT ten years earlier, however, as the necessary compute power and data availability would still take some time to come to fruition.

Diagram of Rosenblatt's perceptron.
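
To make the idea of learning with adjustable weights concrete, here’s a tiny sketch of the classic perceptron learning rule on a made-up, linearly separable toy problem (the data and learning rate are arbitrary).

```python
import numpy as np

# Toy data: 2-D points labeled 1 if x + y > 1, else 0 (linearly separable).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.9], [0.1, 0.2]])
y = np.array([0, 0, 0, 1, 1, 0])

w = np.zeros(2)   # adjustable weights
b = 0.0           # bias
lr = 0.1          # learning rate

for _ in range(20):                                    # a few passes over the data
    for xi, target in zip(X, y):
        prediction = int(np.dot(w, xi) + b > 0)        # weighted sum, then threshold
        error = target - prediction                    # -1, 0, or +1
        w += lr * error * xi                           # nudge weights toward correct answers
        b += lr * error

print(w, b)
```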

Through the 80s and 90s, statistical ML methods like support vector machines (SVMs) and hidden Markov models (HMMs) were highly tailored to specific domains like handwriting recognition. Neural nets were limited to shallow nets (i.e. a small number of layers) due to limitations in compute, algorithms, and data.

Hidden Markov Models

HMM Block Diagram: If you walk for two days then shop on the third, what is the most likely sequence of rainy and sunny days on those three days? (image source)

Let’s look at Hidden Markov Models (HMMs) in particular, as these models were the bread and butter of ASR for many years. Starting in the early 1980s, HMMs were used to model which phonemes (the smallest units of sound in a language) were likely to occur, and in what order, given the audio. HMMs led to significant improvements in ASR and NLP.

HMMs are probabilistic models used with time-series data, where the state (i.e. the transcription) is not directly observable but can be inferred from observed data (i.e. the audio). For ASR, given a sequence of audio snippets, an HMM can be used to estimate the most likely sequence of hidden states (the transcription) that generated them, using spiffy-sounding algorithms like Viterbi and Baum-Welch.
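
Here’s a small sketch of the Viterbi algorithm on the toy weather example from the diagram above; the transition and emission probabilities below are illustrative placeholders rather than values taken from the figure.

```python
# Hidden states (the weather) vs. what we observe (the person's activity).
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}                     # illustrative numbers only
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p  = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
           "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # best[t][s] = (probability of the best state path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        best.append({s: max((prev[p][0] * trans_p[p][s] * emit_p[s][o], p)
                            for p in states)
                     for s in states})
    # Pick the best final state, then follow the recorded predecessors backwards.
    prob, state = max((best[-1][s][0], s) for s in states)
    path = [state]
    for step in reversed(best[1:]):
        state = step[state][1]
        path.append(state)
    return prob, path[::-1]

# "Walk for two days, then shop on the third" -- which weather sequence is most likely?
print(viterbi(["walk", "walk", "shop"]))
```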

Despite their success, HMMs have a number of drawbacks that deep learning techniques were later developed to address. Said drawbacks include the inability to model long-range dependencies between inputs (e.g. what is the specific thing to which a given pronoun refers?) and difficulty handling variable-length sequences.

Let’s go Deeper

The term "deep learning" was first coined in 2006 by AI pioneer Geoffrey Hinton. Deep learning differs from machine learning in that deep learning systems can learn features directly from input data, whereas traditional ML algorithms require features to be hand-engineered by domain experts (more on feature learning below). DL models benefit from the availability of powerful GPUs, which enable them to perform computations much faster than traditional CPUs. 

Geoffrey Hinton helped popularize the term “deep learning” (image source)

Let’s look at a couple types of DL models.

Convolutional Neural Networks and Feature Learning

Convolutional Neural Networks (CNNs) are a type of DL model that regularly achieve state-of-the-art performance at tasks such as image recognition, object detection, and image segmentation. 

Think of the convolutions as learned filters that analyze small sections of an image to capture something important, like an edge or a shape. Outputs are pooled together and then re-analyzed by further convolutions, which try to make sense out of larger sections of the image with each pass. For example, later layers might act as face filters rather than edge filters. Finally, the CNN’s fully-connected output layer produces a prediction or decision about what the image shows.
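
Here’s a minimal sketch of a tiny CNN in PyTorch; the filter counts and image size are arbitrary, chosen only to show convolutions, pooling, and a fully-connected output layer working together.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # 8 learned filters over small image patches
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pool: shrink 32x32 -> 16x16
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # later filters see larger effective regions
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 2),                    # fully-connected layer: 'Dog' vs. 'Cat' scores
)

image = torch.randn(1, 3, 32, 32)  # one fake 32x32 RGB image
print(cnn(image).shape)            # -> torch.Size([1, 2])
```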

Convolutional Neural Networks are used to identify features (patterns) in images (image source)

But if CNNs are designed to process images, what do they have to do with audio? CNNs can also be used for audio tasks via spectrograms, visual representations of audio that are widely used for analysis. A spectrogram plots the frequency content of the audio over time. The key challenge in using CNNs with spectrograms is determining an appropriate filter size: large enough to capture relevant information, but not so large as to lose details about how the audio is changing.
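
As a rough sketch (assuming the torchaudio library is available), here’s roughly how a waveform can be turned into a spectrogram “image” that a CNN can consume; the parameter values are illustrative, not tuned.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)   # stand-in for one second of 16 kHz audio

# Convert the waveform into a mel spectrogram: frequency content over time.
to_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # window size: large enough to resolve frequencies...
    hop_length=160,   # ...but hopped often enough to track how the audio changes
    n_mels=80,
)
spectrogram = to_spectrogram(waveform)
print(spectrogram.shape)   # (channels, mel bins, time frames) -- an "image" for a CNN
```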

CNNs process audio by converting the audio signal (top) into spectrogram images (bottom) and then analyzing the spectrogram image for spatial patterns.

Generative AI

Unlike other types of AI, which are focused on making predictions or classifying existing data, generative AI is designed to generate new data similar to its training data. Generative AI can be used to create new images, music, or text. It can also raise ethical concerns, particularly around the creation of convincing fake content such as deepfakes, and it raises important questions about creative attribution and intellectual property for writers and artists alike.

Generative AI can learn a style, which can be applied to an arbitrary input (aka style transfer) for interesting artistic effect (image source)

One early type of Generative AI is the Generative Adversarial Network (GAN), introduced by Ian Goodfellow and collaborators in 2014. A GAN’s generator learns to generate plausible data, which becomes negative training examples for its discriminator. The discriminator is simply a classifier that tries to distinguish real data from the data created by the generator. One can think of the generator as an art forger creating fake works of art that it wants to sell as if they were real, while the discriminator is an art critic whose sole purpose is to distinguish fake paintings from real ones to prevent the buyer from wasting their money.
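
Here’s a heavily simplified sketch of that adversarial setup in PyTorch, with toy two-dimensional “data”, tiny networks, and just one loss computation for each player; a real GAN adds optimizers, many training steps, and much larger models.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))      # noise -> fake "art"
discriminator = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))  # art -> real/fake score

loss_fn = nn.BCEWithLogitsLoss()
real_data = torch.randn(32, 2) + 3.0   # stand-in for real examples
noise = torch.randn(32, 8)
fake_data = generator(noise)           # the "forger" creates fakes

# Critic's loss: label real data 1, generated data 0, and learn to tell them apart.
d_loss = loss_fn(discriminator(real_data), torch.ones(32, 1)) + \
         loss_fn(discriminator(fake_data.detach()), torch.zeros(32, 1))

# Forger's loss: try to make the critic score the fakes as real.
g_loss = loss_fn(discriminator(fake_data), torch.ones(32, 1))
print(d_loss.item(), g_loss.item())   # weight updates (omitted here) would alternate between the two
```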

Probably the most exciting generative model is the Generative Pretrained Transformer aka GPT, built around a Transformer model, which we’ll discuss in the next section.

Sequence Learning: RNNs, LSTMs, and Transformers

Sequence learning refers to teaching a model to output a sequence (e.g. a series of words forming a transcription) from input data that is also a sequence (e.g. audio). Transformers are the go-to architecture for many sequence-based tasks, including language modeling, machine translation, and text classification. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, although slightly dated now, still enjoy popularity for certain sequence-data use cases.

Both RNNs and LSTMs are like super-smart movie watchers that are really good at understanding sequences of events. But there's a key difference between the two.

Think of it like this: RNNs are like movie watchers that have a really good short-term memory. They can remember what happened in the scene they just watched, but they might have trouble remembering what happened earlier in the movie.

On the other hand, LSTMs are like movie watchers that have both a short-term and long-term memory. They can remember what happened earlier in the movie, as well as what happened in the current scene. This makes them better at understanding the overall context of the movie, and making predictions about what might happen next.

LSTMs benefit from longer-term memory than RNNs (image source)

In other words, while RNNs can be good at predicting what comes next in a sequence of events, they might have trouble understanding the bigger picture. LSTMs, on the other hand, can better understand the long-term relationships between events, which makes them better suited for tasks like language translation or generating new text that makes sense.

RNNs and LSTMs create a sense of memory through recurrent (backward) connections. In a simple feed-forward network (one without recurrent connections), an input is fed into the network to produce an output, but each input is processed independently of the ones that came before it. Recurrence is one way to solve this problem, as it moves the network from being stateless to stateful: the backward connections add a memory to the network (like a flip-flop, for the electrical engineers out there) at the cost of complexity and possible instability. And as sequences get longer, these networks become harder to train, because the prediction error has to be back-propagated farther and farther back in time. This vanishing gradient problem was one of the motivations for LSTMs and, later, one of the reasons transformers were developed.
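
Concretely, the “memory” described above is just a hidden state vector that gets fed back in at every step. Here’s a minimal sketch using PyTorch’s built-in RNNCell (the sizes are arbitrary placeholders).

```python
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)   # one recurrent unit

sequence = torch.randn(5, 1, 10)      # 5 time steps, batch of 1, 10 features each
hidden = torch.zeros(1, 20)           # the network's "memory" starts empty

for x_t in sequence:                  # process the sequence one step at a time
    hidden = rnn_cell(x_t, hidden)    # new memory depends on the input AND the old memory

print(hidden.shape)                   # the final state summarizes the whole sequence
```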

Transformers work differently than RNNs and LSTMs. Instead of processing sequences of information one step at a time, like RNNs and LSTMs do, Transformers process the entire sequence all at once. They do this by using a self-attention mechanism, which allows them to focus on the most important parts of the input sequence.

Think of it like this: imagine you're trying to summarize a really long book. Instead of reading the book one sentence at a time, you might first scan the entire book to get an idea of what it's about, and then focus your attention on the most important parts. This is similar to how Transformers work - they scan the entire input sequence to get an idea of what it's about, and then focus their attention on the most important parts to make predictions.

This key innovation of transformers is called the self-attention mechanism, which allows the network to selectively focus on different parts of the input sequence when making predictions. Self-attention works by computing a weighted sum over all the positions in the input sequence, where the weights are learned by the network during training. This allows the network to attend to the most relevant parts of the input at each step of the prediction process, rather than processing the sequence strictly one step at a time as an RNN does. It also makes transformers adept at dealing with variability within speech and at modeling long-range dependencies. For example, a transformer would be able to infer that the ‘It’ at the beginning of the previous sentence refers to the self-attention mechanism.
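
Here’s a minimal sketch of that core weighted-sum computation (scaled dot-product self-attention) in PyTorch; the sizes are arbitrary, and real transformers add multiple heads, learned projections, and much more.

```python
import math
import torch

seq = torch.randn(6, 16)                       # 6 positions (e.g. words), 16 features each

# For plain self-attention, queries, keys, and values all come from the same sequence.
q, k, v = seq, seq, seq

scores = q @ k.T / math.sqrt(q.shape[-1])      # how relevant is every position to every other?
weights = torch.softmax(scores, dim=-1)        # attention weights: each row sums to 1
output = weights @ v                           # each position = weighted sum over ALL positions

print(weights[0])                              # how much position 0 attends to positions 0..5
```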

Transformers are very powerful and have achieved state-of-the-art results on many natural language processing tasks. They're particularly good at understanding long-range dependencies and relationships between words, which makes them well-suited for tasks like language translation, where you need to understand the overall context of a sentence in order to accurately translate it to another language.

In older statistical language modeling techniques we might use an n-gram to help learn the relationships between words. N-grams are a simple approach that breaks up a sentence into smaller "chunks" of words called n-grams. For example, if we had the sentence "The quick brown fox jumped over the lazy dog", we could break it up into 2-grams like this: "The quick", "quick brown", "brown fox", "fox jumped", "jumped over", "over the", "the lazy", "lazy dog".
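
In code, building those 2-grams is nearly a one-liner; here’s a quick sketch in plain Python.

```python
sentence = "The quick brown fox jumped over the lazy dog"
words = sentence.split()

# Pair each word with the word that follows it to get the 2-grams.
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams)
# ['The quick', 'quick brown', 'brown fox', 'fox jumped', 'jumped over', 'over the', 'the lazy', 'lazy dog']
```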

By counting the frequency of each n-gram in a large corpus of text, we can get an idea of how likely it is that a particular sequence of words will appear together in a sentence. This can be used for tasks like language modeling and text classification. However, n-grams fall apart on the long-distance dependencies noted above. The name Deepgram is in part inspired by this problem and its solution: a deep-learning version of an n-gram. Transformers are a big part of that, which we’ll talk about more in the next part of this blog series.

GPUs and Data

We mentioned three advances that enabled the deep learning revolution: algorithms, compute, and data availability. We’ve spent a bit of time describing various algorithms and models, so let’s touch briefly on the other two.

The availability of large amounts of data, particularly in areas like computer vision and speech recognition, meant that neural networks could be trained on much more data than before, enabling them to learn far more complex patterns. This data has largely been mined directly from various sources (e.g. YouTube audio and transcriptions) or gathered via crowdsourcing. Crowdsourcing is the process of asking or paying people, in a distributed way, to label data; it was used to create ImageNet, a famous image dataset released in 2009.

CPUs need to reserve chip area for caches and control units while GPUs can dedicate most of their transistors for data processing (image source)

Finally, the development of powerful GPUs (graphics processing units) made it possible to train neural networks much faster than before. Why? Well, it starts with the job that GPUs were designed for: driving potentially millions of pixels on computer displays in essentially real time. Unlike CPU architectures, which can process at most a few dozen computational threads at a time, GPUs are designed to be “embarrassingly parallel,” which means that GPU hardware can process far more data simultaneously than even the highest-performing CPUs on the market. As if by a stroke of luck, it turns out that this highly parallelized architecture is well suited to the math-intensive work of machine learning training and inference. This meant that researchers could experiment with different neural network architectures and train them on larger datasets in less time. In other words, widely available GPU hardware is an enabling technology for advances in deep learning, and it paves the way for more specialized silicon designed specifically for ML training and inference.
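
From a practitioner’s point of view, taking advantage of a GPU in a framework like PyTorch (more on frameworks below) is, at the top level anyway, close to a one-line change; here’s a quick sketch with a placeholder model.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU if no GPU is present

model = nn.Linear(128, 2).to(device)      # move the model's weights onto the GPU
batch = torch.randn(64, 128).to(device)   # ...and the data too
output = model(batch)                     # the matrix math now runs on thousands of GPU cores
print(output.device)
```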

Code Frameworks

It is important to keep in mind that the models we’ve described in this blog are built on deep learning code frameworks, which make these modeling innovations possible. DL researchers these days tend to code in the PyTorch framework—initially released by Meta (née Facebook) in 2016. TensorFlow is another popular framework, but it is often considered to have a steep learning curve which can make rapid research iterations more difficult. PyTorch offers great flexibility, community support, and ease of use. 

For deep learning engineers, PyTorch provides optimized ways to load data, build and train models and visualize results. 
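
For instance, here’s a small sketch of PyTorch’s data-loading utilities, with a made-up tensor dataset standing in for real images.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Fake dataset: 1,000 examples with 128 features and a 0/1 (cat/dog) label each.
features = torch.randn(1000, 128)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# The DataLoader handles batching and shuffling for the training loop.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)   # torch.Size([32, 128]) torch.Size([32])
    break
```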

You may be aware that Deepgram uses Rust in production, and part of the engineering team’s job is to convert the research PyTorch models into Rust. Why do we go through that effort? It’s because the Python programming language, which PyTorch exposes as its main interface, relies on a Global Interpreter Lock (GIL). The GIL simplifies single-threaded execution, but it limits the multithreaded computation that our models could benefit from. Hence, Rust is a speed optimization that allows us to bypass the GIL.

Why don’t researchers code in Rust directly? Sometimes they do, but there is a steep learning curve and researchers tend to find it much easier to quickly experiment with models in a Python-based framework.

Stay Tuned

Deep learning is a rapidly evolving field that will continue to transform heretofore mostly-manual workflows across many industries. By learning directly from data, DL models are able to solve complex problems that were previously impossible. 

Next time, we’ll dive more into how deep learning is applied to audio data and use that understanding to take a peek under the hood of DG Research’s latest ASR models like Nova.

Trivia: How many “AI Winters” have there been?  Stay tuned to find out.
