Over the past decade, “neural networks” (NNs) have transformed from an obscure, purely theoretical concept to Silicon Valley’s hot new buzzword. Now, they’re everywhere. From Google Translate to TikTok, it’s nearly impossible to go a single day without somehow interacting with neural networks.

But what exactly are they?

Well, for a machine to be (artificially) intelligent, it must have a brain. And that brain takes the form of a neural network.

Together, we’ll walk through exactly how NNs work, assuming no background knowledge of AI whatsoever. Yes, the world of artificial intelligence and neural networks is vast, but upon closer inspection, we’ll see that they’re actually more approachable than meets the eye.

Neural Networks by Example

Let’s say we want to build a speech recognition app. (Deepgram knows a thing or two about automatic speech recognition using deep learning, so that’s why we’re sticking to audio for this example. These same core principles can be applied to other types of data too.)

If a user says a word, we want our computer to output what the user said. To simplify this example, let’s say there are only four words the user can say: “Reindeer,” “Raining,” “Adhere,” and “Adding.”

So the input (type: waveform) and output (type: string) would look like this:

So while speech recognition is simple for humans, it seems a bit more daunting to program a computer to do this. Computers don’t have ears, so they’ll have to somehow mathematically parse the numerical values expressed in .mp3 or .wav files and turn those numbers into language.

Oh yeah, one more thing: waveforms are like snowflakes. No two are alike. Even if the same speaker says the same word twice in the same way, the waveforms of those two utterances will look different.

What does this mean for us? Well, it means that we can’t simply map waveforms to letters. There are an infinite number of subtly different ways the word “reindeer” can appear as a waveform. And the same goes for the syllable “deer” or even the letter “r”.

Note that letters are unreliable too. The “o” in “women,” the “i” in “win,” and the “e” in “Nguyen” all sound the same.

So how on earth can a neural network learn to recognize speech? Let’s break it down.

First, we’ll talk about neurons. Then we’ll talk about how those neurons are connected. Finally, we’ll end by looking at the NN in action.

What Is a Neuron?

A neuron is just a variable. That’s all. It’s a variable whose value is equal to some number between 0 and 1. Every circle in the diagram of a neural network is a neuron.
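In code, that really is all a neuron amounts to. One common way networks keep a neuron’s value between 0 and 1 is a squashing function like the sigmoid; that’s an assumption on our part here, since the post saves the real math for later:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1) -- one common way
    # a neuron's value ends up between 0 and 1.
    return 1 / (1 + math.exp(-z))

# A neuron is just a variable holding a number between 0 and 1.
neuron = sigmoid(0.4)
```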

See? Not so scary. Let’s move on.

How Neurons Connect

Look at the structure of the neural network we’re using for our speech recognition example. We start off with one layer that contains loads of neurons. Each successive layer then has some other number of neurons. And the final layer has four neurons. It might look something like this:

The four neurons at the end each represent one of the four possible output words of our program. And the value that each of these neurons contains represents the probability that the user said that word.

Now let’s look at the first layer (read: input layer). How do we assign values here?

Well, note that there are 128 neurons in this layer. This means we split up our input waveform into 128 equally sized chunks. The value of the first neuron is equal to the average recorded frequency (in kilohertz) of the first chunk. The second neuron is equal to the average recorded frequency of the second chunk. And so on and so forth. We illustrate this process in the images below.

This image illustrates the act of slicing the audio into 128 equal chunks. Note that the number 128 was a somewhat arbitrary design choice (read: hyperparameter).

Here, we measure the average frequency of each audio slice in kilohertz. Then, we place all of those frequencies into a list, in order. This list represents the values in each neuron of the first layer of our speech-recognition NN.
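The chunking-and-averaging step above can be sketched in a few lines of Python. This is a hypothetical helper that mirrors the article’s description (real speech pipelines typically use spectrograms or MFCC features instead), and it estimates each slice’s frequency by counting zero crossings:

```python
import numpy as np

def waveform_to_input_layer(samples, sample_rate, n_chunks=128):
    """Split a waveform into n_chunks equal slices and estimate each
    slice's average frequency (in kHz) via zero-crossing counting.

    A hypothetical sketch of the article's input layer -- one value
    per slice, in order.
    """
    chunks = np.array_split(np.asarray(samples, dtype=float), n_chunks)
    input_layer = []
    for chunk in chunks:
        # Each full cycle of a wave crosses zero roughly twice.
        crossings = np.count_nonzero(np.diff(np.signbit(chunk)))
        duration = len(chunk) / sample_rate          # seconds
        freq_hz = (crossings / 2) / duration if duration > 0 else 0.0
        input_layer.append(freq_hz / 1000.0)         # convert Hz -> kHz
    return input_layer
```

Feeding in one second of a pure 440 Hz tone, for instance, yields 128 values all hovering around 0.44 kHz.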

Great! So now we know what’s in our first and last layers. But what about those middle layers?

This is where we see the NN in action.

Note: The functionality we’re about to go through is simply a model of how NNs work. It is not a perfect replica of what occurs under the hood. Much like how the model of an atom in chemistry textbooks is not actually what an atom looks like, the following walkthrough of NN functionality is a purely pedagogical model. The goal here is to attain a mental image of the “guts” of an NN. We’ll get into the legitimate math in a future blog post.

Neural Networks in Action

The premise of the NN's layered structure is as follows:

We start off by chopping the audio down into small chunks. The goal is for each successive layer to be able to recognize increasingly complex parts of the audio. For example, if the first layer just contains one neuron for each frequency chunk, then the second layer should be smart enough to see which chunks should be grouped together to form letter-sounds.

If we conceptualize neural networks like this, then the second layer should have one neuron for each possible letter (including letter-groups, like 'sh' or 'ng'). And the letters it thinks are most likely to have occurred in the waveform should correspond to the neurons with the highest values.

The third layer does the same thing, just on a higher level. Instead of piecing together frequencies into letter-sounds, it will piece together letter-sounds into syllables.

Finally, our output layer pieces together syllables into words.
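The layered structure above can be sketched as a chain of matrix multiplications. The layer sizes and random weights below are stand-ins (a trained network would have learned its weights from data, and the real hidden-layer sizes are a design choice), but the flow is the same: 128 frequency chunks in, four word probabilities out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 128 frequency chunks -> letter-sounds ->
# syllables -> 4 output words. Random weights stand in for trained ones.
layer_sizes = [128, 26, 8, 4]
weights = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes, layer_sizes[1:])]
biases = [rng.normal(size=m) for m in layer_sizes[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Push an input-layer vector through each layer in turn."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = sigmoid(w @ x + b)            # hidden "belief strengths" in (0, 1)
    logits = weights[-1] @ x + biases[-1]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                # output layer: probability distribution

probs = forward(rng.random(128))          # four probabilities, one per word
```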

Let’s dissect the image above. First, we have our input .wav file, which is broken down into 128 slices of audio. Each slice represents a sound of some frequency. For human voices, the range of possible frequencies is somewhere in the neighborhood of 0.08 to 0.255 kilohertz (that is, 80 to 255 hertz).

Our fully-trained, fully fine-tuned, fully-tested neural network then takes a look at the audio slices we’ve given it, and tries to piece them together into letters. In the completed image below, we see that the NN “hears” the letter ‘R’ because the value of the neuron that corresponds to the letter ‘R’ is close to 1. Meanwhile, the NN is indicating that it does not hear the sound ‘I’ because the value of the neuron that corresponds to the letter ‘I’ is close to 0.

In the next layer, we see that the NN was able to piece together the syllable “DEER” based on the letter-sounds it heard. Likewise, the NN is indicating that it hears “RAYN” as well. The other possible syllables, “ING” and “ADD,” were not heard.

Finally, once the NN knows the syllables at play, it pieces together the words. The final layer of neurons represents a probability distribution. There is a 96% chance that the input waveform is a recording of someone saying “Reindeer.” On the other hand, there is a 2% chance that the word was “Raining,” a 1.9% chance that the word was “Adhere,” and a 0.1% chance that the word was “Adding.”

Note that the hidden layers do not represent probability distributions. Rather, each neuron in the hidden layer has a value associated with the "strength" of its belief. For example, the high value associated with 'RAYN' (0.85) indicates that the neural network believes RAYN was uttered. Meanwhile, the low value associated with the "ING" neuron indicates that the NN doesn't believe “ING” was uttered.
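We can see this distinction with a couple of NumPy arrays. The numbers below are hypothetical: the hidden-layer belief strengths match the article’s syllable values, and the output scores were chosen so that a softmax over them lands roughly on the article’s 96% / 2% / 1.9% / 0.1% split.

```python
import numpy as np

# Hidden-layer "belief strengths" (RAYN, DEER, ING, ADD) are independent
# values in [0, 1] -- they need not sum to 1.
hidden = np.array([0.85, 0.90, 0.05, 0.10])

# The output layer's raw scores (hypothetical) get pushed through a
# softmax, which DOES produce a probability distribution summing to 1.
scores = np.array([3.2, -0.7, -0.75, -3.7])
probs = np.exp(scores) / np.exp(scores).sum()
```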

And that’s it! That’s the structure of a Neural Network. Note that if our input waveform were different, the values inside each neuron would be different as well. This is good since we want our NN to be dynamic—capable of handling any input and arriving at a valid (probability-based) output.

Now, I’m sure you have questions. Questions such as:

  • How are the letters pieced together between the first and second layers?

  • How does the NN know how to piece together syllables?

  • What do you mean by “fully-trained, fully fine-tuned, fully-tested” Neural Network?

  • What’s the difference between this model example and what really goes on inside an NN?

Great questions! Like I said earlier, the world of Neural Networks is vast. Vast but approachable. We’ll have to tackle those questions in a future post.
