
Introduction to PyTorch Audio Data via TorchAudio

By Yujian Tang
Published Jun 27, 2022
Updated Jun 13, 2024

PyTorch is one of the leading machine learning frameworks in Python. Recently, PyTorch released an updated version of TorchAudio, its library for working with audio data. TorchAudio does more than just load audio for machine learning: it also supports the data transformations, augmentations, and feature extraction needed to use audio data with your machine learning models.

In this post, we'll cover setting up TorchAudio, audio data augmentation (sound effects, background noise, and room reverb), advanced resampling, and audio feature extraction.

Setting up PyTorch TorchAudio for Audio Data Augmentation

At the time of writing, torchaudio is on version 0.11.0 and only works with Python versions 3.6 to 3.9. For this example, we’ll be using Python 3.9. We’ll also need to install some libraries before we dive in. The first libraries we’ll need are torch and torchaudio from PyTorch. We’ll be using matplotlib to plot our visual representations, requests to get the data, and librosa to do some more visual manipulations for spectrograms.

To get started, we'll pip install all of these into a new virtual environment. Create the virtual environment with python3 -m venv <new environment name>, activate it, and then run pip install torch torchaudio matplotlib requests librosa to install all the libraries necessary for this tutorial.

Adding Effects for Audio Data Augmentation with PyTorch TorchAudio

Recently, we covered the basics of how to manipulate audio data in Python. In this section we’re going to cover the basics of how to pass sound effect options to TorchAudio. Then, we’ll go into specifics about how to add background noise at different sound levels and how to add room reverb.

Before we get into that, we have to set some things up. The code in this section is entirely auxiliary, so you can skip it, but it's worth understanding if you'd like to keep experimenting with the provided data.

In the code block below, we first import all the libraries we need. Then, we define the URLs where the audio data is stored and the local paths we’ll store the audio at. Next, we fetch the data and define some helper functions.

For this example, we’ll define functions to get a noise, speech, and reverb sample. We will also define functions to plot the waveform, spectrogram, and numpy representations of the sounds that we are working with.
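
Here's a minimal sketch of what that setup might look like. The URLs and file paths are placeholders, and the helper names (get_speech_sample, get_noise_sample, get_rir_sample, plot_waveform, plot_specgram) are assumptions modeled on how the rest of this post refers to them; substitute your own audio clips.

```python
import os

import requests
import torch
import torchaudio
import matplotlib.pyplot as plt

# Placeholder URLs and local paths -- swap in the clips you want to work with.
SAMPLE_SPEECH_URL = "https://example.com/speech.wav"
SAMPLE_SPEECH_PATH = "_assets/speech.wav"
SAMPLE_NOISE_URL = "https://example.com/background_noise.wav"
SAMPLE_NOISE_PATH = "_assets/background_noise.wav"
SAMPLE_RIR_URL = "https://example.com/room_impulse_response.wav"
SAMPLE_RIR_PATH = "_assets/room_impulse_response.wav"

def fetch(url, path):
    """Download a file once and cache it locally."""
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(requests.get(url).content)

def get_sample(url, path, resample=None):
    """Load an audio file, optionally resampling it to a new rate."""
    fetch(url, path)
    waveform, sample_rate = torchaudio.load(path)
    if resample is not None:
        waveform = torchaudio.functional.resample(waveform, sample_rate, resample)
        sample_rate = resample
    return waveform, sample_rate

def get_speech_sample(resample=None):
    return get_sample(SAMPLE_SPEECH_URL, SAMPLE_SPEECH_PATH, resample)

def get_noise_sample(resample=None):
    return get_sample(SAMPLE_NOISE_URL, SAMPLE_NOISE_PATH, resample)

def get_rir_sample(resample=None):
    return get_sample(SAMPLE_RIR_URL, SAMPLE_RIR_PATH, resample)

def plot_waveform(waveform, sample_rate, title="Waveform"):
    """Plot each channel of a waveform against time in seconds."""
    waveform = waveform.numpy()
    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(num_frames) / sample_rate
    fig, axes = plt.subplots(num_channels, 1, squeeze=False)
    for ch in range(num_channels):
        axes[ch][0].plot(time_axis, waveform[ch])
    fig.suptitle(title)
    plt.show()

def plot_specgram(waveform, sample_rate, title="Spectrogram"):
    """Plot a spectrogram of each channel with matplotlib's specgram."""
    waveform = waveform.numpy()
    num_channels, _ = waveform.shape
    fig, axes = plt.subplots(num_channels, 1, squeeze=False)
    for ch in range(num_channels):
        axes[ch][0].specgram(waveform[ch], Fs=sample_rate)
    fig.suptitle(title)
    plt.show()
```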

Using Sound Effects in Torchaudio

Now that we’ve set everything up, let’s take a look at how to use PyTorch’s torchaudio library to add sound effects. We’re going to pass a list of lists of strings (List[List[str]]) to the sox_effects.apply_effects_tensor function from torchaudio.

Each of the internal lists in our list of lists defines one effect. The first string in the sequence names the effect, and the following entries are the parameters for how to apply it. In the example below, we show how to add a lowpass filter, change the speed, and add some reverb. For a full list of available sound effects, check out the sox documentation. Note: this function returns two values, the waveform and the new sample rate.
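
Here's a hedged example of what that call can look like, using the speech helper sketched above. The specific effects and parameter values are illustrative rather than the exact ones from the original post.

```python
# Load the sample speech clip with the helper defined in the setup section.
speech, sample_rate = get_speech_sample()

effects = [
    ["lowpass", "-1", "300"],    # single-pole lowpass filter at 300 Hz
    ["speed", "0.8"],            # slow the audio down (this changes the sample rate...)
    ["rate", f"{sample_rate}"],  # ...so resample back to the original rate
    ["reverb", "-w"],            # add reverberation (produces a multichannel output)
]

# Returns two values: the modified waveform and its (possibly new) sample rate.
waveform2, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(
    speech, sample_rate, effects
)

plot_waveform(waveform2, sample_rate2, title="Waveform with effects")
plot_specgram(waveform2, sample_rate2, title="Spectrogram with effects")
```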

The plots of the waveforms and spectrograms are below. Notice that adding the reverb produces a multichannel waveform. You can see the difference the effects make in both the waveform and the spectrogram: lowering the speed lengthened the sound, the lowpass filter compresses some of the sound (visible in the spectrogram), and the reverb adds noise that shows up mainly in the “skinnier”, quieter sections of the waveform.

Above: Original Waveform and Spectrogram + Added Effects from TorchAudio

Adding Background Noise

Now that we know how to add effects to audio using torchaudio, let’s dive into some more specific use cases. If your model needs to be able to detect audio even when there’s background noise, it’s a good idea to add some background noise to your training data.

In the example below, we will start by declaring a sample rate (8000 Hz is a pretty typical rate). Next, we’ll call our helper functions to get the speech and background noise and reshape the noise to match the speech. After that, we’ll use the norm function to compute the L2 norm of both the speech and the noise. Next, we’ll define a list of signal-to-noise ratios (in decibels) at which we want to play the background noise over the speech, and create a noisy version of the speech at each level.
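
Here's one way that mixing might look, assuming the helpers sketched in the setup section. The SNR-to-scale conversion below is a standard amplitude-ratio calculation, not necessarily the exact code from the original post.

```python
sample_rate = 8000

speech, _ = get_speech_sample(resample=sample_rate)
noise, _ = get_noise_sample(resample=sample_rate)
noise = noise[:, : speech.shape[1]]  # trim the noise to the length of the speech

# L2 norms act as a stand-in for signal power when computing the mixing scale.
speech_power = speech.norm(p=2)
noise_power = noise.norm(p=2)

noisy_speeches = {}
for snr_db in [20, 10, 3]:
    # Convert the desired SNR in dB to an amplitude ratio and scale the speech.
    snr = 10 ** (snr_db / 20)
    scale = snr * noise_power / speech_power
    noisy_speeches[snr_db] = (scale * speech + noise) / 2

for snr_db, noisy in noisy_speeches.items():
    plot_waveform(noisy, sample_rate, title=f"SNR {snr_db} dB")
    plot_specgram(noisy, sample_rate, title=f"SNR {snr_db} dB")
```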

The above pictures show the waveform and the spectrogram of the background noise. We have already created all the noisy speech clips in the code above. The code below plots all of them so we can see what the data looks like at each noise level. Note that a 20 dB SNR means that the signal (speech) to noise (background noise) ratio is 20 dB, not that the noise is being played at 20 dB.

Above: 20 and 10 dB SNR added background noise visualizations via PyTorch TorchAudio

Above: 3 dB signal to noise ratio waveform and spectrogram for added background noise

Adding Room Reverberation

So far we’ve applied audio effects and background noise at different noise levels. Let’s also take a look at how to add a reverb. Adding reverb to an audio clip gives the impression that the audio has been recorded in an echo-y room. You can do this to make it seem like a presentation you gave to your computer was actually given to an audience in a theater.

To add a room reverb, we’re going to start by fetching the audio from where it lives online using one of the functions we made above (get_rir_sample). We’ll take a look at the waveform, then clip it down to just the impulse response of the room, normalize it, and flip it along the time axis so that the convolution applies the reverb correctly.
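
Here's a sketch of that preparation, assuming the get_rir_sample helper from the setup section. The clipping bounds are illustrative and depend on the recording you use.

```python
sample_rate = 8000

rir_raw, _ = get_rir_sample(resample=sample_rate)
plot_waveform(rir_raw, sample_rate, title="Raw room impulse response")

# Keep just the portion of the recording that contains the impulse response
# (the exact window depends on the recording; these bounds are illustrative).
rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]

# Normalize the impulse response and flip it along the time axis, because
# conv1d computes a cross-correlation rather than a true convolution.
rir = rir / torch.norm(rir, p=2)
rir = torch.flip(rir, [1])
```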

Above: Original and augmented reverb sound visualizations from PyTorch TorchAudio

Once we have the impulse response normalized and flipped, we’re ready to use it to augment the existing audio. We will first use PyTorch to pad the speech by the length of the impulse response, and then apply the impulse response to the speech with a 1-dimensional convolution.
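
Continuing from the impulse response prepared above, the padding and convolution might look like this:

```python
import torch.nn.functional as F_nn  # aliased so it doesn't clash with torchaudio.functional

speech, _ = get_speech_sample(resample=sample_rate)

# Left-pad the speech by the impulse response length so the output keeps its length.
speech_padded = F_nn.pad(speech, (rir.shape[1] - 1, 0))

# Apply the (flipped) impulse response with a 1-D convolution.
# conv1d expects (batch, channels, time), hence the added leading dimension.
augmented = F_nn.conv1d(speech_padded[None, ...], rir[None, ...])[0]

plot_waveform(augmented, sample_rate, title="Speech with room reverb")
plot_specgram(augmented, sample_rate, title="Speech with room reverb")
```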

Above: Visualizations for audio with reverb applied by TorchAudio

From the printout above we can see that adding the room reverb adds echo-like sounds to the waveform. We can also see that the spectrogram is less defined than it would be for a crisp, next-to-the-mic recording.

Advanced Resampling of Audio Data with TorchAudio

We briefly covered how to resample data before, using the pydub and sklearn libraries. TorchAudio also lets you easily resample audio data, and it offers multiple methods for doing so. In this section, we’ll cover how to resample data using the low-pass filter width, rolloff, and window parameters.

As we have done above, we need to set up a bunch of helper functions before we get into actually resampling the data. Many of these setup functions serve the same purpose as the ones above. The one to pay attention to here is get_sine_sweep, which is what we’ll use instead of an existing audio file. All the other functions, like getting ticks and reverse log frequencies, are for plotting the data.

I put the two torchaudio imports here to clarify that these are the T and F letters we’ll be pulling functions from (as opposed to true and false!). We’ll also declare a sample rate and a resample rate; it doesn’t really matter what these are, so feel free to change them as it suits you.

The first thing we’ll do is create a waveform using the get_sine_sweep function. Then, we’ll do a resampling without passing any parameters. Next, we’ll take a look at what the sweeps look like when we use a low pass filter width parameter. For this, we’ll need the functional torchaudio package.
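
Here's a sketch of that setup. The get_sine_sweep helper below is a simplified stand-in (a linear chirp up to the Nyquist frequency), not necessarily the exact sweep used in the original post, and the sample rates are just example values.

```python
import math

import torch
import torchaudio.transforms as T
import torchaudio.functional as F

sample_rate = 48000
resample_rate = 32000

def get_sine_sweep(sample_rate, duration=5.0):
    """Hypothetical stand-in for the tutorial helper: a chirp sweeping from
    0 Hz up to the Nyquist frequency over `duration` seconds."""
    nyquist = sample_rate / 2
    t = torch.linspace(0, duration, int(sample_rate * duration))
    # Instantaneous frequency rises linearly, so the phase is quadratic in time.
    phase = 2 * math.pi * (nyquist / (2 * duration)) * t**2
    return torch.sin(phase).unsqueeze(0)  # shape: (1, num_samples)

sweep = get_sine_sweep(sample_rate)

# Resampling with the Resample transform, no extra parameters.
resampler = T.Resample(sample_rate, resample_rate)
resampled_default = resampler(sweep)

# Resampling with the functional API and an explicit low-pass filter width.
# 6 is torchaudio's default, so this matches the transform above; 128 gives
# a much sharper (but more expensive) filter.
resampled_width_6 = F.resample(sweep, sample_rate, resample_rate, lowpass_filter_width=6)
resampled_width_128 = F.resample(sweep, sample_rate, resample_rate, lowpass_filter_width=128)
```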

Technically, there are infinitely many frequencies, so a low-pass filter cuts off sound above a certain frequency. The low-pass filter width determines the window size of this filter. Torchaudio’s default is 6, so our first and second resamplings are the same. Larger values here result in a sharper, more precise filter.

Above: Basic and Low Pass Filter Example Spectrogram from TorchAudio

Filters are not the only thing we can adjust when resampling. In the example code below, we’ll use both the default Hann window and the Kaiser window; both windows shape the interpolation filter automatically. We can also adjust the rolloff when resampling. In our examples, we’ll use rolloff values of 0.99 and 0.8. The rolloff is expressed as a fraction of the Nyquist frequency: frequencies above it are attenuated, which reduces aliasing.
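
Continuing with the sweep and rates defined above, a sketch of the window and rolloff variants might look like this. Note that the resampling_method names shown here are the ones used around torchaudio 0.11; later releases rename them.

```python
# Hann window (torchaudio's default interpolation) vs. Kaiser window.
hann_resampled = F.resample(
    sweep, sample_rate, resample_rate, resampling_method="sinc_interpolation"
)
kaiser_resampled = F.resample(
    sweep, sample_rate, resample_rate, resampling_method="kaiser_window"
)

# Rolloff: the filter cutoff as a fraction of the Nyquist frequency.
# Lower values attenuate more of the top of the band, trading bandwidth
# for less aliasing.
rolloff_99 = F.resample(sweep, sample_rate, resample_rate, rolloff=0.99)
rolloff_80 = F.resample(sweep, sample_rate, resample_rate, rolloff=0.8)
```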

Above: Windowed and Rolloff parameter resampling visualizations from TorchAudio

Audio Feature Extraction with PyTorch TorchAudio

So far we’ve taken a look at the many ways we can use torchaudio to manipulate our audio data. Now let’s take a look at how to do feature extraction with torchaudio. As we did in the two sections above, we’ll start by setting up.

Our setup functions will include functions to fetch the data and visualize it, like in the “effects” section above. We also add some functions for building mel-scale buckets. We will use the mel-scale buckets to make mel-frequency cepstral coefficients (MFCCs), which represent audio timbre.

The first thing we’re going to do here is create a spectrogram and then reverse it: waveform to spectrogram and back again. Why is converting a waveform to a spectrogram useful for feature extraction? This representation is helpful for extracting spectral features like frequency, timbre, density, rolloff, and more.

We’ll define some constants before we create our spectrogram and reverse it. First, we want to define n_fft, the size of the fast Fourier transform, then the window length (the size of the window) and the hop length (the distance between adjacent short-time Fourier transform windows). Then, we’ll call torchaudio to transform our waveform into a spectrogram. To turn the spectrogram back into a waveform, we’ll use the GriffinLim transform from torchaudio with the same parameters we used to create the spectrogram.
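
Here's a minimal sketch of that round trip, assuming the speech helper from the setup section. The n_fft and hop_length values are illustrative.

```python
import torchaudio.transforms as T

speech, sample_rate = get_speech_sample()

n_fft = 1024        # size of the fast Fourier transform
win_length = None   # defaults to n_fft
hop_length = 512    # distance between adjacent STFT windows

# Waveform -> power spectrogram.
spectrogram = T.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    power=2.0,
)
spec = spectrogram(speech)

# Spectrogram -> waveform, using the Griffin-Lim algorithm with the same parameters.
griffin_lim = T.GriffinLim(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
)
reconstructed = griffin_lim(spec)

plot_waveform(reconstructed, sample_rate, title="Reconstructed with Griffin-Lim")
```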

Above: Creating and reversing a spectrogram in PyTorch

Let’s take a look at one of the more interesting things we can do with spectral features: the mel-frequency cepstrum. The mel-frequency cepstral coefficients (MFCCs) represent the timbre of the audio. Before we start computing these coefficients, we’ll define the number of mel filterbanks (256) and a new sample rate to play with.

The first thing we need for MFCC is the mel filterbanks. Once we have the mel filterbanks, we’ll use them to get the mel spectrogram. Now we’re ready to get the coefficients: first we define how many coefficients we want, then we use the mel filterbanks and the mel spectrogram to create an MFCC diagram. This is what our mel spectrogram looks like when reduced to the number of coefficients we specified above.
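
Here's one way to get there with torchaudio's transforms. This sketch uses T.MelSpectrogram and T.MFCC, which build the mel filterbanks internally, rather than constructing the filterbanks by hand; the parameter values are illustrative.

```python
import torchaudio.transforms as T

speech, sample_rate = get_speech_sample()

n_fft = 2048
hop_length = 512
n_mels = 256   # number of mel filterbanks
n_mfcc = 256   # number of cepstral coefficients to keep

# Mel spectrogram: a spectrogram whose frequency axis is binned into mel-scale filterbanks.
mel_spectrogram = T.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    hop_length=hop_length,
    n_mels=n_mels,
)
melspec = mel_spectrogram(speech)

# MFCC: compute a mel spectrogram with the same parameters, then reduce it
# to the requested number of cepstral coefficients.
mfcc_transform = T.MFCC(
    sample_rate=sample_rate,
    n_mfcc=n_mfcc,
    melkwargs={"n_fft": n_fft, "hop_length": hop_length, "n_mels": n_mels},
)
mfcc = mfcc_transform(speech)
```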

Above: MFCC Feature Extraction of Audio Data with PyTorch TorchAudio

In Summary

In this epic post, we covered the basics of how to use the torchaudio library from PyTorch. We saw that we can use torchaudio to do detailed and sophisticated audio manipulation. The specific examples we went over are adding sound effects, background noise, and room reverb.

TorchAudio also provides other audio manipulation methods, such as advanced resampling. In our resampling examples, we showed how to use multiple functions and parameters from TorchAudio’s functional and transforms libraries to resample with different filters. We used low-pass filter widths, rolloff values, and window filters.

Finally, we covered how to use TorchAudio for feature extraction. We showed how to create a spectrogram to get spectral features, reverse that spectrogram with the Griffin-Lim formula, and how to create and use mel-scale bins to get mel-frequency cepstral coefficients (MFCC) features.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
