A Guide to DeepSpeech Speech to Text

By Yujian Tang
Published Aug 1, 2022 · Updated Jun 13, 2024

No, we’re not talking about you, Cthulhu. This is a different type of DeepSpeech. The DeepSpeech we’re talking about today is a Python speech to text library. Speech to text is part of Natural Language Processing (NLP). Automatic speech recognition, or ASR, started out as an offshoot of NLP in the 1990s.

Today, there are tons of audio libraries that can help you manipulate audio data, such as DeepSpeech and PyTorch. In this post, we will use DeepSpeech to do both asynchronous and real time speech transcription. We will cover what DeepSpeech is, how to set it up locally, and how to build a command line tool that runs speech to text on both WAV files and live microphone audio.

What is DeepSpeech?

DeepSpeech is an open source Python library that enables us to build automatic speech recognition systems. It is based on Baidu’s 2014 paper titled Deep Speech: Scaling up end-to-end speech recognition.

The initial proposal for Deep Speech was simple: create a speech recognition system based entirely on deep learning. The paper describes a solution using RNNs trained on multiple GPUs with no concept of phonemes. The authors, Hannun et al., show that their solution outperformed the existing solutions at the time and was more robust to background noise without a need for filtering.

Since then, Mozilla has maintained the open source Python package for DeepSpeech. Before moving on, it’s important to note that DeepSpeech is not yet compatible with Python 3.10, nor with some more recent versions of *nix kernels. I suggest using a virtual machine or Docker container to develop with DeepSpeech on unsupported OSes.

Set Up for Local Speech to Text with DeepSpeech

To use DeepSpeech, we have to install a few libraries. We need deepspeech, numpy, and webrtcvad. We can install all of these by running pip install deepspeech numpy webrtcvad. The webrtcvad library is the voice activity detection (VAD) library developed by Google for WebRTC (real time communication).

For the asynchronous transcription, we’re going to need three files: one file to handle interaction with WAV data, one file to transcribe speech to text on a WAV file, and one to use these two in the command line. We will also be using a pretrained DeepSpeech model and scorer. You can download the pretrained model and scorer by running the following lines in your terminal:
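The exact download commands depend on which DeepSpeech release matches your installed deepspeech package; assuming the v0.9.3 release, the model and scorer can be fetched from Mozilla's GitHub releases like this:

curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm

curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer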

File Handler for DeepSpeech Speech Transcription

The first file we create is the WAV handling file. This file should be named something like wav_handler.py. We import three built-in libraries to do this: wave, collections, and contextlib. We create four functions and one class. We need one function each to read WAV files, write WAV files, create Frames, and detect voice activity. Our one class represents individual frames in the WAV file.

Reading Audio Data from a WAV file

Let’s start with creating a function to read WAV files. This function takes one input: the path to a WAV file. The function uses the contextlib library to open the WAV file and read in the contents as bytes. Next, we run multiple asserts on the WAV file: it must have one channel, a sample width of 2 bytes, and a sample rate of 8, 16, or 32 kHz.

Once we have asserted that the WAV file is in the right format for processing, we extract the frames. Next, we extract the PCM data from the frames and the duration from the metadata. Finally, we return the PCM data, the sample rate, and the duration.
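Here is a minimal sketch of that reader, assuming the file is saved as wav_handler.py as described above (variable names are illustrative):

```python
import contextlib
import wave


def read_wave(path):
    """Read a WAV file and return its PCM data, sample rate, and duration."""
    with contextlib.closing(wave.open(path, 'rb')) as wf:
        # The rest of the pipeline assumes mono, 16-bit audio at 8, 16, or 32 kHz.
        assert wf.getnchannels() == 1
        assert wf.getsampwidth() == 2
        sample_rate = wf.getframerate()
        assert sample_rate in (8000, 16000, 32000)
        frames = wf.getnframes()
        pcm_data = wf.readframes(frames)   # raw PCM bytes
        duration = frames / sample_rate    # length in seconds
    return pcm_data, sample_rate, duration
```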

Writing Audio Data to a WAV file

Now let’s create the function to write audio data to a WAV file. This function requires three parameters: the path of the file to write to, the audio data, and the sample rate. It writes a WAV file with the same properties that the read function asserts. All we do here is set the channels, sample width, and frame rate, and then write the audio frames.
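A sketch of the writer (the name write_wave is my own choice; the article only describes its behavior), mirroring the assertions made in read_wave above:

```python
import contextlib
import wave


def write_wave(path, audio, sample_rate):
    """Write raw PCM bytes out as a mono, 16-bit WAV file."""
    with contextlib.closing(wave.open(path, 'wb')) as wf:
        wf.setnchannels(1)            # one channel, as read_wave asserts
        wf.setsampwidth(2)            # 2 bytes per sample (16-bit audio)
        wf.setframerate(sample_rate)
        wf.writeframes(audio)
```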

Creating Frames of Audio Data for DeepSpeech to Transcribe

We’re going to create a class called Frame to hold some information to represent our audio data and make it easier to handle. This object requires three parameters to be created: the bytes, the timestamp in the audio file, and the duration of the Frame.
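The class itself can be as small as this sketch:

```python
class Frame(object):
    """A slice of audio: raw PCM bytes, where it starts, and how long it lasts."""

    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes          # raw PCM bytes for this frame
        self.timestamp = timestamp  # start time within the audio, in seconds
        self.duration = duration    # frame length, in seconds
```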

We also need to create a function to create frames. You can think of this function as a frame generator or a frame factory that returns an iterator. This function requires three parameters: the frame duration in milliseconds, the audio data, and the sample rate.

This function starts by deriving the size of a frame in bytes from the passed-in sample rate and frame duration in milliseconds. We start at an offset and timestamp of 0. We also compute a duration constant equal to the length of each frame in seconds.

While there is at least one full frame of audio left past the current offset, we yield a Frame for that interval and then increment the timestamp and offset accordingly.
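Putting the last few paragraphs together, a sketch of the frame generator (building on the Frame class above; it assumes 16-bit samples, so each sample is two bytes):

```python
def frame_generator(frame_duration_ms, audio, sample_rate):
    """Yield Frame objects of frame_duration_ms each from 16-bit mono PCM bytes."""
    # Bytes per frame: samples per frame times 2 bytes per sample.
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
    offset = 0
    timestamp = 0.0
    duration = (float(n) / sample_rate) / 2.0  # length of one frame, in seconds
    while offset + n < len(audio):
        yield Frame(audio[offset:offset + n], timestamp, duration)
        timestamp += duration
        offset += n
```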

Collecting Voice Activated Frames for Speech to Text with DeepSpeech

Next, let’s create a function to collect all the frames that contain voice. This function requires a sample rate, the frame duration in milliseconds, the padding duration in milliseconds, a voice activity detector (VAD) from webrtcvad, and the audio data frames.

The VAD algorithm uses a padded ring buffer and checks what percentage of the frames in the window are voiced. When the window reaches 90% voiced frames, the collector triggers and starts accumulating audio frames. While triggered, if the percentage of voiced frames in the window drops below 10%, it yields the accumulated segment and stops until the next trigger.
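A sketch of that collector, assuming the Frame objects and frame generator above and a webrtcvad.Vad instance:

```python
import collections


def vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames):
    """Yield byte strings of voiced audio, using a padded ring buffer to decide
    when speech starts and stops."""
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False        # are we currently inside a voiced segment?
    voiced_frames = []

    for frame in frames:
        is_speech = vad.is_speech(frame.bytes, sample_rate)
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            # Trigger once 90% of the window is voiced.
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                voiced_frames.extend(f for f, _ in ring_buffer)
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            # De-trigger once 90% of the window is unvoiced and yield the segment.
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield b''.join(f.bytes for f in voiced_frames)
                ring_buffer.clear()
                voiced_frames = []

    # Flush whatever voiced audio is left at the end of the file.
    if voiced_frames:
        yield b''.join(f.bytes for f in voiced_frames)
```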

Transcribe Speech to Text for WAV file with DeepSpeech

We’re going to create a new file for this section. This file should be named something like wav_transcriber.py. This layer completely abstracts out WAV handling from the CLI (which we create below). We use these functions to call DeepSpeech on the audio data and transcribe it.

Pick Which DeepSpeech Model to Use

The first function we create in this file is the function that loads up the model and scorer for DeepSpeech to run speech to text with. This function takes two parameters: the model graph (we create a function below to locate it) and the path to the scorer file. All it does is load the model from the graph and enable the use of the scorer. This function returns a DeepSpeech Model object.
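A minimal sketch, assuming the file is saved as wav_transcriber.py as described above:

```python
from deepspeech import Model


def load_model(models, scorer):
    """Load the DeepSpeech acoustic model and enable the external scorer."""
    ds = Model(models)               # path to the .pbmm model graph
    ds.enableExternalScorer(scorer)  # path to the .scorer file
    return ds
```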

Speech to Text on an Audio File with DeepSpeech

This function is the one that does the actual speech recognition. It takes three inputs, a DeepSpeech model, the audio data, and the sample rate.

We begin by setting the inference time to 0 and calculating the length of the audio. All we really have to do is call the DeepSpeech model’s stt function: we pass it the audio data and return the output.
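A sketch of that wrapper; timing the inference with timeit's default_timer is an assumption on my part, but it is handy to return alongside the transcript:

```python
from timeit import default_timer as timer


def stt(ds, audio, sample_rate):
    """Run DeepSpeech inference on a chunk of 16-bit PCM audio."""
    inference_time = 0.0
    audio_length = len(audio) * (1 / sample_rate)  # length in seconds, useful for logging

    start = timer()
    output = ds.stt(audio)   # audio should be an int16 numpy array
    inference_time += timer() - start

    return output, inference_time
```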

DeepSpeech Model Graph Creator Function

This is the function that creates the model graph for the load_model function we created a couple of sections above. This function takes the path to a directory. From that directory, it looks for files with the DeepSpeech model extension, .pbmm, and the DeepSpeech scorer file extension, .scorer. Then, it returns the paths to both files.
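A sketch of that lookup; the name resolve_models is my own choice for it:

```python
import glob


def resolve_models(dir_name):
    """Return the first .pbmm model and .scorer file found in a directory."""
    model = glob.glob(dir_name + "/*.pbmm")[0]
    scorer = glob.glob(dir_name + "/*.scorer")[0]
    return model, scorer
```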

Voice Activation Detection to Create Segments for Speech to Text

The last function in our WAV transcription file generates segments of text that contain voice. We use the WAV handler file we created earlier and webrtcvad to do the heavy lifting. This function requires two parameters: a WAV file and an integer value from 0 to 3 representing how aggressively we want to filter out non-voice activity.

We call the read_wave function from the wav_handler.py file we created earlier and imported above to get the audio data, sample rate, and audio length. We then assert that the sample rate is 16 kHz before moving on to create a VAD object. Next, we call the frame generator from wav_handler.

We convert the generated iterator to a list which we pass to the vad_collector function from wav_handler along with the sample rate, frame duration (30 ms), padding duration (300 ms), and VAD object. Finally, we return the collected VAD segments along with the sample rate and audio length.
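Putting the last two paragraphs together, a sketch of the segment generator, assuming the earlier sketches were saved as wav_handler.py:

```python
import webrtcvad

import wav_handler  # the WAV handling module sketched earlier


def vad_segment_generator(wav_file, aggressiveness):
    """Split a WAV file into voiced segments using webrtcvad."""
    audio, sample_rate, audio_length = wav_handler.read_wave(wav_file)
    assert sample_rate == 16000, "Only 16 kHz WAV files are supported"

    vad = webrtcvad.Vad(int(aggressiveness))  # 0 filters the least, 3 the most
    frames = list(wav_handler.frame_generator(30, audio, sample_rate))
    segments = wav_handler.vad_collector(sample_rate, 30, 300, vad, frames)

    return segments, sample_rate, audio_length
```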

DeepSpeech CLI for Real Time and Asynchronous Speech to Text

Everything is set up to transcribe audio data with DeepSpeech via pretrained models. Now, let’s look at how to turn the functionality we created above into a command line interface for real time and asynchronous speech to text. We start by importing a bunch of libraries for operating with the command line: sys, os, logging, argparse, subprocess, and shlex. We also need to import numpy and the wav_transcriber we made above to work with the audio data.

Reading Arguments for DeepSpeech Speech to Text

We create a main function that takes one parameter, args. These are the arguments passed in through the command line. We use the argparse library to parse the arguments sent in. We also create helpful tips on how to use each one.

We use aggressive to determine how aggressively we want to filter. audio directs us to the audio file path. model points us to the directory containing the model and scorer. Finally, stream dictates whether or not we are streaming audio. Neither stream nor audio is required, but one or the other must be present.
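A sketch of the argument parsing; the flag names follow the description above, and pulling the logic into a parse_args helper is my own choice:

```python
import argparse


def parse_args(args):
    """Parse the CLI flags described above."""
    parser = argparse.ArgumentParser(
        description='Transcribe a WAV file or a live mic stream with DeepSpeech')
    parser.add_argument('--aggressive', type=int, default=1, choices=[0, 1, 2, 3],
                        help='VAD aggressiveness: 0 filters the least, 3 the most')
    parser.add_argument('--audio', help='Path to the WAV file to transcribe')
    parser.add_argument('--model', required=True,
                        help='Directory containing the .pbmm model and .scorer files')
    parser.add_argument('--stream', action='store_true',
                        help='Stream audio from the microphone instead of a file')
    parsed = parser.parse_args(args)
    # Neither flag is required on its own, but one of the two must be present.
    if not parsed.audio and not parsed.stream:
        parser.error('Pass either --audio or --stream')
    return parsed
```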

Using DeepSpeech for Real Time or Asynchronous Speech Recognition

This is still inside the main function we started above. Once we parse all the arguments, we load up DeepSpeech. First, we get the directory containing the models. Next, we call the wav_transcriber to resolve and load the models.

If we pass the path to an audio data file in the command line, then we will run asynchronous speech recognition. The first thing we do for that is call the VAD segment generator to generate the VAD segments and get the sample rate and audio length. Next, we open up a text file to transcribe to.

For each of the enumerated segments, we use numpy to pull the segment out of the buffer as an int16 array and pass it to the speech to text function from wav_transcriber. We write each transcript to the text file until we run out of audio segments.
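A sketch of that asynchronous path, assuming the transcription sketches above were saved as wav_transcriber.py; the transcribe_file helper and the transcript.txt output name are illustrative:

```python
import numpy as np

import wav_transcriber  # the transcription module sketched earlier


def transcribe_file(model, wav_path, aggressiveness, out_path='transcript.txt'):
    """VAD-split a WAV file and run DeepSpeech on each voiced segment."""
    segments, sample_rate, audio_length = wav_transcriber.vad_segment_generator(
        wav_path, aggressiveness)

    with open(out_path, 'w') as f:
        for i, segment in enumerate(segments):
            # Each segment is raw PCM bytes; DeepSpeech wants an int16 array.
            audio = np.frombuffer(segment, dtype=np.int16)
            text, inference_time = wav_transcriber.stt(model, audio, sample_rate)
            f.write(text + ' ')
```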

If we pass stream instead of audio, then we open up the mic to stream audio data in. If you don’t need real time automatic speech recognition, then you can ignore this part. First, we have to spin up a subprocess to open up a mic to stream in real time just like we did with PyTorch local speech recognition.

We use the subprocess and shlex libraries to open the mic to stream voice audio until we shut it down. The model will read 512 bytes of audio data at a time and transcribe it.
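A sketch of the streaming path; it assumes SoX's rec utility is installed to capture 16 kHz, 16-bit mono audio from the mic, and it uses DeepSpeech's streaming API (createStream, feedAudioContent, finishStream):

```python
import shlex
import subprocess

import numpy as np


def transcribe_mic(model):
    """Stream mic audio into DeepSpeech until Ctrl-C, then print the transcript."""
    stream = model.createStream()
    rec_cmd = 'rec -q -V0 -e signed -L -c 1 -b 16 -r 16k -t raw - gain -2'
    proc = subprocess.Popen(shlex.split(rec_cmd), stdout=subprocess.PIPE, bufsize=0)
    print('Listening... press Ctrl-C to stop.')
    try:
        while True:
            data = proc.stdout.read(512)  # 512 bytes of raw audio at a time
            stream.feedAudioContent(np.frombuffer(data, np.int16))
    except KeyboardInterrupt:
        print('Transcript:', stream.finishStream())
    finally:
        proc.terminate()
        proc.wait()
```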

Summary

We started this post out with a high level view of DeepSpeech, an open source speech recognition software. It was inspired by a 2014 paper from Baidu and is currently maintained by Mozilla.

After a basic introduction, we stepped into a guide on how to use DeepSpeech to locally transcribe speech to text. While it may have been possible to create all of this code in one document, we opted for a modular approach with principles of software engineering in mind.

We created three modules. One to handle WAV files, which are the audio data files that we can use DeepSpeech to transcribe. One to transcribe from WAV files, and one more file to create a command line interface to use DeepSpeech. Our CLI allows us to pass in options to pick if we want to do real time speech recognition or run speech recognition on an existing WAV audio file.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub Discussions.
