Benchmarking Top Open Source Speech Recognition Models: Whisper, Facebook wav2vec2, and Kaldi
Ten years ago, Dan Povey and his team of researchers at Johns Hopkins developed Kaldi, an open-source toolkit for speech recognition. Kaldi quickly became the ASR tool of choice for countless developers and researchers. Since the introduction of Kaldi, GitHub has been inundated with open-source ASR models and toolkits. Some open-source projects you've probably heard of include wav2letter++, openseq2seq, vosk, SpeechBrain, Nvidia Nemo, and Fairseq.
Continuing this trend, in September 2022, OpenAI introduced Whisper, an open-source ASR model trained on nearly 700,000 hours of multilingual speech data. According to OpenAI, Whisper "approaches human level robustness and accuracy on English speech recognition." It also lets you transcribe in almost 100 different languages and translate from several languages into English. I recently had a chance to test it, and I must admit that I was pretty impressed!
Open-source speech models are an important enabler for developers looking to incorporate a voice component into their applications. However, there are also a lot of these models available, so choosing the right one can be difficult. If you're a developer and you're looking to navigate the sea of open-source models, then you will need a few questions answered. First, how do available models compare in terms of usability? Will you have to read 10 papers and 17 blogs, then get your Ph.D. in Turbo Encabulators to get the model working? Or will you be up and running in five minutes after scanning the GitHub README? Second, how do different models perform in terms of accuracy and speed? Will the model get enough words right and be sufficiently fast to adequately serve your use case?
Here, we demonstrate how one could go about answering these questions by comparing some popular open-source models representing three "generations" of ASR technology:
Kaldi
wav2vec 2.0
Whisper
First, we describe the critical axes on which models differ—what we like to call "Model DNA"—and we discuss how different model DNA manifests itself in terms of usability, accuracy, and speed differences across our candidate models. Then comes the fun part: We put the models to the test! First, we benchmark them for accuracy by transcribing real-world audio from five different use cases of interest: conversational AI, phone calls, meetings, videos, and earnings calls. Finally, we benchmark the models for inference speed on GPU hardware.
Aspects of Model DNA: What Differentiates One ASR Model from Another?
Trained ASR models vary along a variety of dimensions. Below, we describe a few of the important ones:
Model Architecture
Model architecture refers to a relatively broad collection of characteristics. Most often, model architecture is talked about in terms of the types of neural network layers in the model, the order in which they are set up, and the links between them. Early speech models were actually a "pipeline" of several distinct models (acoustic model, pronunciation model, language model, etc.), each with its own unique architecture.
Modern approaches replace all of these components with a single "end-to-end" (e2e) deep learning network. A variety of different layer types have been shown to work well in e2e ASR models including convolutions, recurrent layers, and transformer blocks. In the ASR literature, you can find examples of models using pretty much any combination of these types of layers. However, in the world of available open-source models, the options tend to be a bit more limited.
E2E models can also be "multi-component" with regard to their architecture. Changes along the multi-component axis usually also involve different ways of training and decoding the models. Encoders are single-component models that map a sequence of audio features to the most likely sequence of words. They were the first class of e2e models to be introduced and are still in widespread use today.
They also happen to be the simplest and potentially the fastest of the e2e models. They are usually trained and decoded using an algorithm called Connectionist Temporal Classification (CTC). For our purposes, we only need to know that CTC encoders learn a weak internal representation of language. And as a result, they require some additional heavy machinery (e.g., CTC prefix beam search and language model re-scoring) to achieve high accuracy, which, in turn, makes them slow.
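To make the CTC decoding step concrete, here is a minimal sketch of the simplest (greedy) decoding strategy: pick the most likely symbol at each frame, collapse consecutive repeats, and drop the blank token. The toy vocabulary and random logits are purely illustrative and do not come from any particular model.

```python
# Minimal sketch of greedy CTC decoding: take the best symbol per frame,
# collapse consecutive repeats, and drop the blank token.
from itertools import groupby
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, vocab: list, blank_id: int = 0) -> str:
    # logits: (time_steps, vocab_size) frame-level scores from a CTC encoder
    best_ids = logits.argmax(axis=-1)
    return "".join(vocab[i] for i, _ in groupby(best_ids) if i != blank_id)

vocab = ["<blank>", " ", "a", "c", "t"]          # toy character vocabulary
logits = np.random.randn(50, len(vocab))         # stand-in for real encoder output
print(ctc_greedy_decode(logits, vocab))
```

The heavier machinery mentioned above (prefix beam search, language model re-scoring) replaces the single `argmax` per frame with a search over many candidate sequences, which is exactly where the extra latency comes from.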
Encoder/decoders are two-component models. The encoder produces an "encoded" representation of the audio features, and then an auto-regressive decoder predicts the words present in the audio, one word at a time, conditioning on its previously predicted outputs and using the encoder's output as context. Encoder/decoders can be trained with different combinations of loss functions, but the simplest approach is to apply cross-entropy loss to the decoder output using teacher forcing.
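As a rough illustration of what teacher-forced cross-entropy training looks like, here is a sketch with made-up tensor shapes and a hypothetical `model` object exposing `encoder` and `decoder` modules; it is not any particular model's training code.

```python
# Sketch of teacher-forced cross-entropy training for an encoder/decoder ASR model.
# `model` is a hypothetical object with .encoder and .decoder; shapes are illustrative.
import torch.nn.functional as F

def training_step(model, audio_features, target_tokens, pad_id=0):
    # audio_features: (batch, frames, feat_dim); target_tokens: (batch, seq_len) transcript IDs
    encoder_out = model.encoder(audio_features)        # (batch, frames, d_model)
    decoder_in = target_tokens[:, :-1]                 # feed the ground truth (teacher forcing)...
    labels = target_tokens[:, 1:]                      # ...and predict the next token at each step
    logits = model.decoder(decoder_in, encoder_out)    # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # flatten to (batch * steps, vocab_size)
        labels.reshape(-1),
        ignore_index=pad_id,                           # don't penalize padded positions
    )
```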
Encoder/decoders have a more complex architecture than standalone encoders because they have more interacting parts. But they learn a much stronger representation of language, and thus produce more accurate predictions than CTC encoders. There are also three-component models, called "transducers," which use an encoder, an auto-regressive decoder, and a third "joint" network that makes predictions based on the output of the other two.
Model Size (or Capacity)
Model capacity generally refers to the cumulative size of the model and is determined by the number of layers and their respective sizes. Many open-source models result from literature studies examining the effect of model capacity on accuracy in an attempt to measure so-called "scaling laws."
These studies typically involve training a sequence of increasing-capacity models where the capacity is incremented by increasing all size parameters simultaneously, in an ad hoc fashion. For a fixed architecture, larger capacity models tend to run more slowly than smaller capacity models because:
They simply require more computation and a lot of that is sequential in nature (i.e. more layers)
When inferencing on GPUs, they usually have to run in smaller batches and can't use batch-wise parallelism because of this.
In many cases, only very large models are open-sourced, which limits their usability for most end users. However, larger capacity models also tend to be more accurate although the extent of this effect depends on the scale of the training data.
Training Data
Open-source models vary considerably in the data which is used to train them. This is important because the ultimate accuracy of an ASR model depends strongly on both the breadth and depth of its training corpus. This dependence is especially crucial in understanding the latent accuracy characteristics of a model and how it generalizes to different types of speech data.
Most open-source models are trained on "academic" datasets like LibriSpeech, which are composed of clean, read speech. Far fewer are trained on real conversational audio with background noise, and even fewer on conversational audio spanning different domains and use cases (e.g., two-person phone calls with background speech, 20-person meetings, podcasts, earnings calls, fast food ordering transactions, etc.).
Audio Pre-processing
Audio pre-processing is a crucial, yet often overlooked component of ASR inference mechanics. It comprises several steps including transcoding the audio into a required format (e.g., 16-bit PCM), resampling it at a specified rate, splitting it into chunks of a specified size, deriving acoustic features (e.g., log-mel spectrograms) over the chunks, and then grouping chunks together to form batches for inference.
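A typical version of this pipeline, sketched here with torchaudio, might look like the following; the chunk length, mel dimension, and other parameters are illustrative defaults rather than any specific model's requirements.

```python
# Sketch of a typical pre-processing pipeline: resample to 16 kHz, split into
# fixed-size chunks, and compute log-mel features per chunk.
import torch
import torchaudio

def preprocess(path, target_sr=16_000, chunk_seconds=30, n_mels=80):
    waveform, sr = torchaudio.load(path)                        # decode to float PCM
    waveform = waveform.mean(dim=0, keepdim=True)               # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    chunk_len = target_sr * chunk_seconds
    chunks = torch.split(waveform, chunk_len, dim=1)            # non-overlapping chunks
    to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=target_sr, n_mels=n_mels)
    return [torch.log(to_mel(chunk) + 1e-6) for chunk in chunks]  # log-mel per chunk
```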
Despite its importance, audio pre-processing is usually not well described in open-source model documentation, and it may require delving deeply into the underlying source code to understand a particular model's audio pre-processing requirements. Open-source models and their associated toolkits offer varying levels of audio pre-processing support. In many cases, you may have to roll your own pipeline.
The Models
Next, let's introduce our candidate models and discuss some of their essential DNA.
Kaldi: Gigaspeech XL
Kaldi is a traditional "pipeline" ASR model composed of several distinct sub-models that operate sequentially. Philosophically, it reflects an academic approach to modeling speech: breaking the problem down into smaller, more manageable chunks and then having dedicated communities of human experts solve each problem chunk separately. Kaldi was eventually supplanted by e2e approaches at the dawn of the deep learning era for speech, when Baidu introduced DeepSpeech.
Despite having been around for more than a decade as a framework, Kaldi has relatively few open-source models available. For our comparison, we use Kaldi's Gigaspeech XL model, which is a conventional pipeline model trained on the recent Gigaspeech dataset.
Gigaspeech comprises 10k hours of labeled, conversational English speech, spanning a few domains. Based on published accuracy data, Gigaspeech XL appears to be the most accurate pipeline model ever produced, achieving results competitive with e2e approaches for in-domain evaluations on Gigaspeech. However, at the time of writing, only the acoustic model weights of the Gigaspeech XL pipeline were available. To use the Gigaspeech model, I borrowed the other required components (an ivector embedder and an RNN language model) from the Kaldi LibriSpeech pipeline.
This project was my first time using the Kaldi framework. From a usability perspective, I found it to be very tedious and difficult to work with. It is very much an academic research codebase and reminded me of messy, large-scale software projects that I worked on when I was in graduate school. It comprises a backend of C++ code with which the user interacts via bash scripts.
There are innumerable "example" scripts available from a collection of so-called Kaldi "recipes." Coupling those with a few tutorials available online, a novice user can orient themselves and eventually cobble together their own custom bash scripts to perform inference on their own data.
Unfortunately, as I learned, Kaldi does not natively handle long-form audio, and so you must perform some audio pre-processing of your own. Here are the pre-processing steps one must undertake to work with Kaldi (a short sketch of this pipeline follows the list):
Transcoding the audio to 16kHz PCM
Pre-chunking it into manageable sizes (I used non-overlapping 30 second snippets)
Staging the chunks as flat files on the disk along with some additional metadata
Using Kaldi's command line interface to generate and stage audio features over your audio snippets
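One way to script the transcoding, chunking, and staging steps is sketched below; the file naming and the placeholder speaker label are illustrative conventions of my own, not Kaldi requirements, and the final feature-generation step still goes through Kaldi's own command-line tools.

```python
# Sketch of the transcode, chunk, and stage steps needed before Kaldi can run.
# File naming and the placeholder speaker label are illustrative only.
import subprocess
from pathlib import Path

def stage_for_kaldi(audio_path, out_dir, chunk_seconds=30):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Transcode to 16 kHz, 16-bit mono PCM and split into non-overlapping chunks.
    subprocess.run([
        "ffmpeg", "-i", str(audio_path), "-ar", "16000", "-ac", "1",
        "-c:a", "pcm_s16le", "-f", "segment", "-segment_time", str(chunk_seconds),
        str(out_dir / "chunk_%04d.wav"),
    ], check=True)
    # Stage the chunks plus minimal metadata (wav.scp, utt2spk) as flat files.
    with open(out_dir / "wav.scp", "w") as scp, open(out_dir / "utt2spk", "w") as u2s:
        for wav in sorted(out_dir.glob("chunk_*.wav")):
            scp.write(f"{wav.stem} {wav.resolve()}\n")
            u2s.write(f"{wav.stem} unknown_speaker\n")
```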
Once that bit of work is done, you are ready to run Kaldi inference. As you may have guessed, inference is also a complex multi-stage process where intermediate outputs are staged on the disk as flat files. If you are a novice user, you will inevitably make mistakes and run into issues getting it to work. Incidentally, this is explicitly acknowledged in the first paragraph of Kaldi's README on GitHub, serving as a warning of sorts.
To compute accuracy results over whole files, you will also have to write some custom post-processing logic to concatenate the chunk-level results after inference. Although I originally intended to benchmark Kaldi's inference speed as well, I ultimately decided it made no sense to do so: it took orders of magnitude longer than the other models to run, and a non-trivial amount of time had already gone into simply figuring out how to use Kaldi.
wav2vec 2.0 facebook/wav2vec2-large-robust-ft-libri-960h
wav2vec 2.0 is an encoder model released by Facebook which was trained using a self-supervised objective on 60k hours of read audio books from the LibriVox project. It has several unique aspects which make it different from other open-source models, notably:
The architecture is unique in that it uses a "featurization front-end" comprising a stack of 1D CNNs which operates directly on 16kHz audio waveforms, downsampling them in time by a factor of 320x using strides. The rest of the architecture is a stack of vanilla transformer encoder layers.
The wav2vec 2.0 encoder maps the input audio to a sequence of quantized latent vectors generated by selecting entries from a codebook, where the selection operator is learned during training. This is in contrast to normal encoder models, where the encoder output maps directly to a continuous latent space.
The wav2vec 2.0 base model was trained entirely on unlabeled data using a contrastive training task where a subset of the encoder outputs was masked, and then the network was trained to identify the masked values amongst a set of "fake" outputs (called "distractors").
The wav2vec 2.0 "base model," which is produced by self-supervised training, is not capable of performing ASR inference on its own. Extending it to perform ASR requires adding a "head" to the model that projects the encoder's output over a vocabulary of characters, word parts, or words. And then the modified model has to be trained in a supervised fashion on labeled speech data, typically with CTC loss.
Despite the notoriety associated with wav2vec 2.0, there are relatively few examples of open-source ASR versions available. For our comparison, we chose wav2vec2-large-robust-ft-libri-960h, produced originally as a result of this paper and now hosted and made available for ASR inference by the HuggingFace transformers library. The following summarizes some important details about this model's DNA and how we inference with it:
It is a CTC encoder model produced as a result of fine-tuning the wav2vec 2.0 base model on LibriSpeech (960 hours of human-labeled, read speech from audiobooks) using CTC loss. The model has only seen speech from audiobooks in its training history, which is a relatively narrow domain of clean, read speech. This has implications for model accuracy when processing noisy, conversational audio.
It has a "large-capacity" transformer encoder stack comprising 24 blocks, 1024 hidden size, 16 attention heads, and a feed-forward dimension of 4096. This makes it memory intensive on a GPU.
It has a character vocabulary and so it can make spelling mistakes in the absence of language model post-processing.
Since the model has only been trained and tested on pre-segmented data (i.e., short "clips" of audio), there is no established inference procedure by which to apply it to the long-form audio which we will use in our tests. As such, we have to make some decisions, particularly on how to do audio pre-processing and batching. In our tests, we transcode the audio to s16 PCM at 16kHz, split it into non-overlapping 30-sec chunks, and then inference on batches of chunks using the HuggingFace tooling. We choose 30-second chunks because this is the chunk size used in the original wav2vec 2.0 training.
There is no out-of-the-box HuggingFace support for applying secondary post-processing (i.e., CTC beam search or language model re-scoring) to improve the decoding of a wav2vec 2.0 ASR model's output. And so, we use a simple greedy method for decoding as illustrated in the HuggingFace docs. This means that the model will run at maximum speed in inference but will suffer in accuracy.
Since the model operates on raw audio waveforms, the input sequence lengths are extremely long (30-second chunks of 16kHz audio have 480,000 time steps). This, coupled with the model's large capacity, makes it difficult to run inference on GPUs without running out of memory. To mitigate GPU memory issues, we ran inference in half-precision mode and with a batch size of 1. The results of inference on chunks are decoded separately, using the model's tokenizer, and then the resulting chunk text is concatenated to obtain a whole-file prediction.
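Putting these choices together, a simplified sketch of our inference loop using the HuggingFace tooling looks roughly like the following; `chunk_audio` is a hypothetical helper standing in for the 30-second, 16kHz chunking described above, and the exact harness we used differs in the details.

```python
# Simplified sketch of chunked wav2vec 2.0 inference with HuggingFace transformers.
# chunk_audio() is a hypothetical helper returning 30-second, 16 kHz mono waveforms.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "facebook/wav2vec2-large-robust-ft-libri-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).half().cuda().eval()  # half precision

def transcribe_file(path):
    texts = []
    for chunk in chunk_audio(path, seconds=30, sample_rate=16_000):    # assumed helper
        inputs = processor(chunk, sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values.half().cuda()).logits   # batch size 1
        pred_ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(pred_ids)[0])              # greedy decoding
    return " ".join(texts)
```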
Whisper medium.en
Whisper is a family of encoder/decoder ASR models trained in a supervised fashion, on a large corpus of crawled, multilingual speech data. There are several unique aspects to its model DNA, discussed below:
Its architecture is "deceptively simple" and comprises a stack of 2D CNNs followed by a symmetric transformer encoder/decoder stack.
The model ingests 80-dimensional log-mel filterbank features derived from audio transcoded to 16kHz. These are relatively "standard" features.
Whisper models are available in several sizes, representing a range of model capacities. There is substantial variation in speed and accuracy across the capacity range, with the largest models generally producing the most accurate predictions but running up to ~30x slower than the smaller ones.
The Whisper source code takes care of audio pre-processing and can natively handle long-form audio provided directly as input. This makes it infinitely more usable than Kaldi, and slightly more usable than the HuggingFace implementation of wav2vec 2.0 (see the inference sketch after this list).
Whisper was trained in a supervised fashion on a very large corpus comprising 680k hours of crawled, multilingual speech data. OpenAI refers to the training as "weakly supervised" since the labels have not been verified by humans and thus are potentially noisy. The source and domain characteristics of the training data are unknown. Nevertheless, it's clear that the Whisper training corpus vastly surpasses that of our Kaldi and wav2vec models in terms of both scale and diversity.
Whisper performs multiple tasks (language detection, voice activity detection, ASR, and translation) despite the decoder only having a single output head. This is in contrast to Kaldi and wav2vec 2.0 which only perform a single task: ASR. The Whisper developers accomplished this by training the model on multiple supervised tasks and using special task-specific tokens which were added as first-class entries in the decoder's vocabulary and then included in the decoder's input text.
In ASR and translation modes, Whisper naturally adds punctuation and capitalization to its output. This is important for end users as it improves the readability of the transcripts and enhances downstream processing with NLP tools. The Kaldi and wav2vec models both produce output that is unpunctuated and in all caps. As a result, you may get the distinct impression that these models ARE YELLING AT YOU.
Whisper predicts "segment-level" timestamps as part of its output. Whisper developers handled this in the same way as different tasks, i.e., by including timestamp tokens as first-class entries in the model's vocabulary and inserting them directly at particular locations in the training text. As discussed in the next bullet, the timestamp tokens play a key role in Whisper inference. Kaldi and wav2vec models do not produce timestamps for words or segments.
Whisper employs a unique inference procedure that is generative in nature. The default behavior is to infer sequentially on 30-second windows of audio. The audio window is embedded with the encoder and then mapped to a predicted text sequence auto-regressively by the decoder, which uses the encoder output as a context vector. Whisper keeps the predicted text only up to and including the last predicted timestamp token and throws the rest of the prediction away. The audio window is then advanced forward to the location associated with the last timestamp and the process repeated, with the previous chunk's predicted text prepended to the decoder input as additional context.
Since it's a generative encoder/decoder model, Whisper is prone to some particular failure modes like pathologically repeating the same word or n-gram. This is mitigated during inference by re-inferencing on the same audio chunk with temperature-based sampling when the model detects that inference has failed.
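To give a sense of the usability gap, transcribing a long-form file with the openai-whisper package takes only a few lines. This is a minimal sketch rather than our exact benchmarking harness, and the file path is a placeholder.

```python
# Minimal sketch of Whisper inference on a long-form audio file.
# The package handles transcoding, chunking, and feature extraction internally.
import whisper

model = whisper.load_model("medium.en")               # English-only, medium capacity
result = model.transcribe("some_long_recording.mp3")  # placeholder path; any ffmpeg-readable file

print(result["text"])                                 # full transcript with punctuation and casing
for seg in result["segments"]:                        # segment-level timestamps
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"].strip()}')
```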
For our testing, which is performed on English speech data, we use Whisper's medium.en model. We choose this size because it is equivalent to wav2vec2-large-robust-ft-libri-960h in terms of "expressiveness" in the sense that it uses the same encoder layer count, hidden size, number of attention heads, and feed forward dimension. Being an encoder/decoder model, Whisper medium.en is ~2x larger than the wav2vec model in terms of the number of parameters.
Evaluating Accuracy
Datasets
To compare the models, I randomly selected 50 files from Deepgram's internal validation sets for five domain areas: conversational AI, phone calls, meetings, videos, and earnings calls.
High-quality human transcripts for each file are then used as ground truth labels to measure transcription errors.
Accuracy Metrics
In ASR, the most widely used metric to quantify model accuracy is the word error rate (WER). Word error rate is based on the "Levenshtein distance" (or "edit distance"), which measures the differences between two strings—in this case, a predicted transcript produced by an ASR model and a human-labeled transcript. WER is defined as the number of errors divided by the total number of words in the ground truth.
Errors come in three forms: substitutions, insertions, and deletions. To see what counts as an error, let’s look at each one:
Substitution happens when a word gets replaced with another word (for example, "food" gets replaced with "good")
Insertion happens when a word that was not said is added (for example, "He is eating chipotle" becomes "He is always eating chipotle")
Deletion happens when a word is left out of the transcript entirely (for example, "come here now" becomes "come now")
Given a model prediction and a ground truth transcript, we perform an edit distance alignment between the two, which determines the locations of substitution, insertion, and deletion errors. We then simply sum them up and divide by the total number of words in the ground truth, i.e.
WER = (substitutions + insertions + deletions) / total words in ground truth
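A minimal sketch of this computation, using a standard dynamic-programming edit distance over words, is shown below; libraries such as jiwer provide the same functionality off the shelf.

```python
# Sketch of WER: count substitutions, insertions, and deletions via edit distance,
# then divide by the number of words in the ground truth.
def wer(truth: str, prediction: str) -> float:
    ref, hyp = truth.split(), prediction.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,      # substitution (or match)
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j] + 1)            # deletion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("come here now", "come now"))               # one deletion -> 0.333...
```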
WER can be computed at the level of individual files, or across entire datasets, giving you different views on how your model is performing. For our testing, we compute three summary metrics involving WER within each domain:
Overall WER: For this metric, we sum all the errors across files within a domain and then divide by the total number of truth words. This gives a "macroscopic" view of how the model is performing within each domain but can be skewed by strong performance on long files.
Mean WER per file: For this metric, we compute the WER for each file within a domain and then compute the average of file-level values. Depending on the domain, there may be a subset of files where a model performs quite poorly compared to the rest of the population. Oftentimes, these "problem" files are short in duration. In this case, the mean per file WER will be significantly larger than the overall WER.
Median WER per file: For this metric, we compute the WER for each file within a domain and then take the median over file-level values. This metric best reflects the "typical" performance of the model and thus, is probably best correlated with end-user experience.
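Given per-file error and word counts, the three summaries reduce to a few lines; this sketch assumes a list of (errors, truth_words) tuples for each domain.

```python
# Sketch of the three per-domain summary metrics, given per-file results.
from statistics import mean, median

def summarize(files):
    # files: list of (num_errors, num_truth_words) per file within one domain
    overall = sum(e for e, _ in files) / sum(n for _, n in files)   # overall WER
    per_file = [e / n for e, n in files]
    return {"overall_wer": overall,
            "mean_wer_per_file": mean(per_file),
            "median_wer_per_file": median(per_file)}
```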
Text Normalization
Before computing WER, it is common to apply some transformations to the model prediction and/or ground truth to try and minimize the adverse effect of formatting differences between the model's training corpus and the test data. This process is known as "text normalization."
Whisper has its own text normalizer which applies standard transformations such as lowercasing and punctuation removal, in addition to more liberal many-to-one mappings which operate on text spans like spoken digits, addresses, currency, etc. When Whisper's normalizer is applied to both the model prediction and ground truth, Whisper often enjoys a significant improvement in WER relative to other open-source models, as demonstrated in the Whisper paper.
For our tests, we computed results with both the Whisper normalizer and with a "simple" normalization scheme that only applies lowercasing and punctuation removal.
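The "simple" scheme amounts to something like the sketch below; Whisper's own EnglishTextNormalizer can be dropped into the same slot for the other scheme.

```python
# Sketch of the "simple" normalization: lowercase and strip punctuation only.
import re
import string

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

# Whisper's more aggressive normalizer can be swapped into the same slot:
# from whisper.normalizers import EnglishTextNormalizer
# normalize = EnglishTextNormalizer()
print(simple_normalize("He's eating Chipotle, right now!"))  # "hes eating chipotle right now"
```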
Accuracy Results
Overall WER
Whisper Normalization
Simple Normalization
Mean WER per File
Whisper Normalization
Simple Normalization
Median WER per File
Whisper Normalization
Simple Normalization
Discussion of Accuracy Results
Kaldi Gigaspeech XL
According to all metrics, the Kaldi model produces pathologically bad WERs, irrespective of the domain or text normalization scheme. In this challenging setting of real-world long-form audio, we find that the conventional pipeline model simply cannot compete, even when trained on 10k+ hours of audio. Among the domains, Kaldi produces its best accuracy on Video data, as measured by the median WER per file. This is probably explained by the fact that the Video files are most similar to its Gigaspeech training data.
wav2vec 2.0 facebook/wav2vec2-large-robust-ft-libri-960h
Of the three models, wav2vec places squarely in second, producing vastly better WERs than Kaldi, but significantly worse than Whisper across all domains and metrics. If we define "usable" accuracy as sub-20% WER, then wav2vec produces usable accuracy only on Video data, according to the median WER per file. Performance in the other domains is significantly worse.
The effect of text normalization is mixed across domains and metrics with no systematic trend. Comparing the overall WER and the mean WER per file, we see that there is a large disparity in three out of five domains (Conversational AI, Phone call, and Meeting) indicating that for these datasets, the model has produced pathologically bad predictions on a subset of short files.
Whisper medium.en
Of the three models, the Whisper predictions are the most interesting, but also the least consistent across metrics. According to some views of the data, the Whisper model is highly accurate. For example, the Whisper-normalized median WER per file shows usable accuracy across domains, with highly accurate predictions on Conversational AI, Earnings Calls, and Video data. However, with simple normalization applied, the median WER per file picture is significantly less rosy. The overall WER tells a completely different story, with the worst accuracy on Conversational AI data, followed by Phone calls and Meetings.
Like wav2vec, Whisper also exhibits a substantial degradation in mean WER per file on Conversational AI, Phone call, and Meeting data, indicating pathological behavior on a subset of short files. As far as the normalization scheme goes, we find that Whisper normalization produces far lower WERs on almost all domains and metrics. This result is qualitatively similar to the results of the original Whisper paper.
To get a sense of the distribution of file-level results, we provide a box-and-whisker plot below over file-level word error rates for each model and domain. These results were obtained with the Whisper normalizer. All three models, including Whisper, have a subset of files that produce pathological predictions and very high WERs. The spread in accuracy for the models was so broad that we found it necessary to use a log scale on the x-axis.
Evaluating Speed
Another important consideration when choosing an open-source model is speed. ASR inference has two major time components: Audio pre-processing and model inference. The model inference time depends on the model's architecture, inference algorithm, and capacity. It also depends, jointly, on the available computing hardware, i.e., whether you inference on CPU or GPU, and if on GPU, the particular GPU specs and allowable batch size.
Excluding IO costs, the largest time components associated with audio pre-processing are transcoding and feature generation, with the former being the larger of the two (transcoding time is usually 2-3x larger than featurization time). Because it involves both audio pre-processing and model inference costs, ASR inference speed is also dependent on the data you are processing, with the efficiency of most modern deep learning approaches being dependent on file length.
In our testing, we performed a 1-to-1 speed comparison between wav2vec 2.0 and Whisper over the five domains used in the accuracy comparisons. For each domain and model, we measured the total inference time associated with processing each file, including both audio pre-processing and model inference times. We then summed the cumulative inference time and cumulative audio duration over all files and computed a speed measure called "throughput" or "real-time factor", defined as
throughput = audio duration / inference time
Throughput represents, intuitively, the number of audio hours processed per hour of inference time.
Speed testing was carried out on two different Nvidia GPU types: 2080 Ti and A5000. In addition to measuring throughput, we also made point measurements of GPU memory usage and GPU utilization rate for each file using device queries from the Nvidia Management Library (NVML). To minimize the effect of audio pre-processing differences between wav2vec 2.0 and Whisper, we used Whisper's load_audio function to transcode audio for wav2vec 2.0. This function is simply a wrapper around ffmpeg and generates compatible 16kHz audio for wav2vec 2.0 using its default settings.
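For reference, point measurements like these can be taken with the NVML Python bindings, roughly as in this sketch using pynvml; when and how often to sample during inference is up to you.

```python
# Sketch of point measurements of GPU memory and utilization via NVML (pynvml bindings).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU

def gpu_snapshot():
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    return {"memory_used_mib": mem.used / 2**20,       # bytes -> MiB
            "gpu_utilization_pct": util.gpu}

print(gpu_snapshot())
pynvml.nvmlShutdown()
```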
Inference with both models was carried out in half precision mode. Batch size is another important parameter. Whisper only inferences on single samples and so its batch size is 1 regardless of GPU type. For wav2vec 2.0, we use the largest possible batch size permitted by the GPU before going OOM. For the 2080 Ti, we were limited to a batch size of 1 while for the A5000 we were able to increase the batch size to 3.
The results of performance measurements are summarized in the tables below for 2080 Ti and A5000 GPUs respectively.
2080 Ti Benchmarking
A5000 Benchmarking
In the performance results presented above, there are a few things that stand out:
wav2vec 2.0 is significantly faster than Whisper across all domains and for both GPU types. We measured a ~15x to 40x throughput difference, depending on the domain.
wav2vec 2.0 uses significantly more GPU memory than Whisper, even in the 2080 Ti test where they are both operating on the same batch size. This is interesting because Whisper has a larger cumulative capacity. It can partially be explained by the differences in the network inputs with wav2vec 2.0 operating on inputs that are 320x longer than Whisper.
Whisper has higher GPU utilization rates across most domains and for both GPU types. This simply reflects the fact that Whisper inference takes significantly more time on the GPU as a result of the auto-regressive nature of its inference algorithm.
The speed, GPU memory usage, and GPU utilization rates of both models are strongly data-dependent. This data dependence reflects a dependence on average file duration. Interestingly, the models display opposing inference speed trends: wav2vec 2.0 throughput increases with average file length, with its minimum speed on Conversational AI and maximum speed on Earnings Calls, while for Whisper we observe the opposite trend.
What We Learned Here
In our comparison, Kaldi is the clear loser in terms of usability, speed, and accuracy. In an open-source model comparison, this kind of clear result is the exception rather than the rule. It's more typical to face complex tradeoffs between models and this is precisely what we find for Whisper and wav2vec 2.0. Whisper is the clear winner in terms of accuracy, but it's more than an order of magnitude slower than wav2vec 2.0. Choosing between these two options would depend on which model better meets your needs.
It's also quite possible that none of the available open-source models meet your speed or accuracy needs. A great deal has been made about Whisper's accuracy, and we find it to be particularly strong on earnings calls and video clips. But what if your use case involves a domain where Whisper accuracy is poor, such as noisy phone call audio? Or what if you require advanced features like real-time transcription or diarization? What if you have thousands of hours of audio to transcribe, and you don't have the luxury of waiting weeks for transcription to finish?
If any of these questions are relevant to your use case, then you should probably consider using a speech-to-text API like Deepgram.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.