
Achieving Real-Time Solutions: The evolution of ASR

8 min read
By Ben Luks and Jose Nicholas Francisco
Published Oct 20, 2023
Updated Jun 27, 2024

At this confluence of immense computational resources and a seemingly endless supply of data, we’d be hard pressed to find a reason to train models from the ground up. Following the trend set by Large Language Models (LLMs), the last couple of years have ushered in the era of pre-training: big tech deploying unthinkable resources to train models that make quick and unprecedentedly effective work of our trivial Natural Language Processing (NLP) tasks. The 2020 release of Wav2Vec 2.0, and the Hugging Face wrapper that supports it, signaled that pre-training was planting its feet in the field of Voice Technology. For a while, that seemed to be how Voice Tech was done. It was the most convenient, it was the best, and given that no one I knew in my newly accredited master’s cohort was scaling to 10,000 monthly users, there wasn’t a reason to use anything else.

Release after release, tech giants are slapping us with the “newest,” “biggest,” or “most efficient” model yet. I can tell you that, at least in the case of NLP, picking the right option is getting to be a bit overwhelming in this ever-growing marketplace of unpredictably capitalized, cartoonishly named models. The release of Whisper was my first whiff of the same thing happening in Voice Tech. Before we’re all flooded, I wanted to take this opportunity to discuss the differences between Wav2Vec 2.0 and Whisper. What gap in the former necessitated the release of the latter? Are their differences enough to carve out a place for both, or is this a case of each entity scrambling to snag a seat at the table?

What this article isn’t is a technical appraisal of the two models’ architectures. In my opinion, machine learning discourse doesn’t spend enough time focusing on the practical implications of architectural specifications. Sure, knowing about attention heads and regularization techniques makes for great wine talk, but I’m after something the frustrated novice can make sense of. That’s what I’m hoping to give you here: The People’s™ breakdown.

And by the way, if you want to see Whisper (and Deepgram!) in action, check out this API Playground. You’ll be able to compare and test various AI speech recognition models without having to write any code. 😉

📖 For some very brief background…

Facebook (now Meta) research introduced Wav2Vec 2.0 in 2020. It was trained on, like, a bajillion hours of untranscribed audio. The pre-training task was to predict masked portions of the audio’s latent representation, not to transcribe anything. Those representations don’t do much on their own, but, passed through the model’s BERT-style Transformer and fine-tuned, they can be remarkably easily leveraged for so-called “downstream” tasks. The resulting performance is staggering, even with very little labeled data.
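To make “remarkably easily leveraged” concrete, here’s a minimal sketch (mine, not the paper’s) of transcribing a clip with a Wav2Vec 2.0 checkpoint that has already been fine-tuned for English ASR, via Hugging Face; the checkpoint name and audio file are just placeholders:

```python
# A minimal sketch of "downstream" use: transcribing audio with a fine-tuned
# Wav2Vec 2.0 checkpoint from the Hugging Face hub. The checkpoint and the
# audio file below are placeholders.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load mono audio at the 16 kHz sampling rate the model expects
speech, _ = librosa.load("example.wav", sr=16_000)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```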

Whisper’s claim to fame is what the authors call “weak supervision.” It was trained entirely on transcribed data; it’s just that, in a lot of cases, the transcriptions weren’t very good. That’s what allowed the authors to collect so much of it. The thinking goes something like this: there are infinitely many ways to get a transcription wrong, and only so many ways to get it right. Aggregate enough data, flawed as it may be, and it will only be consistent insofar as it’s accurate, yielding a model that generalizes toward accurate results. This life-finds-a-way philosophy proved effective, and the model achieves near human-level results on numerous benchmarks.

⚖️ Comparing Whisper and Wav2Vec 2.0

I’ll be skipping the obvious here. Should it so interest you, you can look into considerations of size, speed, and the clear difference in training approaches (like the fact that Whisper was trained on speech translation data). 

Wav2Vec 2.0, as the paper’s title makes explicit, is a framework. It represents a paradigm shift in the approach to ASR. Whisper, on the other hand, champions itself as the ready-to-use alternative to Wav2Vec 2.0. For all of Wav2Vec’s glory in circumventing the need for labeled data, the Whisper authors are very deliberate about spelling out the limitations of its encoder-only pre-training: the pre-trained encoder still needs a decoder and a fine-tuning stage before it produces usable transcripts, and fine-tuning well is its own skilled, fiddly process. How often you’ll run into problems of that sort, I couldn’t tell you.

Wav2Vec 2.0 is a great tool for gazing into the future and waxing poetic about technologies to come, as your standard-issue MMA-fighter-slash-comedian podcast hosts so love to do. But here on the ground, we rely on working technologies based on tried-and-true approaches to keep the gears turning.

Wav2Vec 2.0 has been great for everything I’ve ever needed it for, which amounts to trivial benchmark demonstrations and a low-brow article walking you through a quick transcription in a Jupyter environment. As far as more specific needs go, I can’t speak to those. It’s for this reason that, despite my skepticism about the air-tight limitations the Whisper authors describe, I’m compelled to give them the benefit of the doubt until I have the experience to argue otherwise.

🗞️ So, What’s New?

We’ve talked about what makes Whisper different, but how is it new?

Well… it’s not. 

But okay, let’s be specific. Compared with Wav2Vec 2.0, Whisper is indeed very different. However, compared to your standard encoder-decoder-based acoustic models, Whisper is simply another iteration of old technology. That is, from an architectural standpoint, Whisper is a beefed-up version of AI that already exists. And that’s the beauty of it.

ASR models have been made available since before a standard tech stack existed for running, fine-tuning, and deploying them. In practice, that meant, for instance, that running Baidu’s DeepSpeech involved getting yourself tangled in a mess of dependencies and terminal commands. That’s not to mention training the model, in which case you’re looking at file-structuring and formatting protocols akin to learning a whole new programming language. It’s annoying. Not the end of the world, but still annoying.

Like Wav2Vec 2.0, Whisper touts a training regimen that expands what counts as usable training data. But unlike Wav2Vec 2.0, Whisper trains within a familiar architecture.

What’s the upshot? Long story short, adopting Wav2Vec 2.0 would be a complete headache compared to adopting Whisper. Using Wav2Vec 2.0 entails swapping out the (adequate) ASR model that already lives on your device for the Facebook-produced framework. That’s a hefty migration. And as any software engineer knows, migrations cause migraines. Because of Whisper’s familiarity, on the other hand, adopting it could look like a footnote in a “What’s New in Version X.X.X.X?” changelog.

Furthermore, a model that can run on a common framework isn’t new. I mean, that’s the point of common frameworks, isn’t it? What Whisper is doing is leveraging the era of plug-and-play models already living in our favorite libraries. It’s our once-and-for-all model: a de facto standard for which documentation, tutorials, and weights pre-trained on a variety of datasets already exist. This isn’t new to machine learning, and it’s not so different from what we were offered with Wav2Vec 2.0. But it comes without the baggage of having to uproot your technological approach. Those of us who prefer the scenic route and want to get our hands dirty are welcome to crack into the likes of SpeechBrain, but that misses the point.

This isn’t an accident. Being not only open-source but parceled up as an easy-to-use Python package makes Whisper available at every level of the abstraction totem pole: from simple Python scripts, to the Hugging Face Hub, to native PyTorch. It lets you collapse the feature-extraction/model/decoder distinction, which can feel especially arbitrary and frustrating for newbies. Look out for it in your own favorite ASR service, open-source or otherwise.
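Here’s what that looks like in practice: a rough sketch (mine, not the article’s) of the same transcription at two levels of that totem pole. The “base” checkpoint and the audio file are placeholder choices.

```python
# A minimal sketch of Whisper at two levels of abstraction; "base" and the
# audio file are placeholders.

# Level 1: OpenAI's open-source package (pip install -U openai-whisper)
import whisper

model = whisper.load_model("base")
print(model.transcribe("example.wav")["text"])

# Level 2: the same family of checkpoints through the Hugging Face pipeline
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
print(asr("example.wav")["text"])
```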

🚀 From Whisper to Nova-2: Even more progress

Whisper, again, is built on a familiar architecture, with familiar training data and a familiar fine-tuning process. Nova-2, however, is more accurate than Whisper, faster than Whisper, and less expensive than Whisper.

But how?

Well, Nova-2 is the result of nearly a decade’s worth of iteration on patented AI architectures that deviate from, and improve upon, the classic Transformer architecture we’re all familiar with. And if you’d like to see the results, check out the video below:

🍲 What it all boils down to… 

To summarize: Wav2Vec 2.0 offered approachable Hugging Face packaging, in all its customizable glory, at the cost of buying into an entirely new paradigm. Whisper gives you the option to expand on the standard approach and (maybe..?) consolidate your tech stack. Is this the beginning of an era in which every run-of-the-mill acoustic model gets its own repo? Part of me sort of hopes not. I already get overwhelmed by the number of available models for NLP tasks. I think we could do more to separate the minutiae academia worries about from the differences that actually matter in day-to-day, trivial tasks.

I wonder if the entire premise of this article is a fair comparison. Sure, the authors spend a good few paragraphs motivating the need for the Whisper approach, given the supposed limitations of Wav2Vec 2.0. But is that the genuine motivation, or an act of technical due diligence, signaling an awareness of the state of the art?

The point of this article was to explore a new way of talking about the differences between technologies. Whisper is different. How you get it working, its size, and what you can do with it all differ from Wav2Vec 2.0. I’m not suggesting otherwise. I do, however, want to note that a big difference in a research context will, in an overwhelming number of cases, be trivial in the day-to-day. Mind you, “significant” means a very different thing academically than it does colloquially. Take that to mean what you will, then go test out Whisper and Nova-2.
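If you’d rather try Nova-2 programmatically than through the playground, here’s a rough sketch against Deepgram’s pre-recorded transcription API. The endpoint, query parameter, and response shape are my assumptions rather than something from this article, so check Deepgram’s current docs; the API key and audio URL are placeholders.

```python
# A rough sketch of transcribing hosted audio with Nova-2 via Deepgram's REST
# API, using only the standard library. Endpoint, parameters, and response
# shape are assumptions; the key and URL are placeholders.
import json
import urllib.request

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder
endpoint = "https://api.deepgram.com/v1/listen?model=nova-2"

payload = json.dumps({"url": "https://example.com/some-audio.wav"}).encode()
request = urllib.request.Request(
    endpoint,
    data=payload,
    headers={
        "Authorization": f"Token {DEEPGRAM_API_KEY}",
        "Content-Type": "application/json",
    },
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())

# Where the transcript lives in the response is also an assumption
print(body["results"]["channels"][0]["alternatives"][0]["transcript"])
```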

Note: If you like this content and would like to learn more, click here! If you want to see a completely comprehensive AI Glossary, click here.
