Introduction

What’s funny is that if you leaf through some of the earlier papers on end-to-end automatic speech recognition (ASR), the models featured can’t seem to outperform hidden Markov model (HMM) based approaches. Sure, in the past few years, end-to-end ASR finally got it together and beat out the old-timers by strapping together resource-intensive components, like attention mechanisms, into state-of-the-art models. Then again, a fire extinguisher puts out a candle better than a light puff does, but at what cost?

I think we may have been a little quick to leave the classic HMM approaches by the wayside. I mean, just do a quick Google search for tutorials on putting together your own ASR application, and you’ll sooner see recommendations to roll out RNN-transducers than any tried-and-true, HMM-based system.

The point is, for most of what ASR is needed for, HMMs and hybrid systems get the job done. In fact, the performance gap between these older, more modestly sized models and their end-to-end successors essentially amounts to a few edge cases. So why aren’t we getting hyped about them?

In this article, I want to talk about hybrid machine learning architectures and models, especially HMM-based models. Why don’t they get more attention?

I’ll warn you that lots of what I’m saying is speculative and anecdotal. I want to open a conversation. I’d like to consider, from a developer’s perspective, why a casual developer wouldn’t intuitively take this route, or might be discouraged from doing so.

I’m also not looking to get into the technological nitty-gritty. If some of the terminology seems out-of-scope, fret not. You don’t need to understand emission probabilities and hidden states to grasp “old-tech to new-tech” and “big-models to small-models.” This isn’t a story about machine learning models. This is a double-take at the ceaseless pressure to unquestioningly innovate, grow, and overtake.

I’ll take you through some of the reasons why I think HMMs aren’t seeing much love: notions I had that made me shy away from them, and obstacles that kept me from trying them out. I’ll also propose some ways to get people excited about them.

There are a couple of machine learning educators trying to encourage us to avoid following the hype. These folks, Jeremy Howard and Rachael Tatman to name a few, are advocating that we prioritize practical, efficient technologies over whatever’s new, flashy, and has the quirkiest name. I’m here to do the opposite. Let’s fight fire with fire. Big tech wants to make some twenty-quadrillion-parameter language model? Let’s make HMMs instead 😎

What’s Stopping Us

🪨 The Pipeline is Clunky

At least, that’s how it seems. I was just starting out in the linguistics world at a time when the need for pronunciation dictionaries had been essentially obviated. So I totally understand that adopting a somewhat tedious and seemingly outdated workflow is a lot to ask.

Nevertheless, there’s no reason hybrid technologies couldn’t be a fixture in the default pipeline. I mean, they sort of already are: the fact that PyTorch’s Wav2Vec2.0 implementation comes packaged with KenLM support speaks for itself.

But the technology isn’t inherently more complex; it’s just packaged in a much less practical way. PyTorch, TensorFlow, Keras and the like have spoiled us with the idea that a single library can cover our linear algebra, automatic differentiation, and architectural needs with just a few lines of Python code. (At least, that’s what the quickstart tutorials will have you believe.)
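
To make the "not inherently more complex" point a little less abstract, here’s a minimal sketch (mine, not taken from any particular library) of Viterbi decoding, the routine an HMM uses to pick the most likely sequence of hidden states behind what it observes. The two-state silence/speech model, its probabilities, and the function name viterbi are all made up for illustration; a real ASR system would learn these numbers from audio and juggle far more states.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its log-probability.

    Assumes every probability used is non-zero (math.log(0) would raise).
    """
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: the state that preceded s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the score of arriving in s.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Walk the back-pointers from the best final state to recover the path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model with invented numbers: is each audio frame silence or speech?
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
emit_p = {
    "silence": {"quiet": 0.5, "medium": 0.4, "loud": 0.1},
    "speech": {"quiet": 0.1, "medium": 0.3, "loud": 0.6},
}

path, log_prob = viterbi(("quiet", "medium", "loud"), states, start_p, trans_p, emit_p)
print(path)  # ['silence', 'silence', 'speech']
```

That’s the whole decoder in plain Python with nothing but the standard library. The hard part of a hybrid pipeline is everything wrapped around it, which is a packaging problem more than a math problem.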

🐍 There isn’t a Unified Python Library

What’s the Python library for modern machine learning? PyTorch. TensorFlow. Keras.

You can even choose along a sliding scale of abstraction. Transformers is PyTorch wrapped nicely with a bow. The options are abundant but overlapping. If you’re looking at older techniques, you sort of run into the exact opposite problem.

More specifically, if you want to build an HMM-based model, the best you’ll find is hmmlearn, which looks swell but hardly covers all the basics.

For one, the modern deep learning libraries are easier to use on every level: they’re easier to find, more abstracted, and better documented. Think about what it takes to crack into the transformers library. Sure, the learning curve can feel steep, but once you’ve gotten the hang of it, you can tackle the rest of the library. Even breaking into PyTorch, a remarkably denser and lower-level library, at least comes with the promise of giving you the tools to practice end-to-end machine learning across the board.

A couple of pip install commands is a picnic compared to untangling your way through the mess of dependencies needed to get Kaldi, an HMM-based framework, up and running. Kaldi's documentation describes the toolkit as having “modern and flexible code, written in C++, that is easy to modify and extend.”

That’s all nice and well for 2013, but things have changed. There’s a reason the name “C++” looks like a key: writing C++ is locked behind the iron gate that is knowing C++. I can sympathize with the possibility that this code was written in a time when ASR was for ASR engineers, but, again, things have changed.

The wonderful thing about high-level libraries like PyTorch is that they’ve made machine learning available in a world where programming is unprecedentedly accessible to the non-engineer. It’s time for the old-timers like Kaldi to catch up.

To summarize: when you learn PyTorch, you learn machine learning. When you learn Kaldi, you learn Kaldi.

🧠 Switching the Paradigm: HMMs aren’t neural networks

It’s worth noting that every HMM-based implementation is just that: an implementation. There will always be decisions made under the hood that are out of our control, whether for reasons of performance, practicality, or just in accordance with a certain practice. No matter the reason, though, we will have to forfeit a certain amount of control. That’s the tradeoff of user friendliness, after all.

The issue with these HMM libraries is less whether the code is available and more that it’s not packaged in a friendly way. Working on a computer vision problem? You’ll call a CNN module. Sentiment analysis? You’ll either use a default function in some LLM package or implement it yourself in a few hours with the aforementioned PyTorch. But doing ASR or implementing a chess bot? You’ll find yourself tangled in a mess of documentation and a mix-and-match, no-rhyme-or-reason set of HMM-related functions from different libraries sourced from various corners of the internet.

😊 Conclusion

Older machine learning algorithms are efficient and effective, but they’re not exciting. In my opinion, whether they work, or whether they work as well as newer approaches, is moot: these metrics are based on arbitrary benchmarks, and they mostly differ by a couple tenths of a percent.

HMMs aren’t necessarily practical, and the performance savings don’t feel significant until you’ve reached a production level, a pipe dream for most folks behind a computer screen scrolling through Towards Data Science listicles.

However, we can’t just toss these HMM methods into the trash. We’ve proven that they work. We just need to iterate on them. 

More specifically, if we want to modernize HMM-based approaches for the post-AI-boom world, we would need:

  • An imperative codebase for building these models.

  • A unified codebase.
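
As a rough picture of what "imperative" could mean here, below is a hypothetical sketch of an HMM written the way you’d write a PyTorch module: build an object, call a method. The DiscreteHMM class and its log_likelihood method are invented for this example and don’t belong to any existing library, and the numbers are made up. The method scores a sequence with the standard forward algorithm.

```python
import math


class DiscreteHMM:
    """Hypothetical module-style HMM; not the API of any existing library."""

    def __init__(self, start, transition, emission):
        self.start = start            # start[i]: P(first state is i)
        self.transition = transition  # transition[i][j]: P(next is j | current is i)
        self.emission = emission      # emission[i][k]: P(symbol k | state i)

    def log_likelihood(self, observations):
        """Forward algorithm, rescaled at every step to avoid underflow."""
        n = len(self.start)
        # alpha[i]: probability of the observations so far, ending in state i
        alpha = [self.start[i] * self.emission[i][observations[0]] for i in range(n)]
        total = 0.0
        for t in range(len(observations)):
            if t > 0:
                # Propagate one step through the transitions, then emit.
                alpha = [
                    sum(alpha[i] * self.transition[i][j] for i in range(n))
                    * self.emission[j][observations[t]]
                    for j in range(n)
                ]
            # Normalize alpha and accumulate the log of the scaling factor;
            # the product of all scaling factors is the sequence likelihood.
            scale = sum(alpha)
            total += math.log(scale)
            alpha = [a / scale for a in alpha]
        return total


# Invented two-state model; symbols 0, 1, 2 stand for quiet, medium, loud.
model = DiscreteHMM(
    start=[0.6, 0.4],
    transition=[[0.7, 0.3], [0.4, 0.6]],
    emission=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
)
score = model.log_likelihood([0, 1, 2])
print(score)
```

The appeal of this shape is that the model is an ordinary object you can inspect, subclass, and step through in a debugger, rather than a recipe of shell scripts and config files.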

Again, for now, this is a pipe dream. However, what I’m proposing, for the time being, is to approach machine learning with the cheap-dad-buying-his-teenager-a-used-car mentality: let’s see how the older models work. Let’s see where the gaps are. And then we can start talking about upgrading. Don’t like how the older model looks? Then let’s rebrand it: HMMs are classic, vintage. And I hope your next ASR gizmo isn’t too shy to pull up to the quad in one of these retro fits.

Bibliography

[2112.12572] Are E2E ASR models ready for an industrial usage?
