The most exciting time to be in the Automatic Speech Recognition (ASR) space is right now. Consumers are using Siri and Alexa daily to ask questions, order products, play music, play games, and do simple tasks. This is the norm, and it started less than 15 years ago with Google Voice Search. On the enterprise side, we see Voicebots, Conversational AI, and Speech Analytics that can determine sentiment, languages, and emotions.
Early Years – Hidden Markov Models and Tri-Gram Models
The history of Automatic Speech Recognition started in 1952 with Bell Labs and a program called Audrey, which could transcribe simple numbers. The next breakthrough did not occur until the mid-1970s, when researchers started using Hidden Markov Models (HMMs). An HMM uses probability functions to determine the correct words to transcribe. These ASR speech models take snippets of audio and determine the smallest unit of sound in a word, which is called a phoneme. The phonemes are then fed into another program that uses the HMM to guess the right word using a most-common-word probability function. These serial processing models are refined by adding noise reduction up front and beam search language models on the back end to create understandable text and sentences. Beam search is a time-dependent probability function that looks at the transcribed words before and after the target word to find the best fit for the target word. This whole serial process is called the “tri-gram” model, and 80% of the ASR technology currently being used is a refined version of this 1970s model.
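To make the "best fit" idea concrete, here is a minimal sketch of how a tri-gram lookup might choose between two similar-sounding candidate words by looking at the words before and after the target. The toy corpus and the function names `train_trigrams` and `best_fit` are assumptions made up for illustration; a production system would use smoothed probabilities over a far larger corpus and combine them with acoustic scores.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count every run of three consecutive words in the training sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

def best_fit(counts, prev_word, candidates, next_word):
    """Pick the candidate seen most often between prev_word and next_word."""
    return max(candidates, key=lambda w: counts[(prev_word, w, next_word)])

# Toy corpus (hypothetical); real models are trained on far more text.
corpus = [
    "please recognize speech today",
    "we recognize speech with models",
    "they wreck a nice beach",
]
counts = train_trigrams(corpus)

# The acoustic stage could not decide between "speech" and "beach".
choice = best_fit(counts, "recognize", ["speech", "beach"], "with")
print(choice)
```

In this sketch the surrounding words settle the choice: the triple "recognize speech with" appears in the corpus, while "recognize beach with" does not.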
New Generation of ASR – Neural Networks
The next big breakthrough came in the late 1980s with the addition of neural networks. This was also an inflection point for ASR. Most researchers and companies use these neural networks to improve their current tri-gram models with better upfront audio phoneme differentiation or better backend text and sentence creation. This tri-gram model works very well for consumer devices like Alexa and Siri that only have a small set of voice commands to respond to. However, this model is not as effective with enterprise use cases, like meetings, phone calls, and automated voicebots. The refined tri-gram models require huge amounts of processing power to provide accurate transcription at speed. Businesses need to trade speed for accuracy or accuracy for costs.
New Revolution in ASR – Deep Learning
Other researchers believed that neural networks were the key to a new type of ASR. With the advent of big data, faster computers, and graphics processing unit (GPU) processing, a new ASR method was developed: end-to-end deep learning ASR. This new method can “learn” and be “trained” to become more accurate as more data is fed into the neural networks. Developers no longer need to re-code each part of the tri-gram serial model to add new languages, parse accents, reduce noise, or add new words. The other big advantage of end-to-end deep learning ASR is that you can have accuracy, speed, and scalability without driving up costs.
This is how Deepgram was born: out of research that did not try to refine a 50-year-old ASR model but instead started from deep learning neural networks. Check out the entire history of ASR in the image below.
Contact us to learn how you can decrease word error rate systematically without compromising speed, scale, or cost.