
What makes Alexa, Siri and HAL tick

Before computers can rule the world, they first have to recognize the words we use

Siri, Amazon Alexa, Google Home, Apple HomeKit and HAL from 2001: A Space Odyssey are all examples of computers that listen to us and do what we say (usually). Devices with Voice User Interfaces (Voice UI) seem to have popped up everywhere, and that is only natural! After all, we evolved to speak, not to type, use a mouse or even a touchscreen.

HAL from Stanley Kubrick’s 2001: A Space Odyssey via GIPHY

Voice UI is inherently a logical way to interact with computers. Consider that 1968, the year 2001: A Space Odyssey introduced HAL, a sentient computer that speaks and listens, is also the year the computer mouse was first shown to the public. It is fascinating to think that just as the mouse debuted, itself a quantum leap in human-computer interaction, Stanley Kubrick understood that the real way to interact with a computer is with human speech.

What it takes to get a computer to understand you

When it comes to speaking to computers in a human way (ways similar to how you and I speak to each other), there are essentially two problems to be tackled:

  1. Getting computers to recognize our words.
  2. Getting computers to understand what we mean with these words.

Since I can’t even get my friends to understand what I mean when I say “please be nice to me,” you can see that bringing (2) to reality is pretty hard. Thankfully, we are getting better at (1).

In humans, the ability to speak and listen to language co-evolved with being sentient, decision-making beings. So, the question to be asked is: can one exist without the other? The short answer is yes.

Recognizing humans’ words is the key to more advanced human-computer interaction

The technology behind all Voice UI-enabled products is Automatic Speech Recognition, or ASR for short. ASR technologies (there are many types, each with its own benefits and limitations) allow Alexa and Siri to recognize your words (usually). Without ASR, none of these products would be possible: no Siri, no Alexa, no Google Home and no HAL, Johnny 5 or Rosey from the Jetsons!

Interestingly, in the movies almost every super-computer (usually sentient, almost always evil) has the ability to recognize human speech. Put another way, examples of sentient computers that don’t have some sort of (putative) Automatic Speech Recognition hardware or software are very rare. However, one example stands out: director John Badham’s WarGames (1983). In the movie, the computer is sentient but can’t recognize human speech. This leads to a lot of shots of a CRT monitor with blue-on-black type and a blinking cursor. Matthew Broderick’s character must type into the computer because it cannot understand his spoken words.

If I have to type to speak with my computer overlords, let the interface be this retro via GIPHY

Because we have not yet solved the problem of getting computers to understand us, today’s Voice UI technologies have us speak formulaically to them. Critically, current Voice UI can’t handle more than one or two turns of speech. What would human conversations be like if each consisted of a single turn of speech?!

Consider that in mid-2018, the Google Duplex demo astounded the world by demonstrating a computer that can handle more than one turn of speech. How did Google get us one step closer to creating a HAL-like personal assistant? Better ASR is one critical aspect.

Pushing ASR forward makes better Voice UI

ASR technologies allow computers to recognize our words with an ever lower error rate, but that does not mean that computers understand what we mean. Still, understanding starts with recognition, so the future of Voice UI and, ultimately, of creating sentient computers is tied to advancements in ASR technologies. For example, as ASR engineers work on making speech-to-text (machine transcription) more accurate, they develop technologies that take speech context into account.

What does “take speech context into account” mean? It means that in creating transcripts of audio data automatically, we are teaching computers to consider words in the context that they are used, just like humans do.

For example, the words that you find around the word “tree” are very different depending on what the conversation is about. It could be a conversation about botany, a conversation about family, or one about linguistics. To create better audio transcripts tailored to our needs, and to get super evil sentient computers, we have to teach them about context.
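To make the “tree” example concrete, here is a deliberately tiny sketch of using surrounding words as context. The topic word lists and the guess_topic function are made up for illustration; real ASR systems rely on statistical language models rather than hand-written lists, so treat this as a toy under those assumptions.

```python
# Hypothetical, hand-picked context words for each topic (illustration only).
TOPIC_WORDS = {
    "botany": {"leaves", "roots", "bark", "oak", "forest"},
    "family": {"ancestors", "grandmother", "cousins", "genealogy", "relatives"},
    "linguistics": {"syntax", "noun", "phrase", "parse", "sentence"},
}


def guess_topic(transcript: str) -> str:
    """Guess the topic by counting how many known context words appear."""
    words = set(transcript.lower().split())
    scores = {topic: len(words & vocab) for topic, vocab in TOPIC_WORDS.items()}
    # Pick the topic whose word list overlaps the transcript the most.
    return max(scores, key=scores.get)


print(guess_topic("the parse tree shows the noun phrase in this sentence"))
print(guess_topic("my grandmother traced our ancestors on the family tree"))
```

The word “tree” appears in both sentences, but the words around it point to linguistics in the first and family in the second, which is the kind of signal a context-aware transcription system can exploit.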

Simply put, ASR makes Voice UI possible

Voice UI, in turn, is simply a more natural, but still contrived, way of getting computers to do our bidding. Long before computers can be taught to think on their own (and even after this is achieved), computers will need to reliably recognize human speech. If the ASR technologies that allow computers to recognize our words are inaccurate, Siri, Alexa and Google Duplex will risk scheduling us for hair appointments at the local ice-cream shop, or worse, have a Tesla drive us to our ex’s house rather than the Exeter hotel. As ASR and other speech technologies improve, the way we interact with the Alexas and Google Homes of the future will become less scripted and more human-like.

If computers can recognize our words, they can learn from us too via GIPHY

ASR is the first step to getting information out of speech

In 2018, the use cases of Voice UI seem to be limited to shopping online, dimming lights at home, and sending unintelligible, passive-aggressive texts with your phone’s voice-to-text feature. By contrast, the use cases for ASR are far more diverse.

  • Insurance companies use ASR-based technologies to stay compliant with a myriad of government and industry regulations, saving millions.

  • Sales enablement platforms use ASR technologies to deliver unprecedented insights into sales calls for coaching purposes, earning millions.

  • Call centers and companies with large call volumes leverage ASR technologies to automatically find meaningful information hiding in their hundreds of thousands of hours of call recordings.

In graduate school, I would have loved to use ASR technologies to save myself hundreds of hours of (surprisingly inaccurate) manual transcription for research. ASR would have given me more time to analyze the data and do something which I am told other students did: socialize.

ASR technologies help us find and make sense of all the information hiding in the vast stores of human speech that we have amassed since the invention of sound recording. The improvements made in ASR over the last decade have brought the word error rate down to relatively low levels, giving birth to products like Siri and Alexa.
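For readers wondering what a word error rate actually measures, here is a minimal sketch of the commonly used definition: the number of word substitutions, deletions and insertions needed to turn the machine transcript into the reference transcript, divided by the number of words in the reference. The function name and the sample sentences are my own illustration, not any particular vendor's scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "drive me to the exeter hotel",
    "drive me to the ex's house",
)
print(f"{wer:.2f}")
```

In this made-up example two of the six reference words are wrong, so the error rate is one third, and the car is headed to the wrong place.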

As ASR technologies improve, not only will consumer products like Google Home perform better, but so will the hundreds of behind-the-scenes ASR implementations that make modern companies competitive. When well used, ASR technologies can give us insights into our audio data, our companies’ performance, customer behavior and so many other areas.

The list of potential uses of ASR is nearly endless but definitely includes sentient robots and adventure-packed interstellar travel.

