In the spring of 2017, Google announced that their voice recognition WER (word error rate) had fallen to just 4.7%-putting it on the par with human transcriptionists. "Wow!" said the world. Highly accurate speech recognition is changing how we interact with computers, how we advertise and so much more. Too bad that a 4.7% WER is advertising bunk. Several large companies have announced similarly low rates-and these low rates are real, but here's the catch: They managed to get human-level accuracy by training their ASR (automatic speech recognition) systems on a small language corpus, like the National Switchboard Corpus.
The National Switchboard Corpus is a well used (overly-used) database of phone calls that have been carefully transcribed for linguistics research. When companies announce that their new speech recognition system has impossibly low word error rates, it's because they are trained and validated on this very limited data set. No company has yet to reliably deliver 4.7% accuracy on everyday audio-the sort of audio that comes through call centers and cloud conferencing companies and needs to be transcribed.
In fact, even the most highly trained (and expensive) human transcriptionists would struggle to get 4.7% accuracy on regular "wild" audio data. When data scientists look at different speech recognition APIs (ASR products), they evaluate them according to several metrics, WER being a principal one. Yet, WER is not a perfect metric. This is because WER is strongly affected by:
Different kinds of noise
How transcripts are (or aren't normalized)
1. Noisy Voice Data Lowers WER
Have you ever wondered why the NATO phonetic alphabet exists? The NATO phonetic alphabet (NATO-PA for short) is the cool words that pilots use to communicate (both in real life and in the movies). You've heard it:
"Delta bravo eight nine six cleared to land two six right."
The NATO phonetic alphabet was created to address the very same problem that causes WER to vary greatly for any one system: the variety of accents in language and the noisiness of communication methods make comprehension difficult. In 2018, top-performing call center representatives all know the NATO-PA because the phone-POTS and VoIP-are noisy, compressed communication media.
Today, we have a variety of voice communication channels, but we still face the same problem that WW2 pilots faced: The phone systems our customers call us on introduce a lot of noise into the voice data. The noise may come from problems in the line, compression, and ambient noise.
People call businesses from office phone lines, from their cell phones, even occasionally from a sat-phone. All these systems introduce noise from bad or weak connections: bad jack, bad signal, squirrely satellite. All the long-distance communication systems introduce noise somehow and can seriously hamper ASR systems. But, line noise is not the only form of noise that causes problems for transcription.
All voice communication systems, VoIP, chat, Whatsapp and so on compress their audio. Compression means that the phone systems are designed to be more efficient by removing a lot of information. That makes talking cheap, but it also means that you can't understand as much. Have you ever tried understanding a new word over the phone? Impossible, no?
Few of us have a landline anymore. Plus, all of us are very busy. This means that when we do call the companies we buy from, we are likely calling from noisy environments.
Echo-y conference room
To make matters worse, there is no way to control for the quality of the microphone or how far away people speak from the conference room microphone. The fact is, most real-world data is going to be noisy. Therefore, when using a speech recognition system to decode your voice data, make sure it's trained on your kind of noisy voice data.
2. Crosstalk: Humans Come in Pairs
There is another form of 'noise' that really dizzies brains and off-the-shelf speech recognition systems: crosstalk. As long as we are in the room, know the topic and the speakers intimately, it isn't hard to follow a conversation when two people talk over each other. You may not even realize when someone starts their sentence before you can end your's. However, recordings of human conversation are hard to follow and even harder to transcribe. Most of the difficulty arises from crosstalk-the moment when two people speak at the same time. Crosstalk is incredibly common-it is a natural and somewhat unavoidable part of human conversation. Due to the fact that humans and machines have trouble sorting out the words said in crosstalk, you get unpredictable results when it gets transcribed.
How do you represent simultaneous speech in text?
What if crosstalk makes things inaudible for humans and machines?
Is it more important to transcribe one speaker versus another?
There are few standards developed to deal with the "crosstalk error." As a result, what gets transcribed is done in many different ways: sometimes correctly but in the wrong line (so it looks like deletion AND substitution errors at the same time) or completely omitting one speaker's words entirely. Naturally, such confusing standards would confound ASR and greatly impact WER. Data scientists should therefore consider how much crosstalk there is likely to be in their data. Web conference audio data likely has a lot and YouTube recipe videos likely have little, unless it's Giada attempting to teach Nicole Kidman and Ellen Degeneres to cook.
When choosing a speech recognition API, look at where the errors occur. If the crosstalk errors are not important, consider re-calculating the error rate for yourself.