Generic ASR will never be accurate enough for Conversational AI
The human brain is remarkable at processing speech and understanding what is said. If we are talking about a baseball game, your brain understands that when I say "pitcher" and "batter", I don't mean a large vessel for pouring drinks and a mix for making pancakes. Your brain matches the words to the context and intent of the conversation. It also has an amazing noise filter that lets you focus on the important parts of a conversation: if you are at a baseball game, there is constant noise around you, but when your buddy talks to you, you can lock onto his voice and understand him clearly.
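One way ASR systems try to approximate this context matching is contextual biasing: boosting candidate transcripts that contain in-domain words when rescoring hypotheses. A minimal sketch of the idea (the vocabulary, scores, and function names here are invented for illustration, not any particular ASR API):

```python
# Toy sketch of contextual rescoring: boost ASR hypotheses that
# contain words from the current conversation's domain.
# All vocabularies and scores below are invented for illustration.

BASEBALL_CONTEXT = {"pitcher", "batter", "inning", "strike", "bullpen"}

def rescore(hypotheses, context_words, boost=0.5):
    """Add `boost` to a hypothesis's score for each in-context
    word it contains, then return the best-scoring hypothesis."""
    def score(hyp):
        text, acoustic_score = hyp
        hits = sum(1 for w in text.lower().split() if w in context_words)
        return acoustic_score + boost * hits
    return max(hypotheses, key=score)

# Two acoustically similar hypotheses; without context the first wins.
hyps = [
    ("the picture threw a strike", -1.0),
    ("the pitcher threw a strike", -1.1),
]
best = rescore(hyps, BASEBALL_CONTEXT)
print(best[0])  # prints "the pitcher threw a strike"
```

Real systems do this with far richer signals (language models, n-gram boosting, custom vocabularies), but the principle is the same: the conversation's context tips the balance between acoustically similar candidates.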
How does a Conversational AI system determine the intent of the conversation and focus on the important words? Consider a possible future Conversational AI example: a robot waiter at a local pub. There are three conversations going on around it. The booth to its left is talking about a weird internet video. A table behind it is complaining about the last place the group ate and how bad the chicken was. And the table in front of it has delegated the task of ordering appetizers to the person at the back of the table, with everyone throwing requests that person's way. Even given a one hundred percent accurate transcript of the audible conversation at the table, it would be very hard for the robot to understand what should happen here. Did this table just order chicken tenders, or was that the other table? Was that two orders of the appetizer, or was the first person asking someone else to order it? Was that 'mh-uh' a no, they don't want the biggie-sized version, or was it just a throat clearing?