Article·AI & Engineering·May 13, 2023

Voicebots Will Enhance Your Life (Not Destroy It)

By Keith Lam
Published May 13, 2023
Updated Jun 13, 2024

We have been lucky to have two interesting and informative panel discussions on the evolution, current state, and future predictions of Conversational AI voicebots.

Our panelists included some of the pioneers of voicebots, like Al Lindsay, the former VP who led the technical creation of Amazon Alexa, and companies that are creating a more human-like voicebot experience: Bitext, Elerian AI, OneReach AI, Uniphore, and Valyant AI. Our discussions were not about command-and-respond smart speakers and assistants (Amazon Alexa, Siri, and Google) but about applications that sound, feel, and respond like a human, where you feel you can have a more natural conversation with the machine. One case study involved a customer who continually said "Thank You" to the AI voicebot because the interaction was so human-like. Some interesting takeaways from the discussions:

  • Implementing a human-like voicebot is difficult, and you have to be committed.

  • Currently, there is no unified voicebot that can speak on any topic, in all dialects, and in all languages, so each voicebot must be built for a specific use case and domain.

  • The foundation of human-like voicebots is having a great speech-to-text application.

  • Conversational AI voicebots cannot replace humans, but they will enhance their jobs.

Commitment to Voicebots

Plug-and-play voicebots don't exist yet, as each voicebot must be trained for the domain and use case of the organization. For example, you can't use a voicebot trained for banking in England for food ordering in the U.S. The voicebot will not understand the customers' accents, dialects, and terminology, nor can it respond correctly. And you may want your text-to-speech engine to have a British accent instead of an American one. Remember, these Conversational AI voicebots are not simple direction-giving apps that only have to respond with 20 different sentences. Kevin Fredrick, the Managing Partner of, expressed this best when he said, "Building a Conversational AI voicebot is like planning to summit a mountain. Those who are looking for an 'easy button' get frustrated and quit. The ones who think it will be too hard, don't ever start. It is the ones who know the challenge is worth it and have the right partners and use the right tools who make the summit." Although it is not easy, the rewards and return on investment are worth it when it is done correctly. Better customer satisfaction with less waiting on hold, faster answers to customers' questions, more opportunities for add-on sales, and freeing your human agents for more difficult tasks are some of the outsized benefits.

A Unified Voicebot

Unlike in the movies, there is no Conversational AI voicebot that knows everything. If you have seen the movie "Her," Samantha may be the ultimate personal voicebot. Unfortunately, even with cloud computing, big data, and ultrafast connections, we cannot put all knowledge into one voicebot. For some voicebot companies, that may be the ultimate goal. Think of a voicebot that you can talk to about both finance and soccer, and that remembers your previous conversations and refers back to them. It can combine what it has learned from current events and from previous conversations with others to provide you with recommendations and responses. Currently, you need one voicebot trained for one domain and use case. But, as Antonio Valderrabanos, CEO of Bitext, indicated, you may be able to combine a multitude of voicebots to cover a larger range of knowledge and conversations. So, how do we get there? Our experts think it will take both a large breakthrough innovation and a bunch of smaller innovations along the entire voicebot workflow, from speech recognition to text-to-speech, to create the unified voicebot. So, when you find yourself getting upset with Alexa or Siri, remember they are still a long way off from Samantha. They're only capable of so much.
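
To make the idea of combining several single-domain voicebots a little more concrete, here is a minimal Python sketch of a router that hands each transcribed utterance to whichever domain bot seems to fit best. Everything in it is an assumption made for illustration: the bot functions, the keyword sets, and the keyword-overlap scoring are hypothetical stand-ins, not something the panelists described. A real system would likely use a trained intent classifier instead.

```python
from typing import Callable, Dict, Set

# Hypothetical single-domain bots: each takes transcribed text and returns a reply.
def finance_bot(text: str) -> str:
    return "finance: " + text

def soccer_bot(text: str) -> str:
    return "soccer: " + text

# Assumed keyword sets per domain, chosen only for this illustration.
DOMAIN_KEYWORDS: Dict[str, Set[str]] = {
    "finance": {"account", "balance", "loan", "interest"},
    "soccer": {"goal", "match", "league", "striker"},
}

BOTS: Dict[str, Callable[[str], str]] = {
    "finance": finance_bot,
    "soccer": soccer_bot,
}

FALLBACK = "Sorry, I can't help with that yet."

def route(text: str) -> str:
    """Send the utterance to the domain bot with the most keyword matches."""
    words = set(text.lower().split())
    scores = {domain: len(words & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No domain matched at all, so no bot is qualified to answer.
        return FALLBACK
    return BOTS[best](text)
```

Calling `route("What is my account balance")` would reach the finance bot, while an utterance with no known keywords falls through to the fallback reply, which mirrors the point that each bot only covers its own domain.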

Great Speech to Text is the Foundation

Dion Millson, CEO of Elerian AI, summed it up best when he said, "For Conversational AI voicebots, it all starts off with speech recognition, if you don't understand what the person said and transcribe it to text accurately, you are not in the game. Unfortunately, the general ASR models standardize around 70% accuracy, and it is just not good enough to respond to a caller with real-time accuracy and relevance. Our partnership with Deepgram and their models in conjunction with our internal models that are trained on case-specific data get well over 90% accuracy." He further said that some words are much more important than others in a specific use case. For his banking customers, the account numbers, phone numbers, and government ID numbers are vitally important for the voicebot to provide the right response. You cannot be 70% accurate on these keywords; you need to be closer to 100% correct, or the whole system fails. Inaccurate transcriptions sent to the artificial intelligence knowledge base will lead to incorrect responses or a "please repeat that" request. The foundation must be a highly accurate speech recognition solution that can be trained on the keywords for that use case.
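
The distinction between overall accuracy and keyword accuracy can be illustrated with a short Python sketch. It assumes that "accuracy" means one minus the word error rate (word-level edit distance divided by the number of reference words), which is a common convention but not necessarily the metric the panelists had in mind. The function names and the sample sentence are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped by the ASR
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """Overall accuracy, assumed here to be 1 - word error rate."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_errors(ref, hyp) / len(ref))

def keyword_accuracy(reference, hypothesis, keywords):
    """Share of critical keywords from the reference found in the transcript."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    expected = [k.lower() for k in keywords if k.lower() in ref_words]
    if not expected:
        return 1.0
    return sum(1 for k in expected if k in hyp_words) / len(expected)

# Hypothetical banking utterance where only the account number is misheard.
spoken = "please transfer fifty dollars to account 4417 today"
transcript = "please transfer fifty dollars to account 4470 today"

overall = word_accuracy(spoken, transcript)               # 7 of 8 words right
critical = keyword_accuracy(spoken, transcript, ["4417"])  # account number wrong
```

In this made-up case the transcript gets seven of eight words right, yet the one word that matters for the transaction is wrong, so the keyword score is zero. That is the situation described above: a respectable overall score can still leave the voicebot unable to act.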

