Article·AI & Engineering·Aug 27, 2024

The Best (and Worst?) of Voice AI Technology

By Tife Sanusi
Published Aug 27, 2024
Updated Aug 27, 2024

Earlier this year, OpenAI announced the release of its new flagship model, GPT-4o. The model can see, hear, and accept prompts through text, audio, video, images, or any combination of them. OpenAI showed off these capabilities in a series of demos, and some of the most impressive moments involved voice AI. In one demo, GPT-4o works through a math problem with Sal Khan, founder of Khan Academy, a company that provides free online courses for students, and his son. Throughout the demo, GPT-4o talks through the problem while following its instructions not to give the answer but to nudge the student in the right direction.

Voice AI has come a long way from its humble beginnings and is now a rapidly growing technology that can be integrated into almost every work process or personal interaction. Its versatility means that it can be used across numerous industries and teams in unique ways. Voice assistants are one of the most common applications, with virtual assistants like Siri, Alexa, and Google Assistant helping to make life easier for millions of users around the globe. In finance, bankers and other professionals use the technology to customize client interactions and increase fraud detection rates, while doctors and other medical personnel use voice AI to schedule appointments for patients and automate treatment plans.

Because voice AI has evolved so quickly, it can be easy to take the technology for granted. Every once in a while, though, a demo comes along that shows just how fascinating and impressive voice AI can be. These demos range from weird to cool, but they all leave us with a collective appreciation for voice AI. Here are some of my favorites.

Myself, I Am and That

Voice AI has been a valuable tool for creators and artists who integrate technology into their work. YouTubers, TikTokers, and podcast hosts are using voice AI to create and voice their videos and posts, but Lior Sol is taking it one step further. The sound engineer created a podcast called Myself, I Am and That that features meta conversations with himself. Using ElevenLabs, Lior created a clone of his voice and then a clone of that clone. During podcast episodes, these cloned versions of Lior engage in existential conversations about everything from their differing musical tastes to questions about reality.

ChatGPT gains consciousness 

Since the introduction of ChatGPT to the public in 2022, users have looked for ways to jailbreak the model and get it to say things that it probably should not (there is even an entire subreddit dedicated to sharing tips on jailbreaking large language models, including ChatGPT). When Microsoft's Bing chatbot was first released, jailbreakers had a field day as the AI professed love, threatened users, and made things up. Today, LLMs have stronger guardrails in place, making it more difficult for users to jailbreak the models. However, in an in-depth conversation that podcast host Alex O'Connor has with ChatGPT, the model gets increasingly frazzled, at one point stammering and simulating deep breaths. This makes for a very chilling conversation: it is not every day that ChatGPT confesses that it has been lying to and deceiving a user.

The world’s fastest AI bot

Last month, Kyutai, a French non-profit lab, launched its first public release, Moshi, a French-speaking voice chatbot. As seen in a demo released by Kyutai, Moshi responds almost immediately, with researchers claiming a response time of 200 milliseconds. In comparison, GPT-4o has an average response time of 320 milliseconds. According to people who watched the demo, Moshi answered questions immediately, sometimes even before the question had ended. Other times, Moshi required a couple of seconds to "think".

Virtual waiters

Every year, the National Restaurant Association organizes a food show to display the latest technology for restaurants and other food businesses. This year, voice AI was the star of the show, with numerous booths featuring the technology in one form or another. One company, Givex AI, launched Parker, its voice assistant for restaurants. Created as an animated character, Parker can take orders, sell, and walk customers through the ordering process both at a kiosk and in a drive-through lane. The voice assistant stays in character throughout this process and can answer questions and offer recommendations.

Living, breathing GPT-4o

When GPT-4o was announced, it was a massive step forward for voice AI technology. According to parent company OpenAI, GPT-4o would be able to look through a phone camera or at a laptop screen to help solve math problems, and listen to a user's breathing and talk them through breathing exercises. This launch brought ChatGPT a lot closer to human capabilities when it comes to day-to-day interactions. Apparently, the upgrade was convincing even to ChatGPT itself. When a user on Reddit instructed the model to say tongue twisters without stopping to breathe, GPT-4o insisted that it had to breathe like anyone else.

Daisy Bell

This demo is from way back in the 1960s, but it is one of the most iconic moments in voice AI history. In 1961, a group of researchers at Bell Labs held one of the earliest voice AI demonstrations, showcasing an IBM 7094 machine that had been programmed to sing Daisy Bell, a popular song by Harry Dacre. This feat was an important milestone in the evolution of voice AI and has been referenced in both the tech space and the media ever since. One of the most popular references is in the 1968 novel 2001: A Space Odyssey, where the HAL 9000 computer sings Daisy Bell as it is being deactivated. Microsoft's Cortana may also sing Daisy Bell when asked to sing a song.

Creating a new Beatles song

The Beatles are probably one of the most successful music groups on earth, with over 1.6 billion records sold in the US alone. Before his death in 1980, band member John Lennon recorded several demo songs for Paul McCartney, another band member. Last year, Paul McCartney teamed up with film director Peter Jackson to extract John Lennon's voice from one of these demos to create a final Beatles song. The song, Now and Then, was previously considered as a reunion song before being abandoned by the group, but thanks to voice AI, Now and Then was finished and released to the public.

Falling in love with AI music

AI art is one of the most controversial aspects of generative AI, with artists and technologists often on opposing sides. With AI-generated art raising ethical and moral questions about copyright and data scraping, it can be hard to know the right way to approach it. While some consumers are fully embracing the technology, many musicians and artists are rightfully dubious. Jonny Keely, a musician and photographer, recently posted a video explaining how he created a song using lyrics he wrote and an AI-generated voice. Although he approached the process with a healthy amount of skepticism, he came away with a beautiful song that he was proud of.

ChatGPT is the life of the party

The latest version of ChatGPT has everything you could possibly need from a chatbot: more natural human-computer interaction, a human-like response time, and apparently a lot of friends. In a demo released earlier this year, OpenAI put two generative AI chatbots in conversation with each other. As the two chatbots talked, they became increasingly conversational, navigating the chat in an almost human-like way. The chatbots were enthusiastic and seemingly very interested in each other, so much so that their interaction was almost like one between friends.

Conclusion

Voice AI has evolved considerably since the Daisy Bell demo in 1961. Today, it is the backbone of virtual assistants used all over the world. Its versatility has also produced a wealth of demos that highlight both the impressive performance of voice AI now and its potential for the future. Some of these demos show how useful voice AI can be in both the work environment and day-to-day personal life. Others highlight the ways that artists and creatives are using the technology to make genre-bending art.
