Article·AI Trends & News·Jul 27, 2024

Exposing AI Voice Agents: A Case Study

By Eteimorde Youdiowei
PublishedJul 27, 2024
UpdatedJan 17, 2025

Anyone who has ever placed a call to a call center can attest to the frustration of long waiting times. Call centers are notorious for their delays, often requiring customers to wait minutes or even hours before speaking to a human agent. This frustrates customers and burdens call center employees, leading to burnout and poor customer service.

The call center industry has traditionally relied on Interactive Voice Response (IVR) systems to address these issues. IVR allows users to interact with a prerecorded voice and select services via keypad or speech recognition. While this offers some level of automation, IVR systems are limited in intelligence and flexibility.

Enter AI voice agents! These systems bring intelligence to call center automation, allowing AI agents to communicate with customers like human agents. AI voice agents can reduce waiting times and alleviate the workload on human agents to improve customer satisfaction.

In this article, you will learn about the following:

  • The role of AI Voice Agents in modern call centers

  • How AI Voice Agents work and their advantages over traditional IVR systems

  • Use cases and implementation of AI Voice Agents in the call center industry

  • The impact of AI Voice Agents on the call center industry

What are AI Voice Agents?

AI voice agents communicate with humans via speech to perform a task. For example, a person can book a flight by calling a call center. The person would make their request via speech, and the AI voice agent would listen to their request, process it in the background, and respond to them to confirm the status of their flight reservation.

Unlike chatbots, which focus primarily on conversation, agents are more task-centric. Their primary purpose is to accomplish a task for the user with little to no supervision. AI voice agents take this further, with speech being their primary user interface.

The task of processing human speech is not trivial. When people speak, they don't just utter words; they also convey emotions in their speech. People also come from various backgrounds, which means they have different accents and language nuances. The agent must be able to handle all of these factors. 

How AI Voice Agents Work

The typical workflow of an AI voice agent follows these steps at the high level:

  • A user uses their smartphone to place a call

  • The user's speech is sent to the server that hosts the agent

  • Upon receiving the speech data, the agent processes it and generates a verbal response

  • The response is then sent back to the user for back-and-forth communication in real-time.

The AI voice agent architecture consists of the following four major components:

  • Streaming component

  • Speech-to-text (STT) model

  • Language models (LLMs)

  • Text-to-speech (TTS) model

Streaming Component

This component sends the user's speech data to the server that hosts the agent, and the generated audio from the agent is returned to the user. You can stream online using Voice Over IP (VoIP) technologies. Companies like Daily.co and Agora are good choices for streaming services.

You can also stream through traditional telephony technologies like the Public Switched Telephone Network (PSTN), which provides an option for users using conventional phone networks. Twilio is an excellent option for this integration.

Integrating VoIP streaming and telephony is a good option because both options can help cater to diverse user needs.

Speech-to-Text (STT) Model

When the audio data is streamed to the server, it is passed to an STT model (also known as an ASR model). This model converts the speech into text for processing.

Automatic Speech Recognition (ASR) models have been used in the call center industry well before the emergence of AI Voice Agents. They were used as a form of interactivity in IVR systems

For example, ASR models were used for simple tasks like determining which digit the user mentioned. They served as an alternative to dual-tone multi-frequency (DTMF) decoding, which involves the user pressing their keypad.

AI voice agents take ASR models one step further. Rather than just capturing single words or digits the user mentions, they use more sophisticated speech recognition models to capture what the user says. They then pass this extracted text to an LLM.

The ASR models that power the reasoning of these agents should be fast and have a low Word Error Rate (WER), which is essential for call centers that operate in real-time. Speech recognition models like Deepgram's Nova-2 and OpenAI's Whisper are examples of such models.

Large Language Model (LLM)

The LLM is the reasoning engine of the AI voice agents. It uses the text from the ASR model to understand the user’s intent and generate an appropriate response. The LLM can be used for simple tasks like generating responses. 

It can also perform more complex tasks involving external tools (weather APIs, search engines, etc.). For example, if the user wishes to update their personal information, the LLM would understand that intent, fetch the right tool to perform that task, and then reply to the user with the appropriate response.

Text-to-Speech (TTS) Model

The text-to-speech (TTS) model takes the output text from the LLM and generates speech. This involves using generative AI to create realistic-sounding voices. In the past, generated speech was made by concatenating pre-recorded phonemes (the smallest word unit).

Modern-day speech generation uses sophisticated AI models to produce voices almost indistinguishable from human speech. Such models include Aura by Deepgram and Multilingual v2 by Eleven Labs.

After the speech has been generated, it is sent back to the streaming component, which sends it back to the user. This cycle then repeats as the user and the agent interact.

Multi-modal AI Voice Agent

Multimodal models can process multiple modalities (different data types). While LLMs are typically unimodal, working primarily with text, Vision Language Models (VLMs) can handle text and images.

In the context of AI voice agents, the ability to work with text, vision, and audio would improve their efficiency and capability. Although audio remains the most important modality for the agent, integrating other modalities adds versatility. 

For instance, consider a scenario where a customer calls a call center to report a defective item. They could easily take a picture of the item and send it to the multimodal agent for a more accurate assessment. 

Another scenario is a user who needs to send an email or other textual data while on a call with the AI voice agent.

Multimodal AI agents offer several advantages:

  • Efficiency: Processing multiple data types simultaneously streamlines interactions and reduces system switching.

  • Improved Accuracy: Handling multiple modalities allows for a more comprehensive understanding of user inputs and reduces the risk of information loss.

  • Enhanced Customer Interaction: Multimodal agents can interpret nonverbal cues like laughter, giggling, or crying to provide a more empathetic and responsive customer service experience.

With the launch of multimodal models like GPT-4o by OpenAI, which can process user data in real-time, the prevalence of multimodal agents is set to increase. 

Additionally, smaller multimodal models like Moshi by Kyutai, which can perform real-time communication tasks similar to GPT-4o, will enable companies to run them on their infrastructure.

Benefits of AI Voice Agents in Call Centers

The advantages AI voice agents bring to the call center industry are substantial. Here’s a brief overview of their benefits:

  • 24/7 availability

  • Handle high call volumes

  • Provide consistent answers to common queries

  • Free up human agents for more complex issues

  • Reduce wait times and call abandonment

  • Reduced costs

  • Multilingual support

24/7 Availability

AI agents don’t sleep or take breaks. They provide round-the-clock service, ensuring that customers can get assistance any time of the day or night. This continuous availability helps address customer needs promptly, regardless of time zones.

24/7 availability is crucial, especially for a call center. It ensures that customers receive consistent and timely support. 

Love's, a company that provides highway hospitality, uses Replicant's thinking machine as a 24/7 agent to provide customer support anytime.

Handle High Call Volumes

AI voice agents can scale to high volumes of calls. They can efficiently handle numerous calls simultaneously without a drop in service quality. This capability is particularly beneficial during peak times or unexpected surges in call traffic.

Several customers of various Voice AI providers, such as Deepgram, Poly.ai, Replicant, and Gridspace, have reported that using these agents has helped reduce high call volumes. This has increased customer satisfaction (CSAT) and reduced attrition among human agents.

Provide Consistent Answers to Common Queries

Human agents can be inconsistent in their responses, but AI voice agents are usually consistent. They can quickly respond to frequently asked questions with accurate and reliable information. 

This consistency helps in building trust and reliability among customers. Art Coombs, the CEO of KomBea, mentioned in his interview with Deepgram at Project Voice X, the goal of AI voice agents in call centers is to combine the intelligence of humans with the consistency and accuracy of machines.

The Medicare Club, a healthcare startup that helps elderly people enroll in the proper health services, uses Gridspace's Grace AI voice agent to standardize outbound call leads. Diabetic Insurance Solutions, a life insurance brokerage, also uses Grace for the same purpose.

Free Up Human Agents for More Complex Issues

AI voice agents will not replace human agents; instead, they will assist them. They will handle routine and commonly encountered queries. 

At the same time, humans will be reserved for more complex and sensitive issues that require a personal touch—even though most agents will run on emotion engines. This division of labor enhances the call center's overall efficiency.

The Canadian Automobile Association (CAA) uses Replicant’s Thinking Machine to automate routine calls from its members, allowing human agents to tackle more complex and engaging issues.

Reduce Wait Times and Churn Risk

Long waiting times are the most annoying aspect of traditional call centers, often causing customers to abandon their calls. 

AI voice agents significantly reduce wait times by providing immediate responses, decreasing the likelihood of customers abandoning their calls out of frustration. See the CBS video below on how AI revamps the call center industry and how wait times can be annoying.

Tethr, for example, is an AI-powered conversation intelligence platform that helps a company like MetTel manage and monitor rep behavior across the contact center. Since implementing it, MetTel’s care team has seen a 35% decrease in escalated calls and a 28% decrease in repeat calls.

Reduced Costs

Implementing AI voice agents can lead to substantial cost savings for call centers. They reduce the need for a large human workforce, reducing salary and training expenses while maintaining high service levels.

For example, Five9 is an industry-leading cloud contact center solution provider that facilitates billions of call minutes annually. 

They use the real-time transcription accuracy of Deepgram’s STT models for improved call resolution and higher self-service containment rates without needing a live agent, saving their customers a ton of money. Learn more from this blog post.

Multilingual Support

AI voice agents can support multiple languages, catering to a diverse customer base and ensuring language barriers do not impede effective communication. This multilingual capability broadens the call center's reach and enhances service quality.

Companies like Vapi.ai, a Deepgram technology partner, also provide real-time multi-language support for customers to build call center applications. 

It showcases one of the benefits of voice AI agents: you can easily deploy a multilingual agent with little to no effort, which can scale more efficiently than finding and hiring human agents who speak that particular language.

See the complete list of languages and models Deepgram provides in this documentation.

Use Cases and Applications

The versatility and efficiency of AI voice agents make them valuable for key use cases and applications.

Customer Service and Support

AI voice agents are revolutionizing customer service by providing immediate, accurate, and consistent responses to customer inquiries. They can handle various support issues, from troubleshooting technical problems to answering frequently asked questions. 

Sharpen Technologies, for instance, is an agent-focused contact center platform that uses Deepgram’s voice AI services to simplify every customer and agent interaction for efficient resolution across voice, digital, and self-service channels.

Toyota uses the E-Care AI voice agent service that Cognigy, another Deepgram partner, developed. E-Care supports customers with any inquiries about their vehicles. It monitors the customer's vehicle and calls them when it discovers a fault.

Order Management and Tracking

AI voice agents streamline the process of placing, managing, and tracking orders on e-commerce platforms by providing real-time updates on order status, shipping details, and expected delivery times. 

Automating these tasks not only minimizes errors but also enhances operational efficiency.

Appointment Scheduling

AI voice agents simplify appointment scheduling and handle booking, rescheduling, and cancellations. They access calendars (a “tool”), check availability, and confirm appointments with customers, reducing the need for manual intervention. 

This automation ensures that appointments are managed efficiently and accurately, minimizing scheduling conflicts and errors. Customers enjoy the convenience of scheduling appointments at any time without waiting for a human agent's assistance. 

Sameday is a conversational AI tool that can schedule appointments and follow-ups for customer engagement.

Billing Inquiries

Billing inquiries are efficiently managed with the help of AI voice agents, who provide information about account balances, payment due dates, and transaction history. 

They assist customers in making payments, setting up payment plans, and addressing billing discrepancies. Businesses can reduce their human agent workload and provide accurate and timely account information by automating these tasks.

Kodif's intelligent automation platform uses agent shortcuts to reduce handle time by up to 40% and onboarding time by up to 70%. It ensures consistency in responses, which is crucial for maintaining a high-quality customer experience.

Outbound Calling and Proactive Notifications

AI voice agents effectively manage outbound calls and proactive notifications. These agents can remind customers about upcoming appointments, notify them about order updates, and inform them about new products and promotions. 

This proactive approach enhances customer engagement and ensures that important information is communicated effectively. Automating outbound calling allows businesses to reach a larger audience more efficiently.

Revenue.io, for instance, is a platform that automates the sales execution workflow for teams and provides real-time guidance. They used Deepgram’s ASR to train and customize a speech model for their agents using audio from their platform.

Considerations for Implementing AI Voice Agents for Call Center Applications

Several key considerations must be considered when implementing AI voice agents in a call center to ensure smooth integration and optimal performance.

Integrating with Existing Call Center Systems

If a business already has an existing call center system, careful consideration must be given to integrating it with an AI voice agent. For systems built on the Public Switch Telephone Network (PSTN), connection to the internet is required, often achieved through techniques like SIP Trunking. The same technique can be applied if the call center operates with a Private Branch Exchange (PBX).

Additionally, the agents need access to current information about the platform users to provide context during customer calls. This information typically resides within CRM platforms, customer databases, and other backend systems. 

Proper integration ensures that AI agents can access and utilize this customer information effectively for more personalized and accurate interactions.

Designing Effective Conversation Flows

Designing effective conversation flows for an AI voice agent involves careful prompting and fine-tuning. Properly crafted prompts guide the LLM in generating natural, intuitive dialogues anticipating customer needs. 

Fine-tuning the model for specific scenarios, such as FAQs, troubleshooting steps, and complex inquiries, ensures the AI agent can handle a wide range of interactions smoothly. 

Businesses can improve agents' ability to deliver a positive and efficient customer experience by focusing on precise prompting and iterative fine-tuning.

Gracefully Handling Accents, Noise, and Interruptions

You must select a good ASR model to ensure that AI voice agents can accurately understand and respond to different accents and dialects, regardless of the user's background. 

Additionally, they should handle background noise and interruptions without compromising conversation quality. Implementing an effective speech recognition model that incorporates features ensures that the AI agent can comprehend the speech and respond to users effectively, even in a noisy background.

Routing to Human Agents When Needed

While AI voice agents can handle a wide range of tasks, there will be situations where human intervention is necessary. Establishing clear protocols for routing calls to human agents ensures that complex or sensitive issues are addressed appropriately. 

This routing process should be seamless, with AI agents providing context and details to human agents to ensure continuity.

Measuring Performance and ROI

Measurement of performance and return on investment (ROI) are essential to evaluating the success of AI voice agent implementation. Key performance indicators (KPIs) such as call resolution rates, customer satisfaction scores, and average handling times, should be tracked.

Additionally, assessing the cost savings and efficiency gains from automation can help determine the overall impact on the call center's operations.

Conclusion: How AI Voice Agents Automate the Call Center

Thanks to AI voice agents, the call center industry is transforming: long wait times, poor service quality, and the burden of handling high call volumes are becoming issues of the past. This growth is still early, but we can expect further innovations that enhance the overall customer experience as technology evolves. 

Customer interactions are changing as call centers use advanced technologies like natural language processing and speech recognition. This evolution allows businesses to anticipate better and meet customer needs, increasing loyalty and satisfaction.

FAQs

What are AI Voice Agents?

AI voice agents are intelligent agents designed to interact with humans via speech to perform various tasks, such as answering inquiries, scheduling appointments, and handling customer service requests. They utilize natural language processing (NLP) and speech recognition to understand and respond to user inputs.

How do AI Voice Agents improve call center operations?

AI voice agents streamline call center operations by automating tasks such as responding to FAQs, managing orders, scheduling appointments, and handling billing inquiries. They can operate 24/7, reduce wait times, and free up human agents for more complex issues, ultimately improving efficiency and customer satisfaction.

What technologies power AI Voice Agents?

The technology behind AI voice agents includes several key components: automatic speech recognition (ASR) for converting human speech to text, a large language model (LLM) that serves as the agent's reasoning engine, and a text-to-speech (TTS) model that takes the output from the LLM and converts it back to speech.

Additionally, streaming and telephony components handle the receiving and sending of speech data. These technologies enable seamless and efficient interactions between AI voice agents and users.

Are AI Voice Agents replacing human agents?

AI voice agents are designed to complement human agents rather than replace them. They handle repetitive tasks and routine inquiries, allowing human agents to focus on more complex issues that require empathy, creativity, and critical thinking. 

In their current state, AI voice agents cannot be fully entrusted with sensitive tasks.

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.