Article · Announcements
7 min read

Introducing Deepgram’s Voice Agent API

By Josh Fox
Published Sep 18, 2024 · Updated Sep 19, 2024

tl;dr:

  • Today we’ve officially launched the newest addition to our Voice AI Platform: the Deepgram Voice Agent API, a unified voice-to-voice API that enables natural-sounding conversations between humans and machines.

  • Powered by the industry’s fastest, most performant speech recognition and voice synthesis models, our voice agent stack listens, thinks, and speaks naturally and in real time.

  • Experience our voice agent API yourself with our interactive demo, or be among the first to build it into your product today. Sign up now to get started and receive $200 in credits, absolutely free!

Meet the Deepgram Voice Agent API: real-time conversational AI in one easy-to-use API

Deepgram is excited to unveil the latest addition to its voice AI platform: the Deepgram Voice Agent API, a unified voice-to-voice API for AI agents that enables natural-sounding conversations between humans and machines. With one powerful API, we enable enterprises and developers to easily create LLM-powered AI agents that listen, think, and speak with the same intelligence and emotive quality as a person.

Powered by the industry’s fastest, most performant speech recognition and voice synthesis models, our voice agent stack:

  • Listens, thinks, and speaks naturally and in real time (seriously). 

  • Gracefully handles interruptions with first-of-its-kind end-of-thought (EOT) detection modeling.

  • Maximizes developer control, letting builders choose among open-source, closed-source, and bring-your-own LLMs (illustrated in the sketch after this list).

  • Scales to serve your production workloads, with pricing that makes running at scale cost-effective.

  • Meets your security and data privacy requirements with flexible deployment modes (including self-hosted options for VPC and on-premises).
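
To make “one API” concrete, here’s a minimal client sketch: a single WebSocket carries the caller’s audio up and the agent’s synthesized speech down, and one settings message selects the listen (STT), think (LLM), and speak (TTS) stages. The endpoint URL and message fields below are illustrative assumptions, not the documented schema; consult the Voice Agent API docs for the real contract.

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical endpoint and settings shape, for illustration only.
AGENT_URL = "wss://agent.deepgram.com/agent"

SETTINGS = {
    "type": "SettingsConfiguration",
    "agent": {
        "listen": {"model": "nova-2"},          # Deepgram STT
        "think": {
            "provider": "open_ai",              # or an open-source / BYO LLM
            "model": "gpt-4o",
            "instructions": "You are a concise, friendly support agent.",
        },
        "speak": {"model": "aura-asteria-en"},  # Deepgram TTS voice
    },
}

async def run_agent() -> None:
    # Note: websockets >= 14 renames extra_headers to additional_headers.
    headers = {"Authorization": "Token YOUR_DEEPGRAM_API_KEY"}
    async with websockets.connect(AGENT_URL, extra_headers=headers) as ws:
        await ws.send(json.dumps(SETTINGS))  # configure the agent once
        async for message in ws:
            if isinstance(message, bytes):
                pass  # agent speech audio: queue it for playback
            else:
                print(json.loads(message))  # transcripts and agent events

asyncio.run(run_agent())
```

Everything else, from streaming microphone audio to handling interruption events, happens over that same socket, which is what keeps the integration surface small.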

For a glimpse of what can be built with our new voice agent API, check out the videos below. In the first, we demonstrate a customer support use case in which the AI voice agent leverages next-gen end-of-speech prediction to handle the long pauses that occur as a phone-number-based ID is spoken aloud. The agent maintains a responsive, natural conversational flow and delivers a high-quality customer interaction grounded in company- and product-specific context:

In the second video, we showcase a drive-thru agent’s robust performance in accurately understanding a human speaker in a noisy outdoor environment. The agent demonstrates complex action-taking even when interrupted by the speaker:

We’re also excited to share an early proof of concept built with our new voice agent API, and we encourage you to try it firsthand with this interactive demo.

Enabling the future of Voice-First AI

To the average consumer, terms like “voicebot” and “AI agent” are likely to conjure memories of frustrating interactions with traditional IVRs, text-based chatbots, and personal voice assistants like Siri and Alexa that fail to complete even the simplest of tasks. However, recent advances in generative AI technology have given us the tools we need to finally build engaging, human-like voice agents that have the potential to transform the business world:

  • Speech-to-text (STT) - Low-latency transcription with superhuman accuracy, like Deepgram’s Nova-2 delivers, for converting a human’s spoken words into text the AI agent can reason over.

  • Text-to-speech (TTS) - Low-latency, natural-sounding voice synthesis, like Deepgram’s Aura provides, for delivering the AI agent’s responses back to the human as human-like speech.

  • Large language models (LLMs) - Powerful, responsive generative AI models, like Llama 3 and GPT-4, that serve as the brains of the modern conversational AI stack, handling chat completion and task execution (see the pipeline sketch below).
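
Chained together, these three components form the conventional cascaded pipeline. The sketch below shows that orchestration in miniature; the three functions are stand-ins for real STT, LLM, and TTS calls, not Deepgram SDK code.

```python
# Stand-in stages: in a real system each call would hit an STT, LLM,
# or TTS service (e.g. Deepgram Nova-2, your LLM of choice, Deepgram Aura).

def transcribe(audio: bytes) -> str:
    """STT: spoken audio in, text out."""
    return "what are your hours on saturday"

def complete(history: list[dict], user_text: str) -> str:
    """LLM: conversation so far in, next agent utterance out."""
    history.append({"role": "user", "content": user_text})
    reply = "We're open 9am to 5pm on Saturdays."
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """TTS: text in, playable audio out."""
    return text.encode()  # placeholder for real audio

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    # Three sequential hops; each adds latency to the turn, which is
    # why collapsing them behind one voice-to-voice API matters.
    return synthesize(complete(history, transcribe(audio_in)))

history: list[dict] = [{"role": "system", "content": "You are a store assistant."}]
audio_out = handle_turn(b"<caller audio>", history)
```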

To build the voice-powered agentic AI future, developers must integrate these key components, but there’s much more to consider beyond simply linking these elements in a pipeline and orchestrating the handoffs between them. Crafting engaging, enterprise-grade voice agents requires world-class engineering at the model level to effectively tackle key challenges and infuse a human touch into the artificial. Key focus areas include:

  • Noisy audio: Real-world audio is messy, full of background noise and varied acoustic conditions that the speech-to-text model must handle robustly.

  • Lightning-fast responses: Today’s pipelines often take 1.5 seconds or more to respond; that must come down to under a second of latency so conversations flow naturally, without awkward pauses or delays, just as we’re accustomed to in human-to-human interactions.

  • Conversational cues recognition: Agents must adeptly navigate the subtleties of conversational cues, knowing when to yield and when to continue after an interruption, and recognizing whether a speaker has finished or intends to keep going, to enable smooth interactions with the same finesse human speakers exhibit (see the barge-in sketch after this list).

  • Contextual intelligence: Voice agents need advanced understanding capabilities, naturally comprehending the context behind a conversation, so they can respond with the most appropriate information and with vocal expressiveness that feels genuine and empathetic, bringing a human touch to digital conversations.

  • Action taking: AI agents must understand intent and take action, from scheduling appointments to sending follow-up information, streamlining tasks and enhancing productivity.

  • Controllability: The LLM landscape is evolving rapidly, and no universal model fits all needs. Agent builders require flexible options, allowing them to select the general-purpose LLM, or the fine-tuned, task-specific language model, that best aligns with their use case in performance and cost efficiency.
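
On conversational cues, the hardest everyday case is barge-in: the user starts talking while the agent is mid-sentence. Below is a simplified sketch of that turn-taking logic. In a production agent these events would be driven by voice activity detection and end-of-thought modeling rather than application code like this; the class and event names are illustrative.

```python
import asyncio
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInHandler:
    """Cancels agent speech the moment the user starts talking."""

    def __init__(self) -> None:
        self.state = TurnState.LISTENING
        self.playback: asyncio.Task | None = None

    async def _play(self, audio: bytes) -> None:
        try:
            # Pretend playback: sleep for the clip's rough duration.
            await asyncio.sleep(len(audio) / 32_000)
        finally:
            self.state = TurnState.LISTENING

    def on_agent_reply(self, audio: bytes) -> None:
        # Agent has something to say: start speaking.
        self.state = TurnState.SPEAKING
        self.playback = asyncio.ensure_future(self._play(audio))

    def on_user_speech_started(self) -> None:
        # Barge-in: stop talking immediately and yield the turn,
        # rather than finishing the sentence over the user.
        if self.state is TurnState.SPEAKING and self.playback:
            self.playback.cancel()
            self.state = TurnState.LISTENING
```

The subtle part is not cancelling playback but deciding when to: a pause in the middle of a spoken phone number is not a finished turn, which is exactly what end-of-thought detection models.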


Taken together, WORDS + CONTEXT + TIMING are what make possible the effortless back-and-forth exchange of information and ideas that constitutes human conversation. And for the first time in history, we now have all the ingredients we need to truly replicate it in an artificial intelligence system.

At Deepgram, we’ve spent nearly a decade building, deploying, and managing thousands of voice AI models to process billions of hours of conversational audio in production. We’ve applied the countless insights gained from that experience to the development of our new voice agent API, optimizing both the models and the system architecture to deliver exceptional performance that sets a new standard in human-machine interaction.

Getting started

Several participants in our Enterprise Voice AI Accelerator Program are already nearing the launch of their first AI voice agents built using our new API, and we’re excited to share their progress in an upcoming article—so stay tuned! If you’re facing challenges building, deploying, or scaling real-time voice agents, we can help. Our new API is now available for early access to select customers. Fill out the request form below to start developing enterprise-grade AI voice agents today!


If you have any feedback about this post, or anything else regarding Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions or contact us to talk to one of our product experts for more information today.
