Introducing Deepgram Aura: Lightning Fast Text-to-Speech for Voice AI Agents

By Josh Fox
Published Mar 12, 2024
Updated Sep 30, 2024

tl;dr:

  • Today we’ve officially launched the latest component of our Voice AI Platform: Deepgram Aura, the first text-to-speech model built for responsive, conversational AI agents and applications.

  • Aura includes a dozen natural, human-like voices with lower latency than any comparable voice AI alternative and is already being used in production by several of our customers.

  • Experience Aura yourself with our open source tech demo or be among the first to build the Aura API into your product today! Sign up now to get started with Aura and receive $200 in credits absolutely free!

Announcing Deepgram Aura Text-to-Speech API

In the three months since our initial preview announcement for Aura, our inaugural text-to-speech model, we’ve been amazed by the progress our early adopters have made. In fact, several are already successfully running Aura in production, with many more nearly ready to deploy. That’s why we’re so excited to officially announce the general availability of Aura, representing an important expansion of the Deepgram Voice AI platform.

With the large language model (LLM) gold rush in full swing (and gaining momentum by the day), the race is on among upstart innovators and tech titans alike to redefine computing and the way we’ll interact with technology in the future using nothing more than our voices. Before such a world can be built, however, significant limitations in speed, cost, quality, and scale must first be overcome. With Deepgram Aura, especially when paired with our industry-leading Nova-2 speech-to-text API, developers now have the tools they need to easily (and quickly) exchange real-time information between humans and LLMs and build responsive, high-throughput AI agents and conversational AI applications.

But don’t take our word for it. We built an early proof-of-concept prototype and open sourced it (bugs and all!). Experience Aura firsthand with this interactive demo, or try converting your own text to audio on our product page.



The Deepgram Voice AI Platform

With the addition of Aura, Deepgram now provides a comprehensive voice AI platform, offering the complete set of APIs developers need to create powerful voice AI experiences across a vast array of use cases. The platform consists of three main components, corresponding to the primary phases of a conversational interaction:

  1. Listen: Using perceptive AI models like speech-to-text to accurately transcribe conversational audio into text.

  2. Think: Using abstractive AI models (LLMs and task-specific language models) that perform natural language understanding, information retrieval, and other intelligent tasks at the heart of an AI agent system.

  3. Speak: Using generative AI models like text-to-speech and LLMs to interact with human speakers just as they would with another person.


With today’s announcement, developers now have a one-stop shop in Deepgram for both speech recognition and voice generation APIs that enable the fastest response times and most natural-sounding conversational flow in real-time voicebot and AI agent applications.
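To make the Listen, Think, Speak phases concrete, here is a minimal sketch of a single conversational turn. It is an illustration under stated assumptions, not Deepgram's implementation: the `VoiceAgent` class and the three `fake_*` functions are hypothetical stand-ins, and in a real agent each would wrap a speech-to-text call, an LLM call, and a text-to-speech call respectively.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """One conversational turn: audio in, audio out (hypothetical sketch)."""

    listen: Callable[[bytes], str]  # speech-to-text: audio -> transcript
    think: Callable[[str], str]     # language model: transcript -> reply text
    speak: Callable[[str], bytes]   # text-to-speech: reply text -> audio

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.listen(audio_in)
        reply = self.think(transcript)
        return self.speak(reply)


# Stand-in implementations so the sketch runs without any network calls.
def fake_listen(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_think(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_speak(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(listen=fake_listen, think=fake_think, speak=fake_speak)
print(agent.run_turn(b"book an appointment"))
```

Keeping each phase behind a plain callable means any one of them can be swapped out (for example, a different LLM) without touching the other two.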


Real-time performance for real-time voice agents

As we’ve previously discussed, text-to-speech use cases can be classified into two discrete categories: high production and high throughput. High production usage primarily focuses on best-in-class voice quality for niche content production use cases like audiobook narration and monologue-driven radio and TV ads. The main opportunity is leveraging AI to perfectly emulate a human speaker and avoid the cost of a highly paid voice actor, and these use cases are primarily served with UI-centric production tools.

In contrast, high throughput use cases are API-driven and all about scale and speed: handling many short, real-time conversations between humans and AI agents. From booking an appointment with your doctor to ordering fast food, these tasks are prevalent throughout our daily lives. Voice quality is certainly important, but the key to achieving positive outcomes is the naturalness of the conversational flow. Speed and efficiency are the name of the game for transcription and voice synthesis, as every millisecond counts, especially when an LLM is used as the brains in an AI agent system.

Recent demos built using Deepgram and the Groq® LPU™ Inference Engine are a powerful proof point of the experience a lightning fast voice agent can provide. Groq offers the fastest language processing accelerator on the market, and when paired with Deepgram’s STT and TTS APIs, enables the type of realistic, naturally flowing conversation between human and AI agent that previously only existed in the realm of science fiction. And just as importantly, these capabilities are available through powerful APIs that make it easy to build for high throughput use cases as shown in the video below.



Build with Aura today

Aura is engineered to offer the optimal mix of speed, quality, and efficiency: it's the quickest among premium options and the finest among the fast ones. When integrated alongside Deepgram Nova-2 STT and your chosen LLM, it enables real-time agents that can listen, act, and communicate with natural-sounding voices, revolutionizing customer interactions.

Aura will equip AI agents with lifelike voices, and it’s been developed with capabilities that replicate authentic human dialogues. This includes prompt replies, natural cadences that include pauses, audible breaths and hesitation sounds like 'uh' and 'um,' as well as dynamic adjustments in tone and emotion to suit the conversation's context.

Customers will initially have a choice amongst 12 English-speaking voices (7 male, 5 female) with additional voices planned for future releases. All of our voices are trained on high quality conversational datasets and have average response times below 250 ms for typical dialogue sequences. Aura will follow Deepgram’s standard usage-based pricing scheme and starts at just $0.015/1K characters.

Aura is available today to anyone with a valid Deepgram API key. That means all current customers can start building with it now, and all new signups can take advantage of our free $200 in credits, good for more than 13M characters' worth of voice synthesis using any of our available voices. 
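The "more than 13M characters" figure follows directly from the launch price of $0.015 per 1K characters. The short sketch below works through that arithmetic; the helper names are ours, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

# Figures quoted in this post.
PRICE_PER_1K_CHARS = Fraction(15, 1000)  # $0.015 per 1,000 characters
FREE_CREDITS_USD = 200


def synthesis_cost(num_chars: int) -> Fraction:
    """Dollar cost of synthesizing num_chars characters at the quoted rate."""
    return PRICE_PER_1K_CHARS * num_chars / 1000


def chars_for_budget(dollars: int) -> int:
    """Whole number of characters a dollar budget covers at the quoted rate."""
    return int(Fraction(dollars) * 1000 / PRICE_PER_1K_CHARS)


print(chars_for_budget(FREE_CREDITS_USD))  # 13333333
print(float(synthesis_cost(1_000_000)))    # 15.0
```

In other words, $200 in credits covers roughly 13.3 million characters, and one million characters costs $15 at this rate.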

Learn more from our API Documentation or visit our product page to explore the details of Aura, our new text-to-speech API. 
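As a rough idea of what a request might look like, the sketch below builds (but does not send) an HTTP request for speech synthesis using only the Python standard library. The endpoint URL, the `model` query parameter, the example voice name `aura-asteria-en`, the `Token` authorization scheme, and the JSON body shape are all assumptions not stated in this post; confirm each against the API Documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the official API documentation.
SPEAK_URL = "https://api.deepgram.com/v1/speak"


def build_speak_request(
    text: str,
    api_key: str,
    model: str = "aura-asteria-en",  # assumed example voice name
) -> urllib.request.Request:
    """Build a POST request asking the service to synthesize `text`."""
    query = urllib.parse.urlencode({"model": model})
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{SPEAK_URL}?{query}",
        data=body,
        headers={
            "Authorization": f"Token {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speak_request(
    "Hello, how can I help you today?",
    api_key="YOUR_DEEPGRAM_API_KEY",  # placeholder, not a real key
)
print(req.get_method(), req.full_url)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```

Building the request separately from sending it keeps the sketch runnable offline and makes the URL, headers, and body easy to inspect.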


Customer Spotlight: Aura in production

A number of our early build partners have already successfully integrated Deepgram Aura into their applications, and we’re excited to showcase a few examples of their early work.

Humach

Humach {humans + machines}, a digital and human contact center solutions company established in 1988, provides innovative Customer Experience (CX) solutions to leading Fortune 500 brands. The company specializes in developing conversational AI agents for various sales and support use cases, including retail, utility, travel, and healthcare, aiming to enhance the customer experience.

Humach uses Deepgram for transcription services and is integrating Aura Text-to-Speech (TTS) into its conversational flows to help retail clients manage the ordering process. The company selected Aura for its low latency and superior voice quality after comparisons with other vendors.

Vapi

Companies like Vapi.ai are democratizing voicebots and enabling the development of smarter and more engaging voice AI chatbots for everyone to use. Vapi offers an end-to-end platform for building voice assistants, and they are integrating Deepgram Aura because of its speed and human-like voice quality. Vapi has extensive experience working across the AI agent stack and understands how important low-latency performance is to drive user engagement and provide a compelling customer experience. 

Daily

Daily has been a long-time partner of Deepgram, integrating our speech-to-text APIs into their leading WebRTC API platform for audio and video. Daily gives developers everything they need to integrate audio and video call features into real-time applications, including AI-powered tools for conversational AI and voicebot use cases. Applications powered by Daily can be found across a number of domains including telehealth, edtech, and virtual collaboration. They recently integrated Deepgram Aura into their production toolkits for LLM-powered interactive voice use cases as showcased in the video below.


What’s Next

At Deepgram, we are committed to releasing early and often, and we will continue to expand Aura’s capabilities throughout the year and beyond. This includes making our voices even more lifelike and conversational, adding voices for new use cases, expanding language support for global applications, and more. As we’ve discussed, scaled voice agents are a high throughput use case, and their success will ultimately depend on a voice AI solution that strikes the right balance between natural voice quality, responsiveness, and cost-efficiency. We’re looking forward to continuing to work with innovative companies like Humach, Vapi, and Daily across both speech-to-text and text-to-speech as they leverage the Deepgram Voice AI platform to create the AI agent future.

To learn more, please see our API Documentation or visit our product page to explore the details of Aura, our new text-to-speech API. Sign up now to get started and receive $200 in credits (good for more than 13M characters’ worth of voice generation) absolutely free!




If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions or contact us to talk to one of our product experts for more information today.
