Article·Announcements·Dec 7, 2023

Coming Soon: Deepgram Aura, Conversational Text-to-Speech for Voice AI Agents

By Natalie Rutgers
Published Dec 7, 2023
Updated Jun 13, 2024

tl;dr:

  • The LLM-centric future demands a rethinking of how we approach speech-to-text and text-to-speech.

  • Deepgram is unveiling Aura, a text-to-speech model that delivers human-like conversational quality and is built to be faster and more efficient than other voice AI alternatives.

  • Sign up on our waitlist and be the first to access Aura, our new text-to-speech API!

Meet Deepgram Aura: real-time text-to-speech for real-time AI agents

It’s been a year since large language models (LLMs) seemingly went mainstream overnight (Happy Birthday, ChatGPT!), and the world has witnessed both rapid development of these technologies and immense interest in their potential. We believe that we have reached an inflection point where voice-based interfaces will become the primary means of accessing LLMs and the experiences they unlock. Here are a few recent signals in support of our thesis:

  • Good old-fashioned voice notes are enjoying a healthy resurgence.

  • According to a recent survey, a majority of respondents stated phone calls are still their preferred communication channel for resolving customer service issues.

  • A boom in wearable devices equipped with continuous listening and speech AI technology is gaining steam.

  • OpenAI recently enabled voice interactions in ChatGPT.

  • A wave of interest in voice-first experiences and tools is sweeping across brands, investors, and tech companies.

Thanks to ChatGPT and the advent of the LLM era, the conversational AI tech stack has advanced sufficiently to support productive (not frustrating) voice-powered AI assistants and agents that can interact with humans in a natural manner. We have already seen this among our most innovative customers, who are actively turning to these technologies to build a diverse range of AI agents for voice ordering systems, interview bots, personal AI assistants, automated drive-thru tellers, and autonomous sales and customer service agents.

While these AI agents hold immense potential, many customers have expressed their dissatisfaction with the current crop of voice AI vendors, citing roadblocks related to speed, cost, reliability, and conversational quality. That’s why we’re excited to introduce our own text-to-speech (TTS) API, Deepgram Aura, built for real-time, conversational voice AI agents.

Whether Aura is used on its own or in conjunction with our industry-leading Nova-2 speech-to-text API, developers will soon have a complete speech AI platform with the essential building blocks they need to build the high-throughput, real-time AI agents of the future.

We are thrilled about the progress our initial group of developers has made using Aura, so much so that we are extending limited access to a select few partners who will be free to begin integrating with Aura immediately. With their feedback, we’ll continue to enhance our suite of voices and API features, as well as ensure a smooth launch of their production-grade applications.

Sign up for our waitlist today and become the first to try Aura!


What Customers Want

I feel the need, the need for speed

What we’ve heard from many of our customers and partners is that voice AI technology today caters to two main areas: high production and high throughput.

High production is all about crafting the perfect voice. It's used in projects where every tone and inflection matters, like video games or audiobooks, to really bring a scene or story to life. Here, voice quality is king, with creators investing hours to fine-tune every detail for a powerful emotional impact. The primary benefit is the ability to replace a highly paid voice actor with AI, which gives you more dynamic control over what’s being said while also saving on cost. But these use cases are more specialized and represent just a sliver of the overall voice AI opportunity.

On the flip side, high throughput is about handling many quick, one-off interactions for real-time conversations at scale. Think fast food ordering, booking appointments, or inquiring about the latest deals at a car dealership. These tasks are relevant to just about everyone on the planet, and they require fast, efficient text-to-speech conversion for an AI agent to fulfill them. While voice quality is still important to keep users engaged, quality here is more about the naturalness of the flow of conversation and less about sounding like Morgan Freeman. The primary focus for most customers in this category is on improving customer outcomes, meaning speed and efficiency are must-haves for ensuring these everyday exchanges are smooth and reliable at high volume.

Although high production use cases seem to be well served by UI-centric production tools, high-throughput, real-time use cases still mostly rely on APIs provided by the major cloud providers. Our customers have been telling us that these APIs fall short: the quality is insufficient for a good user experience, the latency is too high for real-time use cases to work, and the costs are too high to operate at scale.

More human than human

With Aura, we’ll give realistic voices to AI agents. Our goal is to craft text-to-speech capabilities that mirror natural human conversations, including timely responses, the incorporation of natural speech fillers like 'um' and 'uh' during contemplation, and the modulation of tone and emotion according to the conversational context. We aim to incorporate laughter and other speech nuances as well. Furthermore, we are dedicated to tailoring these voices to their specific applications, ensuring they remain composed and articulate, particularly in enunciating account numbers and business names with precision.

In blind evaluation trials conducted for benchmarking, early versions of Aura have consistently been rated as sounding more human than prominent alternatives, and on average were even rated above human speakers for a majority of audio clips. We were pleasantly surprised by these results (stay tuned for a future post containing comprehensive benchmarks for speed and quality), so much so that we’re accelerating our development timeline and publicly announcing today’s waitlist expansion.

Here are some sample clips generated by one of the earliest iterations of Aura. The quality and overall performance will continue to improve with additional model training and refinement. We encourage you to give them a listen and note the naturalness of their cadence, rhythm, and tone in the flow of conversation with another human.

Our Approach

For nearly a decade, we’ve worked tirelessly to advance the art of the possible in speech recognition and spoken language understanding. Along the way, we’ve transcribed trillions of spoken words with high accuracy. Our model research team has developed novel transformer architectures equipped to deal with the nuances of conversational audio across different languages, accents, and dialects, while handling disfluencies and the changing rhythms, tones, cadences, and inflections that occur in natural, back-and-forth conversations.

And all the while, we’ve purposefully built our models under limited constraints to optimize their speed and efficiency. With support for dozens of languages and custom model training, our technical team has trained and deployed thousands of speech AI models (more than anybody else) which we operate and manage for our customers each day using our own computing infrastructure. 

We also have our own in-house data labeling and data ops team with years of experience building bespoke workflows to record, store, and transfer vast amounts of audio in order to label it and continuously grow our bank of high-quality data (millions of hours and counting) used in our model training.

These combined experiences have made us experts in processing and modeling speech audio, especially in support of streaming use cases with our real-time STT models. Our customers have been asking if we could apply the same approach for TTS, and we can.

So what can you expect from Aura? The same market-leading value and performance that Nova-2 delivers for STT. Aura is built to deliver speed, quality, and efficiency together: the fastest of the high-quality options, and the best quality of the fast ones. And that’s really what end users need and what our customers have been asking us to build.

What's Next

As we’ve discussed, scaled voice agents are a high-throughput use case, and we believe their success will ultimately depend on a unified approach to audio, one that strikes the right balance between natural voice quality, responsiveness, and cost-efficiency. And with Aura, we’re just getting started. We’re looking forward to continuing to work with customers like Asurion and partners like Five9 across both speech-to-text and text-to-speech as we help them define the future of AI agents, and we invite you to join us on this journey.

We expect to reach general availability early next year. If you’re working on any real-time AI agent use cases, join our waitlist today to get a head start on development as we continue to refine our model and API features with your direct feedback.



If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions or contact us to talk to one of our product experts for more information today.
