Language models, especially Large Language Models (LLMs), have essentially become the face of AI. However, there’s an insidious problem with them. Thus far, the AI community has primarily trained its AI on text data while neglecting audio data. As a result, we’re holding our LLMs back, since we’re only teaching them how to read/write, but never teaching them how to speak/listen.
Thankfully, a few companies are in the process of fixing this issue. On the path to more robust LLMs, the field has produced a few incredible products along the way. One such product is a series of text-to-speech (TTS) models, each with its own unique strengths. We’ve listed the best TTS models of 2024 (so far).
If you’re building an app that requires a voice, from a new GPS system to a video game or even an IVR system, these models are for you!
ElevenLabs has been generating AI voices since 2022 with an emphasis on synthesizing speech that sounds as natural as possible in various languages. The video above showcases their technology’s skills with Spanish, English, German, Polish, and French.
Most recently, they released the ElevenLabs Dubbing Studio, enabling you to translate massive amounts of content for people all over the world. It supports 29 languages, and even the advertisement for the Dubbing Studio uses an ElevenLabs voice!
You can get started using ElevenLabs for free, and their API comes equipped with user-friendly documentation, guiding you on everything from Websockets to Streaming.
If you want to see ElevenLabs’ capabilities first-hand, click here.
Strengths: Extremely natural-sounding voices, unique dubbing studio
Most Common Use Cases: Videos, gaming, audiobooks, AI chatbots, general entertainment
Deepgram’s Aura model is the pinnacle of Text-to-Speech for real-time conversations. If you’re creating an IVR system or AI Agents to handle real-time conversations at scale, Aura is undoubtedly the best choice for you. With less than 200ms latency, Deepgram’s TTS model is perhaps the fastest the AI world has ever seen.
The video above displays the model’s extremely fast response time in a replication of a few real-life phone calls. As you can see, the latency consistently remains below 0.2 seconds. So long story short, if you need speed for any kind of real-time application, Deepgram’s Aura has you covered!
Furthermore, Deepgram has the goal of crafting text-to-speech capabilities that mirror natural human conversations, including timely responses, the incorporation of natural speech fillers like 'um' and 'uh' during contemplation, and the modulation of tone and emotion according to the conversational context.
“Deepgram showed me less than 200ms latency today. That’s the fastest text-to-speech I’ve ever seen. And our customers would be more than satisfied with the conversation quality.”
- Jordan Dearsley, Co-founder at Vapi
Click here for additional examples of Deepgram Aura in action!
Strengths: Extremely fast, natural-sounding voices, minimal latency, high throughput, lifelike
Most Common Use Cases: Real-time AI Voice Agents, IVR, Conversational chatbots, Contact centers, entertainment
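If you want to check a latency claim like the sub-200ms figure above for yourself, one rough approach is to time how long a streaming response takes to deliver its first chunk of audio. The sketch below is only an illustration: `measure_first_chunk_latency` and `fake_tts_stream` are hypothetical names, and the fake stream is a stand-in for whatever iterator of audio bytes your TTS client returns, not any provider’s actual API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_first_chunk_latency(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Consume a stream of audio chunks.

    Returns (seconds until the first non-empty chunk, total bytes received).
    The latency is None if the stream never yields any audio.
    """
    start = time.perf_counter()
    first_latency = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first_latency is None:
            first_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_latency, total_bytes


def fake_tts_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption)."""
    time.sleep(delay_s)  # simulated synthesis and network delay
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder audio bytes


if __name__ == "__main__":
    latency, size = measure_first_chunk_latency(fake_tts_stream())
    print(f"first audio after {latency * 1000:.0f} ms, {size} bytes total")
```

In a real test you would start the timer just before sending the request, so the measurement includes network round-trip time as well as synthesis time.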
🔊 WellSaid Labs
If you’re an enterprise, then WellSaid Labs could be for you! It offers a variety of high-quality AI voices, so your business can save time and money creating top-tier content. Boeing, Intel, and even Peloton already use it, and your company could be next in line to use the latest enterprise-level TTS technology.
One unique feature of WellSaid Labs is that you can control the tone, punctuation, and emphasis of your message manually, allowing you to essentially fine-tune the output without having to delve into the model weights themselves. So if you want greater agency over the output of your TTS model, WellSaid Labs has the product for you!
Learn more here.
Strengths: High customization ability, AI Avatars, Regionalization
Most Common Use Cases: Enterprise-level AI, Branded content, Marketing
🚀 OpenAI TTS
Of course, OpenAI has dipped their toes into the TTS world as well. In fact, you can find six different toes of theirs with a quick Google search. These six voices are named alloy, echo, fable, onyx, nova, and shimmer. You can check out the way they sound right here.
Currently, these voices are optimized for English; however, OpenAI’s TTS model generally follows the Whisper model in terms of language support. And with regard to streaming real-time audio, OpenAI specifically supports chunked transfer encoding.
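To make the streaming point concrete, here is a minimal sketch using only Python’s standard library. It assumes the speech endpoint, the `tts-1` model name, and the request fields as described in OpenAI’s API reference at the time of writing; `build_speech_request` and `stream_to_file` are hypothetical helper names, so check the current documentation before relying on any of it.

```python
import json
import urllib.request

# Assumption: endpoint and field names as documented by OpenAI at the time of writing.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice="alloy", api_key="", model="tts-1", audio_format="mp3"):
    """Build (but do not send) a POST request for synthesized speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,  # one of: alloy, echo, fable, onyx, nova, shimmer
        "response_format": audio_format,
    }
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_to_file(request, path, chunk_size=4096):
    """Send the request and write the audio to disk as it arrives.

    urllib decodes chunked transfer encoding transparently, so each read()
    returns audio bytes as soon as the server has sent them.
    """
    written = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


# Usage (requires a real API key and network access):
# req = build_speech_request("Turn left in 200 meters.", voice="nova", api_key="sk-...")
# print(stream_to_file(req, "speech.mp3"), "bytes written")
```

Writing to a file keeps the example simple; a real-time app would hand each chunk to an audio player instead of waiting for the full response.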
Overall, OpenAI has a fine model on their hands, so if you want a quick-and-easy start to coding with some sort of language model API, then check out OpenAI here.
Strengths: Optimized for English, Support for various formats (opus, aac, flac, etc.)
Most Common Use Cases: Narration, real-time streaming, in-app voices (ex: GPS)
Lovo AI not only offers 500+ text-to-speech voices in 100 languages, but their models can also evoke natural emotional expressions. If you need to create a realistic voiceover for a YouTube video or a video game, Lovo’s technology will work perfectly for you. Just type in your script, click “Generate,” and listen to the output speech!
Here’s a quick take from the Lovo team: “With its diverse range of customizable voices and accents, Text to Speech enables creators to deliver high-quality, engaging content that captivates their audience and elevates their videos to the next level.”
Thus, if you’re a content creator, Lovo should undoubtedly be a weapon in your arsenal. Check out this link to learn more.
Strengths: Over 100 languages available, 2nd highest number of unique voices out of all providers on this list
Most Common Use Cases: Voiceovers, videos, content creation
Would you rather listen to a PDF than read it? What about an email? Or even a really, really, really, really, really long text message? (Omg!)
If your answer to any of the above questions is “Yes,” then check out Speechify! With millions of downloads on Chrome, iOS, and Android, Speechify is certainly a titan in the text-to-speech industry. If you want to hear Snoop Dogg’s or Gwyneth Paltrow’s voice on command, just check out their landing page.
And if you want to hear celebrities speaking various foreign languages, download the app today. After all, there’s quite a good reason that Speechify has been featured in Forbes, Time, The Wall Street Journal, and The New York Times.
Give them a look here.
Strengths: Ease-of-use for individuals and teams, Offers celebrity and generic voices, speed enhancement
Most Common Use Cases: Productivity enhancement, entertainment, content creation
One stand-out feature of Murf is its diversity of voices. Whether your use case is for a creative purpose or a corporate setting, you’ll find a Murf voice suitable for you! They support over 20 languages with over 120 TTS voices. Not to mention, if you have existing media—from videos to music to images—you can upload them into Murf and sync up any content with an AI voice.
Likewise, you can modify pitch, emphasis, speed, and interjections as you wish. If you need your media to sound as entertaining or professional as possible, the power is in your hands with Murf.
Curious to learn more? Click here!
Strengths: 4th highest number of voices available of all companies on this list, content synchronization, ability to modify the output at a word-level
Most Common Use Cases: eLearning, advertising, edu-tainment, L&D, training
PlayHT creates extremely realistic voices, indistinguishable from humans. You can even hear the AI-generated voices “breathing” between sentences for a more natural feel. Furthermore, they provide over 800 voices in over 130 languages. And if there’s a particularly niche term, whether it’s new slang or deep medical vocabulary, you can customize the way the voice pronounces it.
Used by companies like Doordash, Hyundai, and Salesforce, their technology not only generates but also clones various voices. The clip above showcases their range, from Optimus Prime to Oprah Winfrey. And if you want to create an AI Podcast, PlayHT provides that service as well.
Click here to learn more!
Strengths: Offers the most voices of any provider on this list, can create custom AI voices, caters to both individuals and enterprises, includes various accents
Most Common Use Cases: Conversational AI, videos, narration, entertainment, advertising
(In a parrot voice) Squawk! Polly want a cracker!
As stated by Amazon themselves: “Amazon Polly uses deep learning technologies to synthesize natural-sounding human speech, so you can convert articles to speech. With dozens of lifelike voices across a broad set of languages, use Amazon Polly to build speech-activated applications.”
Amazon Polly supports thirty-seven languages and offers a variety of voices, such as Danielle, Gregory, and Ruth, making it an incredible tool.
Check out Amazon Polly here!
Strengths: Uses SSML tags, lifelike, 5 million characters free per month for 12 months
Most Common Use Cases: RSS feeds, websites, videos, app creation, e-learning, telephony
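Since SSML is listed as a Polly strength, here is a small sketch of what an SSML document looks like and how you might build one safely. SSML is a W3C standard, and the `<speak>`, `<prosody>`, and `<break>` tags used here are part of it; `build_ssml` is a hypothetical helper, and which tags a given voice honors is an assumption you should verify against Amazon’s documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences, pause_ms=400, rate="medium"):
    """Join sentences into one SSML document with a pause between each pair.

    Building the markup with ElementTree (rather than string concatenation)
    means characters like & and < in the text are escaped automatically.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    last_break = None
    for i, sentence in enumerate(sentences):
        if last_break is None:
            prosody.text = sentence  # text before the first <break>
        else:
            last_break.tail = sentence  # text following the previous <break>
        if i < len(sentences) - 1:
            last_break = ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml(["Squawk!", "Polly want a cracker!"], pause_ms=300, rate="slow"))
```

With the AWS SDK for Python, a string like this would typically be passed to Polly’s `synthesize_speech` call with `TextType="ssml"`, but treat that as a pointer to the official docs rather than a complete recipe.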
Google’s Text-to-Speech AI
Google’s TTS models were built on DeepMind’s speech synthesis expertise. With more than 380 voices across over fifty languages, you’ll undoubtedly be able to find the voice that works best for your next project.
Google also offers the option to create your own unique voice. Simply contact a member of their sales team, and they’ll be able to help you out with implementation. Long story short, if you have a set of audio recordings on-hand, you can use that data to train a custom voice model. The result is a Text-to-Speech AI personalized for you and/or your brand.
Strengths: 3rd highest variety of voices of all providers on this list, DeepMind-based, $300 in free credits upon signing up, customizability
Most Common Use Cases: Voice user interfaces, automated customer interactions
Microsoft Azure TTS AI
Microsoft’s text-to-speech voice—named Neural—is their free, out-of-the-box option which allows 500,000 characters’ worth of speech per month. However, much like Google, you can create a custom neural voice (aptly named “Custom Neural”) as well!
(Note: The video above showcases Microsoft Neural’s capabilities in 2020.)
The secret behind their natural-sounding AI? Well, as Microsoft says themselves, “Microsoft neural text to speech capability uses deep neural networks to overcome the limits of traditional speech synthesis regarding stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding outputs. Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz.”
Want to learn more? Check out this link!
Strengths: Natural-sounding, customizable output, flexible deployment
Most Common Use Cases: Marketing, advertising, vocal interfaces, entertainment, chatbots