Introducing Nova-3: Setting a New Standard for AI-Driven Speech-to-Text

Article · Announcements
By Jose Nicholas Francisco · Published Feb 12, 2025 · 7 min read

TL;DR

  • Nova-3 advances Deepgram's industry-leading accuracy, extending its capabilities to a broader range of real-world enterprise use cases and challenging audio conditions.

  • Nova-3 is the first voice AI model to offer real-time multilingual transcription. 

  • Our latest model is also the first to provide users with demonstrably effective and highly accurate self-serve customization—enabling instant vocabulary adaptation without model retraining.

  • Superior accuracy: Nova-3 delivers industry-leading performance with a 54.2% reduction in word error rate (WER) for streaming and 47.4% for batch processing compared to competitors.

  • Preferred for multilingual support: Deepgram was preferred over Whisper on 7 out of 7 languages tested—reaching as high as an 8-to-1 preference on certain languages.


Nova-3 Offers Best-in-Class Transcription Accuracy for Challenging Enterprise Use Cases

Nova-3 is our most advanced speech-to-text model to date, redefining the benchmarks for accuracy and performance. Building on the strengths of its predecessors, it extends state-of-the-art speech recognition to an even broader range of complex, real-world scenarios.

Cutting-edge speech-to-text tailored to your specific use case

Research has shown that background noise, reverberation, and other acoustic interferences degrade the quality of voice signals. As a result, achieving accurate Automatic Speech Recognition (ASR) in noisy environments remains one of the most challenging problems in Voice AI. For example, transcribing a conversation at a drive-thru requires an ASR model capable of filtering out background noise such as car engines, static, and even the chatter of children in the backseat to accurately capture the customer’s order.

Nova-3 achieves breakthrough accuracy in challenging acoustic environments through several key innovations in data and modeling. At its foundation is a sophisticated audio embedding framework that uses representation learning to project audio into a highly compressed and expressive latent space. This approach enables efficient identification and sampling of under-represented acoustic conditions in our training data. We paired this with advanced audio-text alignment techniques that allow us to train on highly adversarial examples that traditional approaches would typically discard. The model's robustness extends beyond noise handling to rare vocabulary, achieved through targeted data augmentation that projects specialized long-tail vocabulary datasets into realistic acoustic conditions.

As a result, Nova-3 is equipped with advanced capabilities to handle the following scenarios:

  • Challenging acoustic conditions – Maintains high accuracy even in environments with significant speaker-to-microphone distance, overlapping speech, and background noise, such as air traffic control, drive-thrus, and call centers.

  • Real-time multilingual transcription – Accurately processes conversations spanning multiple languages in real-time, a critical advancement for public safety communications, emergency response (e.g., 911 calls), and global customer support.

  • Enhanced numeric recognition – Improves the transcription of number sequences and numeric entities, supporting technical applications in domains such as retail, healthcare, and banking.

  • Domain-specific vocabulary – Recognizes long-tail, industry-specific, and business-specific terminology, ensuring higher accuracy in specialized contexts like medical or legal transcription.

  • Real-time redaction for up to 50 entities – Enables the secure and immediate removal of sensitive personal information in live conversations, a crucial feature for industries requiring compliance-driven transcription, such as finance and customer support.

  • Improved English formatting – As evidenced in our documentation on punctuation and paragraph structuring, Nova-3 greatly enhances transcription readability. 

  • Greater word-level timestamp precision – Our latest improvements also include higher precision on timestamps, so that users know exactly when the speaker(s) say a particular word and how long it takes them to say that word aloud.
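
Several of the capabilities above are exposed as options on Deepgram's `/v1/listen` endpoint. As a rough sketch of how a request might enable them (the parameter names `model`, `smart_format`, `redact`, and `diarize` follow Deepgram's documented options, but verify them against the current API reference before relying on this):

```python
from urllib.parse import urlencode

# Sketch: assembling a /v1/listen request URL that enables several of the
# capabilities described above. Parameter names are assumed from Deepgram's
# documented options; confirm against the current API reference.
BASE_URL = "https://api.deepgram.com/v1/listen"

params = {
    "model": "nova-3",       # select the Nova-3 model
    "smart_format": "true",  # improved English formatting (numbers, punctuation)
    "redact": "pci",         # redact sensitive entities, e.g. payment card data
    "diarize": "true",       # label speakers, pairs well with word-level timestamps
}

request_url = f"{BASE_URL}?{urlencode(params)}"
print(request_url)
```

The same query string works for both the pre-recorded HTTP endpoint and the streaming WebSocket endpoint.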

Let’s take a deeper look at one of the most groundbreaking enhancements in Nova-3: its ability to perform real-time multilingual transcription, a first in the industry.

Real-Time Multilingual Transcription: A First-of-Its-Kind Advancement

One of the breakthrough features of Nova-3 is its ability to process multilingual conversations in real time—an industry-first capability that unlocks a host of new possibilities for global operations.

Real-time multilingual speech recognition requires solving several fundamental challenges: learning to accurately transcribe words across multiple languages and detecting language switches within a single conversation, while contending with a severe scarcity of labeled training data for code-switched speech. Traditional approaches typically handle this through separate cascaded models or complex multitask frameworks that switch between different language-specific components. This creates computational overhead and can introduce latency in real-world applications, especially when speakers naturally alternate between languages at word level.

Nova-3 takes a fundamentally different approach by operating as a truly unified multilingual speech recognition system. The model is trained to naturally emit transcriptions that follow the speaker's language switches, without relying on explicit routing or language-specific mechanisms. This simplified architectural approach was enabled by a multi-stage training process that combines synthetic code-switched data at massive scale with carefully curated real-world datasets. This enables the model to maintain high transcription accuracy across languages while adapting to natural language transitions.

Deepgram serves a global user base, and we have developed language handling capabilities to match. Nova-3 can transcribe code-switching conversations in real time across 10 languages: English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. This capability represents a significant breakthrough for applications in global customer support, emergency response (e.g., 911 calls), multilingual meetings, retail interactions, and healthcare settings, where seamless, real-time transcription across languages is essential.

The impact of this advancement is particularly evident in high-stakes scenarios like emergency response. Imagine a 911 caller switching between Spanish and English while reporting a medical crisis. Traditional systems might lag or misinterpret key details, causing delays. Nova-3, however, can fluidly process such interactions in real time, ensuring that dispatchers receive accurate, immediate transcriptions without missing critical details.
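
In API terms, this unified design means multilingual transcription is requested with a single language setting rather than per-language routing. A minimal sketch, assuming Deepgram's documented `language=multi` option for Nova-3 streaming:

```python
from urllib.parse import urlencode

# Sketch: requesting real-time multilingual (code-switching) transcription.
# `language=multi` is assumed from Deepgram's documented Nova-3 usage;
# confirm in the API reference for your SDK version.
params = {
    "model": "nova-3",
    "language": "multi",  # follow the speaker's language switches mid-conversation
    "punctuate": "true",
}

streaming_url = f"wss://api.deepgram.com/v1/listen?{urlencode(params)}"
print(streaming_url)
```

A streaming client would open a WebSocket to this URL and send raw audio frames; interim and final transcripts arrive as JSON messages on the same connection.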

Deepgram vs. Whisper: Why Customers Prefer Deepgram for Multilingual Transcription

To evaluate transcription quality, we conducted a comparative study measuring user preference between Deepgram’s multilingual model and OpenAI’s Whisper. Based on over 200 audio samples, Deepgram was favored over Whisper in all seven languages tested. Notably, in certain languages, Deepgram achieved an 8-to-1 preference ratio (see Fig. 1), underscoring its superior performance in multilingual transcription.

The First Voice AI Model to Offer Self-Serve Customization

Nova-3 is the industry’s first voice AI model to enable self-serve customization, empowering users to fine-tune the model for specialized domains without requiring deep expertise in machine learning. 

Traditional voice AI models often demand extensive, expert-led retraining to adapt to specific industries or terminologies. This process is both time-consuming and costly, as it involves gathering large datasets, manual intervention, and repeated testing to ensure accuracy. Such retraining cycles can take weeks or even months, delaying deployment and driving up operational costs. Additionally, the reliance on expert knowledge means that businesses often face high consultant fees and dependency on specialized teams to implement and update the model.

One of the primary challenges in retraining speech recognition models on the fly is the need for high-quality, labeled data. Gathering high-quality, annotated data for new terms is resource-intensive, and without sufficient data, the model may fail to adapt effectively. Additionally, integrating new terms while maintaining the model’s accuracy on existing vocabulary requires a delicate balance. If not carefully managed, adding new terms can disrupt the model’s overall performance, leading to degradation in recognition accuracy across all words.

Nova-3 takes a fundamentally different approach, inspired by advances in large language models. Rather than relying on conventional language model post-processing, the model incorporates a trained contextual mechanism that enables in-context learning at inference time. This allows Nova-3 to dynamically incorporate arbitrary key terms as additional context during transcription, while maintaining high accuracy across its existing vocabulary.


With the addition of Keyterm Prompting, developers can now instantly improve transcription accuracy by fine-tuning up to 100 key terms critical to their specific use cases. This process allows for rapid optimization, ensuring that the model recognizes and transcribes domain-specific vocabulary with greater precision. For example, if a customer orders a 'Classic Buttery Jack Burger with Halfsie Fries,' or a doctor prescribes 'Clindamycin and Tretinoin,' Nova-3 will transcribe these terms accurately.

Keyterm Prompting reduces the need for model retraining or customization, empowering businesses to make immediate improvements without waiting for resource-intensive processes to complete.
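
At the request level, Keyterm Prompting amounts to passing the terms alongside the model selection. A sketch, assuming Deepgram's documented `keyterm` query parameter for Nova-3 (repeated once per term; check the API reference for exact limits and syntax):

```python
from urllib.parse import urlencode

# Sketch: supplying domain-specific vocabulary at request time via
# Keyterm Prompting. The repeated `keyterm` parameter is assumed from
# Deepgram's documented Nova-3 usage; verify against the current docs.
key_terms = ["Clindamycin", "Tretinoin", "Halfsie Fries"]

params = [("model", "nova-3")] + [("keyterm", term) for term in key_terms]
request_url = "https://api.deepgram.com/v1/listen?" + urlencode(params)
print(request_url)
```

Because the terms travel with each request, they can change per call, per customer, or per menu update, with no retraining step in between.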

Audio data is inherently dynamic, with terminology ranging from industry jargon to product names and technical terms. Deepgram is equipped to handle this variability, providing robust solutions tailored to your unique use case.

Nova-3 Sets New Benchmarks for Both Streaming and Batch Data

Beyond the introduction of new features, it is essential to highlight key performance metrics. Deepgram has long been the leader in streaming data accuracy since the launch of Nova-1 in 2023, and Nova-3 further extends this advantage, setting new industry benchmarks in transcription accuracy across both streaming and batch data.

To assess the model's accuracy, we employ the Word Error Rate (WER) metric. WER quantifies the number of errors in a transcription relative to every 100 words. For instance, if a transcription system generates 300 errors in a 1,000-word audio recording, the WER would be 30%. A lower WER signifies a higher degree of transcription accuracy.
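
The WER arithmetic above can be reproduced directly. Below is a minimal word-level edit-distance implementation; this is the generic textbook formulation of WER, not Deepgram's internal scoring pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference -> WER of 0.2 (20%)
print(word_error_rate("the quick brown fox jumps", "the quick brown fox jumped"))
```

By the same arithmetic, 300 errors against 1,000 reference words yields the 30% WER in the example above.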

To ensure a rigorous evaluation, we tested Nova-3 on a dataset comprising 2,703 audio files spanning ten distinct domains, totaling 81.69 hours of recorded data. These domains include:

  • Air Traffic Control (ATC)

  • Conversational AI (Con AI)

  • Drive-Thru

  • Finance

  • Medical

  • Meeting

  • Phone Call

  • Podcast

  • Video/Media

  • Voicemail

This diverse dataset reflects a broad range of real-world applications, ensuring a comprehensive assessment of model performance.

With this context in mind, we can now examine the specific performance numbers.

Boosting Streaming Accuracy with Nova-3


Nova-3 boasts a median WER of 6.84% on real-time audio streams from a dataset representing diverse real-world scenarios. This marks a 54.2% improvement over the next-best competitor (see Fig. 3), which has a median WER of 14.92%. In comparison, Nova-2 held only an 11% advantage over competing models, meaning Deepgram has significantly expanded its lead in streaming data accuracy. This enhanced performance ensures more reliable real-time transcription for applications such as call centers and virtual assistants, ultimately improving user experience and operational efficiency.

Nova-3’s Improved Accuracy for Batch Data Transcription

For pre-recorded audio (“batch data”) such as recorded medical dictations, court proceedings, and earnings calls, Nova-3 further extends Deepgram’s accuracy advantage. Specifically, Nova-3 achieves a median WER of 5.26% (see Fig. 4), a 47.4% improvement over the next-best competitor with a WER of 10%. This reduction in errors ensures more precise transcriptions for industries that demand high accuracy, such as healthcare, legal, and finance, reinforcing Deepgram’s leadership in batch data transcription.

In addition, Nova-3 maintains the industry-leading inference speed established by Nova-2, ensuring fast turnaround times for real-time and high-volume transcription use cases. In benchmark testing, Nova-3 demonstrated comparable latency to Nova-2, delivering rapid transcription with minimal delay. This means users can expect the same exceptional performance, with median inference times remaining among the fastest available—up to 40 times faster than competing diarization-enabled speech-to-text models.

Getting Started with Nova-3

Getting up and running with Nova-3 is quick and effortless. To try the model now, follow this link.

To access the model, use `model=nova-3` in your API calls. For more information, please refer to our API Documentation.
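
For pre-recorded audio, a request can be built with nothing but the standard library. A sketch, assuming Deepgram's documented `Authorization: Token <key>` header and JSON `url` body for the batch endpoint (the API key and audio URL below are placeholders):

```python
import json
import urllib.request

# Sketch: a pre-recorded (batch) transcription request with Nova-3.
# The auth header and request body shape are assumed from Deepgram's
# documented API; verify against the current API reference.
def build_transcription_request(api_key: str, audio_url: str) -> urllib.request.Request:
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        "https://api.deepgram.com/v1/listen?model=nova-3",
        data=body,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_transcription_request("YOUR_API_KEY", "https://example.com/audio.wav")
print(req.full_url)
# To send it: urllib.request.urlopen(req) returns the JSON transcript response.
```

Deepgram's official SDKs wrap this same request, so moving from this sketch to an SDK call is a one-to-one mapping of the parameters.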

Affordable AI Models for both Enterprise and Individual Developers

To compare the pricing between Nova-3 and other providers, refer to the graph below. Note that Deepgram is extending the accuracy frontier while bending the cost curve. Users, therefore, gain access to cutting-edge transcription AI without having to spend more money.

Nova-3 remains the highest-performing and most affordable streaming model on the market for English and multilingual audio. Starting at $0.0077 per minute of streaming audio (see Fig. 5), Nova-3 is over 2x more affordable than cloud providers while offering a 53% reduction in WER (based on current listed pricing).

Leading the Future of Enterprise Voice AI

Deepgram’s research and engineering teams are committed to advancing the field of conversational AI, enabling enterprise use cases through a comprehensive suite of cloud and self-hosted speech-to-text (STT), text-to-speech (TTS), and full speech-to-speech (STS) APIs.

Beyond its core models, Deepgram provides a full-featured developer platform with a high-performance runtime designed for automation and scalability. The platform includes advanced capabilities such as synthetic data generation, model curation, model hot-swapping, and robust integrations—allowing developers to efficiently build and optimize voice-enabled applications. Continuous improvements to models and infrastructure ensure users benefit from the latest advancements, maximizing long-term value. With low customer COGS, Deepgram’s platform offers seamless updates and scalability, helping businesses stay competitive and future-proofed as they grow.

With over 50,000 years of audio processed and 1 trillion words transcribed, Deepgram has powered voice-enabled applications for industry leaders such as Citi, Vapi, Groq, Twilio, and Spotify—while also transcribing NASA’s communications between the ISS and Mission Control. 

Deepgram's advancements have driven innovation across diverse fields, from enhancing speech recognition in healthcare to transforming contact center AI and even powering voice technology in gaming. Language is fundamental to AI’s ability to enhance human-computer interactions, and Deepgram focuses on advancing AI-driven communication to improve understanding and connectivity across industries.

Unlock language AI at scale with an API call.

Get conversational intelligence with transcription and understanding on the world's best speech AI platform.