
Article·AI Engineering & Research·Jun 23, 2025

Model Comparison: When to Use Nova‑2 vs Nova‑3 (for Devs)

This article is more of a reference than a blog. Use it the same way you’d use a dictionary or encyclopedia: you don’t have to read all the way through. Just skim it for the stats, code snippets, or documentation links most relevant to you!
By Stephen Oladele

How to read this guide

  1. This article is more of a reference than a blog. Use it the same way you’d use a dictionary or encyclopedia.

  2. Skim the TL;DR if you need a 30-second answer.

  3. Dive into the Decision Framework for a branch-by-branch rationale.

  4. Check Benchmarks & Cost Graphs to validate your gut feel with numbers.

  5. Grab the code snippets to trial the model in under five minutes.

⏩ TL;DR – 30-Second Cheat Sheet

  • Choose Nova-2 for English-only, low-cost, high-speed batch transcription.

  • Choose Nova-3 for real-time, multilingual, or use cases that require very specific vocabulary. (Note: Only Nova-3 offers keyterm prompting.)

Nova-3 is the latest and greatest speech-to-text model we’ve made thus far. And while Nova-2’s wider range of monolingual, domain-specific models temporarily performs better for specific use cases today, Nova-3 will be the only model receiving focused improvements on those use cases.

🔗 Try both models in the API Playground

Use this article as a reference


Just skim the headings to find the most relevant information for your particular use case or implementation. Below, you can find helpful links to certain documentation, copy code snippets at a moment’s notice, and compare stats across our models.

Why Model Choice Matters When Choosing Speech-to-Text (STT) APIs for Your Apps

You’ve just shipped a killer voice feature—only to discover that your transcription layer chokes on accents, glitches in noisy cafés, or can’t keep up with code-switching callers.

Minutes of latency or a single mis-transcribed proper noun can tank the entire user experience—or worse, your bottom line.

Modern speech-to-text APIs feel interchangeable until you run them at scale. Voice stacks live inside hard latency budgets (think <300 ms end-to-end) and ever-tighter error tolerances (think <5% WER for regulated domains). Model choice now dictates:

  • User trust: A 1-second lag in a checkout flow drops conversion by ~7 %.

  • Infrastructure spend: Up to 40× cost delta between legacy ASR and next-gen models.

  • Feature velocity: Self-serve customization (e.g., Keyterm Prompting) lets teams adapt in hours, not weeks.

What’s at stake when comparing modern speech-to-text APIs?

If you use some random, lackluster speech-to-text API out of the box, you can run into obstacles ranging from garbled transcripts on accented or noisy audio to latency spikes and runaway per-minute costs.

Choosing the wrong model is rarely fatal on Day 1—but at 100M audio minutes or 10k concurrent streams, the compound cost (and brand impact) is real.

Deepgram’s Nova models dodge these problems in their own unique ways.

This guide is your field manual:

  • Data-backed decision framework that maps common voice workloads to the optimal Nova model.

  • Benchmarks that matter (WER, turnaround time, €/$ per million words) pulled from 2025 test runs across nine audio domains (Air Traffic Control, Conversational AI, Drive-Thru, Finance, Medical, Meeting, Phone Call, Podcast, Video/Media, Voicemail).

  • Copy-paste code for batch and streaming calls—plus tips for fallback logic and cost caps.

  • Visual workflows that show exactly where model choice sits in a production voice stack.

By the end of this article, you'll clearly understand the trade-offs to help you deploy the best-fitting STT model in your dev console. 

Whether your scenario calls for Nova-2's affordability and blazing batch speed, or Nova-3's multilingual real-time precision and customization, you'll know exactly which model to choose so your voice product ships faster, sounds sharper, and operates more efficiently.

Overview of Deepgram's Nova Model Line

If you’ve browsed our docs or checked out the Playground, you might be asking: “Why does Deepgram have two flagship models named Nova?” Good question—and understanding their origins and intended strengths will clarify your decision-making process.

Why Two Nova Models Exist

Developers have long been asking Deepgram for two contradictory things:

  1. “Make it cheaper and faster for my daily workloads.”

  2. “Make it impossibly accurate in messy, multilingual conditions.”

Instead of forcing a compromise through a single “do-it-all” model, the research team split the roadmap into two, tuning each branch for a different sweet spot—so you only pay for what you actually need.

Nova-2: Balanced Performance for High-Scale Voice Apps

Released initially as a step-up from earlier Nova architectures, Nova-2 quickly became the go-to for developers building scalable voice solutions—particularly those who process massive volumes of English audio at batch scale or streaming at low latency.

What you should know:

Launch: Early-access November 2023; GA mid-2024.

Core win: Speed-per-dollar—batch inference ≈ 29.8 s/hr with diarization (fastest we’ve measured); $0.0043/min list price (batch).

Accuracy sweet spot: English (and 6 additional languages added in late 2024) with median WER ≈ 8.4 % on real-world data sets.

Hidden super-power: Because Nova-2’s weights are smaller, containers spin up ~25 % faster in on-prem deployments.

Nova-3: Optimized for Complex, Multilingual, and Customized Audio

Building on Nova-2’s strong foundation, Nova-3 introduced dramatic improvements specifically designed for demanding environments. 

It solves previously unsolvable issues, especially where accurate, real-time multilingual transcription, complex acoustics, or highly specialized vocabulary are non-negotiable.

What you should know:

Launch: General Availability (GA) February 2025; multilingual GA April 2025.

Core win: Accuracy under chaos—54 % lower streaming WER vs competitors; first ASR to handle live code-switching across 10 languages.

Hidden super-power: Performs PII redaction on up to 50 entities in real time—critical for finance and healthcare compliance.

Key features:

  • True code-switch across 10 languages (EN, ES, FR, DE, HI, RU, PT, JA, IT, NL) in a single, real-time stream.

  • Keyterm Prompting—inject up to 100 domain terms, up to 6× lift on domain vocabulary without retraining.

  • Domain variants (Nova-3-Medical) for clinical vocab.

  • 6.84 % median WER (streaming, nine domains).

  • Parity latency with Nova-2 (< 300 ms).

Cost note—streaming starts at $0.0077/min (batch tier $0.0066/min). Balance accuracy needs against budget.

🌇 Deepgram’s Nova Models Timeline

To help you visualize these differences clearly, here’s the timeline of Deepgram’s Nova series evolution, highlighting each model’s strengths and suggested application scenarios:

Which Model Is Right for You? Nova-2 or Nova-3?

Voice features live or die by milliseconds and accuracy percentage points. Deepgram’s Nova line gives dev teams a choice between two specialized models:

💸 Nova-2

  • Built for: Blazing batch speed and affordability for English-heavy audio

  • Key trade-off: Prioritizes ultra-low cost and inference speed over advanced real-time features.

🌍 Nova-3

  • Built for: Superior real-time accuracy, robust multilingual transcription, self-serve customization

  • Key trade-off: Slightly higher price & model size to unlock smarter features

But which one should you call first?

30-Second Decision Framework in a Video:

Nova-2: When to Use It

Nova-2 is optimized for speed, affordability, and simplicity. It’s ideal if your use case emphasizes large-scale processing at controlled costs.

✅ Optimal for:

  • Latency-sensitive English streams, such as conversational voicebots and gaming chat.

  • Bulk captioning and meeting-minute pipelines where $$$/min rules (media archives, podcasts, meeting summaries).

  • Cost-optimized pipelines where sub-10% WER suffices and the cloud bills are under the CFO's (Chief Financial Officer) microscope.

🧪 Example Use Case: Podcast Captioning with Latency and Budget Constraints

Podcast production teams captioning 1,000+ hours of long-form interviews at under $0.0045/min without latency concerns. Nova-2’s batch speed (29.8 s/hr with diarization) means your production pipeline stays fast and affordable—perfect for large volumes of content without budget overruns.

🛠️ Recommended: Parse Podcasts With Python: Understanding Lex Fridman’s Podcast With Deepgram ASR And Text Analysis

Quick Test - Nova-2 (Batch English)
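A minimal batch request against the hosted REST endpoint, using only the Python standard library (a sketch: the `DEEPGRAM_API_KEY` environment variable and the sample audio URL are placeholders you supply yourself):

```python
import json
import os
import urllib.parse
import urllib.request

API_URL = "https://api.deepgram.com/v1/listen"

def build_url(model: str, **params: str) -> str:
    """Assemble a /v1/listen URL with query parameters."""
    query = urllib.parse.urlencode({"model": model, **params})
    return f"{API_URL}?{query}"

def transcribe_remote(audio_url: str, model: str = "nova-2") -> str:
    """Transcribe a hosted audio file and return the plain transcript."""
    req = urllib.request.Request(
        build_url(model, smart_format="true"),
        data=json.dumps({"url": audio_url}).encode("utf-8"),
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The transcript lives at results.channels[0].alternatives[0].transcript
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]

# Usage (requires a valid DEEPGRAM_API_KEY):
#   print(transcribe_remote("https://example.com/podcast-episode.wav"))
```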

The transcript text comes back in the JSON response under results.channels[0].alternatives[0].transcript.

Pro tip: Nova-2 has specialized model options (e.g., phonecall, meeting, voicemail) that you can select with the syntax model=nova-2-{option}. See all the specialized models in the docs.

🔗 Use Nova-2 in Deepgram Playground.

Other sweet-spot use cases:

  • Large-scale media captioning

  • Async call-center QA analytics

  • Cost-capped meeting summarization pipelines. 

Nova-3: When to Use It

Nova-3 is your go-to when accuracy can’t falter. It's specifically designed to handle complexity—noisy environments, multilingual interactions, and critical domain-specific contexts.

✅ Optimal for:

  • High-stakes, regulated domains (medical, legal transcription) chasing <6 % WER.

  • Noisy or far-field audio (drive-thru ordering systems, call centers, body-cam, ATC).

  • Multilingual CX where callers switch languages mid-sentence.

  • Fast iteration of brand or jargon vocabulary recognition without model retraining (Keyterm Prompting).

🧪 Example Use Case: Medical Transcription at 95%+ Entity Accuracy

When clinicians rely on accurate patient notes, even small errors risk compliance and patient safety. 

Nova-3, optimized for medical vocabulary via Keyterm Prompting and noise reduction, ensures notes are precise enough for sensitive healthcare documentation.

🏥 See Also: Introducing Nova-3 Medical: The Future of AI-Powered Medical Transcription

Quick Test - Nova-3 (Batch English)
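For local files you can post raw audio bytes instead of a hosted URL. Another standard-library sketch (the file name, model variant, and API key are placeholders):

```python
import json
import os
import urllib.request

AUDIO_TYPES = {"wav": "audio/wav", "mp3": "audio/mpeg", "flac": "audio/flac"}

def content_type(path: str) -> str:
    """Map a file extension to a Content-Type header value."""
    ext = path.rsplit(".", 1)[-1].lower()
    return AUDIO_TYPES.get(ext, "application/octet-stream")

def transcribe_file(path: str, model: str = "nova-3") -> str:
    """Upload a local audio file to /v1/listen and return the transcript."""
    with open(path, "rb") as f:
        audio = f.read()
    req = urllib.request.Request(
        f"https://api.deepgram.com/v1/listen?model={model}&smart_format=true",
        data=audio,
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": content_type(path),
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]

# Usage (requires a valid DEEPGRAM_API_KEY):
#   print(transcribe_file("clinic-dictation.wav", model="nova-3-medical"))
```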

The transcription comes back in the same JSON shape, under results.channels[0].alternatives[0].transcript.

Pro tip: Use model=nova-3&language=multi in your API calls. And add smart_format=true to auto-punctuate and clean entities without extra post-processing.

Nova-3 has the following model options, selected with the syntax model=nova-3-{option}:

  • general: Optimized for everyday audio processing.

  • medical: Optimized for audio with medical-oriented vocabulary.

Python example of Nova-3 (Real-time, Multilingual):
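A sketch of a live multilingual stream over the websocket endpoint. This assumes the third-party websockets package; you supply the audio source, and note the connect header keyword has changed names across websockets releases:

```python
import asyncio
import json
import os
import urllib.parse

WS_BASE = "wss://api.deepgram.com/v1/listen"

def build_stream_url(model: str = "nova-3", language: str = "multi") -> str:
    """Build the websocket URL for live, code-switching transcription."""
    query = urllib.parse.urlencode(
        {"model": model, "language": language, "smart_format": "true"}
    )
    return f"{WS_BASE}?{query}"

async def stream_transcripts(audio_chunks):
    """Send 50-100 ms audio chunks; print transcripts as they arrive."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(
        build_stream_url(),
        # named extra_headers in older websockets releases
        additional_headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
    ) as ws:
        async def sender():
            for chunk in audio_chunks:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))  # flush & finish

        async def receiver():
            async for message in ws:
                alt = json.loads(message).get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    print(alt["transcript"])

        await asyncio.gather(sender(), receiver())

# Usage: asyncio.run(stream_transcripts(my_mic_chunks))
```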

Example of Curl (streaming, multilingual):

🔗 Use Nova-3 in Deepgram Playground.

Other sweet-spot use cases:

  • Live contact-center agent assist; Nova-3 real-time + Keyterm Prompting instantly recognizes product SKUs.

  • Drive-thru order capture or far-field mics; Nova-3’s noise-robust encoder & numeric entity boost.

  • IVR in noisy environments (EN⇄ES⇄FR).

  • Regulated verticals needing sub-6 % WER + PII redaction.

Per-segment routing in long calls

Because Deepgram bills by time × tier, you can segment a single long audio file:

  1. Detect noisy or multilingual segments → send to Nova-3.

  2. Detect clean mono-English segments → send to Nova-2.

The result: super-human accuracy where you need it, base-tier cost where you don’t—all under one API key, without vendor lock-in gymnastics.
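Those two routing rules can be sketched as a simple dispatcher. The SNR threshold and per-segment language labels here are assumptions; wire in whatever VAD and language-ID signals your pipeline already produces:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float   # seconds into the call
    end: float
    snr_db: float  # noise estimate from your audio front end
    languages: set = field(default_factory=lambda: {"en"})  # language-ID labels

def route_model(seg: Segment, noise_floor_db: float = 15.0) -> str:
    """Pick the cheapest Nova tier that can handle this segment."""
    if seg.languages - {"en"}:       # any non-English speech -> multilingual Nova-3
        return "nova-3"
    if seg.snr_db < noise_floor_db:  # noisy audio -> Nova-3's noise-robust encoder
        return "nova-3"
    return "nova-2"                  # clean mono-English -> cheapest tier

segments = [
    Segment(0.0, 42.5, snr_db=24.0),                          # clean English
    Segment(42.5, 61.0, snr_db=9.5, languages={"en", "es"}),  # noisy code-switch
]
routes = [route_model(s) for s in segments]  # ["nova-2", "nova-3"]
```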

Benchmarks and Performance Tests That Matter

While features and pricing draw attention, benchmarks close the deal. Deepgram publishes full, reproducible test suites. 

Below are the headline metrics devs care about—accuracy, latency, customization lift, language reach, and cost per minute—with links to raw data so you can audit the numbers.

Accuracy Benchmarks

Transcription accuracy is typically measured using Word Error Rate (WER)—the lower, the better.

Deepgram evaluates models on real-world datasets, not just pre-cleaned speech corpora.

Out of every additional 15 words that Azure would misrecognize in real-time traffic, Nova-2 transcribes ~6 correctly.

Nova-3 fixes 1 out of every 2 errors the best cloud vendor still makes on live audio. Regarding user-experience upside, this means crisper live captions, fewer “say-that-again” IVR loops, and higher agent-assist confidence scores.

Nova-2 trims ~4 errors per 100 words compared with the best non-Deepgram vendor on long-form audio.

In terms of practical impact, this means cleaner transcripts for downstream NLP and LLM pipelines, as well as less manual QC (quality checks) in media captioning and compliance archives.

Nova-3 cuts the error count almost in half on long-form or asynchronous jobs. The user-experience upside here is cleaner transcripts → lower human QC time → better downstream agent/LLM accuracy.

Rule of Thumb

  • Use Nova-2 when cost-per-minute and throughput dominate, and you can tolerate WER in the ~6–9 % band.

  • Step up to Nova-3 when multilingual, noise, or keyterm recall push you toward the 5–7 % WER frontier (and the budget allows).

Latency and Speed

Lower latency means transcripts arrive faster. On inference turnaround time, Nova-2 remains the fastest diarization-enabled model on the market, particularly for high-volume batch jobs.

Nova-3’s larger stack (multilingual + keyterm prompting/customization logic) adds <10 ms of overhead—negligible for UI flows.

Rule of Thumb

  • Async captions, podcast ingest, daily meeting dumps? Nova-2. Tip: parallelize 10–20 files per worker to saturate bandwidth.

  • IVR, agent-assist, drive-thru ordering? Nova-3 (if you need multilingual or keyterms), else Nova-2.

  • Any path <150 ms from mic to UI? Disable diarization (diarize=false) to cut turnaround time (TAT) by ~3–4×, stream 50–100 ms frames, and both Novas clear the bar.

Put simply: Pick Nova-2 for sub-second interactivity or ultra-cheap mass processing; flip to Nova-3 when users, regulators, or multilingual noise demand the extra brains—and your SLA can spare the extra few dozen milliseconds.

Complexity and Customization

Nova-3 allows dev teams to instantly boost accuracy for critical domain terms by injecting up to 100 custom keywords (e.g., brand terms, medical jargon) without needing expensive retraining. 

For example, if a customer orders a 'Classic Buttery Jack Burger with Halfsie Fries,' or a doctor prescribes 'Clindamycin and Tretinoin,' Nova-3 will transcribe these terms accurately.

Here's where Nova-3’s advanced architecture shines:

Pass up to 100 domain-specific terms at inference time; see up to 625% uplift in correct entity recognition (Talkatoo vet-tech case study).

Add up to 100 domain terms at request time:
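One way to do that is to repeat the keyterm query parameter when building the request URL (a sketch; the term list is illustrative):

```python
import urllib.parse

MAX_KEYTERMS = 100  # Keyterm Prompting accepts up to 100 terms per request

def listen_url_with_keyterms(keyterms, model: str = "nova-3") -> str:
    """Build a /v1/listen URL with one repeated keyterm param per term."""
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms per request")
    params = [("model", model), ("smart_format", "true")]
    params += [("keyterm", term) for term in keyterms]
    return "https://api.deepgram.com/v1/listen?" + urllib.parse.urlencode(params)

url = listen_url_with_keyterms(["Clindamycin", "Tretinoin", "Halfsie Fries"])
```

Keyterm Prompting is a Nova-3 feature; on Nova-2, vocabulary boosting goes through the older keywords parameter instead.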

Rule of Thumb

  • If you need the transcript to adapt on the fly—key terms, multilingual code-switch, or industry-specific variants—reach for Nova-3.

  • If plain-vanilla English accuracy is enough, stick with the plug-and-play Nova-2.

Language Support

Deepgram serves a global user base, and as such, we have developed language handling capabilities that are equally versatile and adaptive.

Nova-3’s real-time multilingual engine is uniquely capable of seamlessly transcribing code-switching conversations (e.g., English↔Dutch) mid-stream without interruption across 10 languages. 

Nova-2’s multilingual support is currently limited to English↔Spanish.

Languages Nova-3 supports for real-time multilingual code-switching during conversations:

English (EN), Spanish (ES), French (FR), German (DE), Portuguese (PT), Hindi (HI), Russian (RU), Japanese (JA), Italian (IT), Dutch (NL)

Rule of Thumb

  • Stick with Nova-2 for mono-lingual pipelines; the moment you need two languages or code-switching, upgrade the endpoint to Nova-3.

Cost and Pricing

While performance metrics like accuracy and latency often dominate technical discussions, your finance team will quickly point out: at production scale, price per audio minute matters—sometimes critically.

Deepgram prices each Nova tier “by the audio minute.” To make apples-to-apples comparisons easier, the table below also shows an estimated cost per one million English words (assuming 160 words-per-minute conversational speech¹).

¹If your domain averages a slower 120 wpm (e.g., formal dictation), multiply the $/1 M words column by 1.33.

Calculating Monthly Costs at Production Scale

Imagine your application processes 1,000 audio hours per month. Here’s how the monthly totals stack up:

👉 Takeaway: Nova-2 saves roughly $138–$204 per 1,000 hours relative to Nova-3, depending on whether you use batch or streaming transcription.
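The arithmetic behind that takeaway, using the list prices quoted earlier ($0.0043/min Nova-2 batch, $0.0066/min Nova-3 batch, $0.0077/min Nova-3 streaming). A Nova-2 streaming rate isn’t quoted here, so the high end of the range compares against Nova-2’s batch rate:

```python
RATES_PER_MIN = {  # list prices quoted above, $ per audio minute
    "nova-2-batch": 0.0043,
    "nova-3-batch": 0.0066,
    "nova-3-streaming": 0.0077,
}

def monthly_cost(audio_hours: float, tier: str) -> float:
    """Monthly spend in dollars for a given volume and pricing tier."""
    return round(audio_hours * 60 * RATES_PER_MIN[tier], 2)

hours = 1_000
batch_savings = monthly_cost(hours, "nova-3-batch") - monthly_cost(hours, "nova-2-batch")
streaming_savings = monthly_cost(hours, "nova-3-streaming") - monthly_cost(hours, "nova-2-batch")
# batch_savings == 138.0, streaming_savings == 204.0 -> the "$138-$204" spread
```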

Your Cost-Optimization Checklist

  • Estimate your monthly audio volume (minutes or hours).

  • Estimate average words per minute based on domain (typical conversational: 120–160 wpm).

  • Calculate monthly spending at various usage tiers to identify breakpoints.

  • Consider hidden costs of errors—customer churn, regulatory penalties, manual correction—when comparing Nova-2 vs Nova-3.

  • Trial both models on your own audio data at scale—track real-world error rates, customer complaints, and human rework costs.

Rules of Thumb:

  • For a high-volume, batch-intensive workload (e.g., podcast transcription, large video archives), Nova-2 can significantly trim your monthly cloud spend.

  • Batch jobs longer than 10 min can still request Nova-3 for niche domains; budget for that premium where accuracy is contractual.

  • For mission-critical real-time scenarios (contact centers, multilingual customer service), Nova-3’s slightly higher per-minute rate buys essential accuracy and feature upgrades, potentially saving you more through lower call escalations or improved CX.

  • Tier routing lets you mix models intelligently in prod: route Nova-3 only to calls flagged multilingual/noisy; keep Nova-2 everywhere else.

Prices are current as of May 2025. Check the Deepgram pricing page for updates.

Production Checklist for Devs Shipping Deepgram’s Speech-to-Text (STT) Models

Even the best STT model can stumble if you skip production-grade hygiene. Before you ship voice pipelines to thousands (or millions) of users, sanity-check your workflow with this battle-tested checklist:

🚦 API Reliability & Scaling

✅ Retry with exponential back-off on HTTP 429 (Too Many Requests) and 5xx (Server Error) responses.
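A sketch of that retry loop, standard library only. The base delay and cap are tunables, and jitter is applied at sleep time:

```python
import random
import time
import urllib.error
import urllib.request

RETRYABLE = {429, 500, 502, 503, 504}

def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Exponential delays: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

def request_with_retries(req: urllib.request.Request, attempts: int = 5):
    """Retry 429/5xx responses with jittered back-off; raise anything else."""
    for delay in backoff_schedule(attempts):
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE:
                raise                                # client errors are not retryable
            time.sleep(delay * random.random())      # full jitter
    raise RuntimeError("retries exhausted")
```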


✅ For the lowest real-time latency, send smaller audio chunks (chunk_size ≈ 100 ms); smaller chunks keep end-to-end UX responsiveness under 300 ms.

🛠️ Hybrid Traffic & Model Fallback

✅ For hybrid workloads, implement auto-fallback from Nova-3 → Nova-2 on language mismatch or temporary errors (e.g., ENG_ONLY fallback).
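One way to sketch that fallback. The TranscriptionError class and its error-code names are hypothetical stand-ins for however your own wrapper surfaces API failures:

```python
class TranscriptionError(Exception):
    """Hypothetical error your /v1/listen wrapper raises on failure."""
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code

FALLBACK_CODES = {"ENG_ONLY", "MODEL_UNAVAILABLE"}  # assumed code names

def transcribe_with_fallback(transcribe, audio_url: str):
    """Try Nova-3 first; drop to Nova-2 on language/availability errors."""
    try:
        return transcribe(audio_url, model="nova-3"), "nova-3"
    except TranscriptionError as err:
        if err.code in FALLBACK_CODES:
            return transcribe(audio_url, model="nova-2"), "nova-2"
        raise  # anything else is a real failure; surface it
```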

When can you skip the fallback entirely?

If you’re happy with Nova-3’s multilingual coverage (10 languages), just call `?model=nova-3&language=multi` and you won’t see the “English only” error, so no fallback is needed.

🔑 Efficient Keyterm Management

✅ Cache keyterm lists per campaign or customer to avoid rebuilding strings on every call. Cut redundant CPU cycles and keep API latency low.

📉 Quality Monitoring (Accuracy & Latency)

  • ✅ Log and review WER deltas weekly. Use Deepgram console insights to trigger model upgrades when your WER goals shift.

  • ✅ Actively monitor API latency & error rates via your observability stack (e.g., DataDog, Grafana).

🧪 Pre-Launch A/B Tuning

✅ Run model comparisons in Deepgram’s API Playground to sanity-check your chosen parameters (keyterms, formatting flags, model tiers) before deployment.

📚 Transcript Formatting & Redaction

✅ Always use smart_format=true to get clean, properly punctuated transcripts out-of-the-box. This significantly reduces NLP post-processing load.
✅ Also, enable redaction ("redact": ["ssn", "phone"]) when handling sensitive data (PII, PHI).

Deepgram lets you pass one or more redact= parameters in the URL, e.g.:
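Repeating the parameter once per entity class looks like this when you build the URL programmatically (a sketch; the entity names follow the example above):

```python
import urllib.parse

def listen_url_with_redaction(entities, model: str = "nova-3") -> str:
    """Append one redact= parameter per entity class to /v1/listen."""
    params = [("model", model)] + [("redact", e) for e in entities]
    return "https://api.deepgram.com/v1/listen?" + urllib.parse.urlencode(params)

url = listen_url_with_redaction(["ssn", "phone"])
# -> https://api.deepgram.com/v1/listen?model=nova-3&redact=ssn&redact=phone
```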

🚨 Heads Up: Fixing a data-leak incident is far more expensive than the tiny latency hit of server-side redaction. Deepgram doesn’t charge extra for using redact, and the performance impact is negligible for most workloads.

The list above distills real-world lessons from teams running billions of audio minutes through Deepgram. Tackle each category above, and you’ll avoid 90% of the “why did latency spike at 2 a.m.?” Slack pings.

Ship it once—sleep at night.

Next Steps: When to Use Nova-2 vs Nova-3 (for devs)

You’re now equipped with the knowledge, benchmarks, and practical guidance to confidently choose between Deepgram’s Nova-2 and Nova-3 models. Here's how you turn this knowledge into immediate action:

Have questions or need help selecting a model for your specific use-case? Our product experts can guide you directly.

👉 Talk to a Product Expert

Resources

🌍 Try Nova-3 with Multilingual Audio

📜 Read Nova-2 Announcement

🌐 Read Nova-3 Announcement
