The conventional wisdom that "ElevenLabs handles voice synthesis and Deepgram handles transcription" no longer describes the full decision space. With Scribe v2 Realtime, ElevenLabs transcription now competes directly with Deepgram's Speech-to-Text API on the same production criteria: accuracy, latency, concurrency, and compliance. At 10,000 hours per month, the cost difference between the two platforms is meaningful, but list-price comparisons tell only part of the story until you factor in deployment architecture, retention policies, and concurrency limits. This article covers what changed and how to pick the right one without running a six-month bake-off.
Key Takeaways
If you're choosing between ElevenLabs Scribe v2 and Deepgram Nova-3 for production speech-to-text, focus on these decision points:
- If you need real-time speaker labeling, ElevenLabs' realtime offering is a non-starter today.
- If you need on-premises or private-cloud deployment, Deepgram is the practical option.
- If your accuracy depends on niche terminology (especially alphanumerics), plan for customization or rigorous keyterm testing.
- Price only stays predictable when your concurrency, retries, and retention requirements match the vendor's default assumptions.
What Changed When ElevenLabs Launched Scribe v2
ElevenLabs transcription is now a real STT option for engineering teams, but the product split (batch vs. realtime) creates different architectural constraints depending on your use case.
Scribe v2 vs. Scribe v2 Realtime: Two Different Use Cases
The distinction matters more than the naming suggests. ElevenLabs' STT docs describe a batch API designed for long-form audio processing (subtitles, archives, and compliance recordings) with speaker diarization (up to 32 speakers per the official documentation), word-level timestamps, and structured token metadata. Realtime targets live conversations with WebSocket streaming, but it does not support speaker diarization.
That limitation is more than cosmetic. If your downstream system needs "who said what" during a live call—agent coaching, dispute review, escalations, or compliance monitoring—you either accept that speaker labels arrive later (post-call), or build a parallel diarization track with a separate model, then align it to the transcript.
How Deepgram Nova-3 Is Positioned Against Scribe
Deepgram's Nova-3 is positioned around production STT workloads: streaming and batch parity, higher concurrency ceilings, and deployment flexibility (cloud, private environments, hybrid, or on-prem).
Where the practical gap shows up most often is not a single headline accuracy score, but whether your team can keep the same core transcription behavior as you move from prototype to regulated production—retention controls, networking constraints, and traffic spikes included.
Accuracy Under Real-World Audio Conditions
You should assume vendor-reported accuracy won't predict your production results. The only reliable comparison is a proof-of-concept on your own audio, with your own edge cases.
How Each Platform Performs on Telephony and Background Noise
There are no independent academic benchmarks directly comparing Scribe v2 and Nova-3 as of 2026, and controlled tests rarely match production audio paths. For telephony-grade 8kHz audio with codec artifacts and background noise, you need to test against your actual call recordings—not "clean" sample clips.
If you're doing a production go or no-go, don't treat STT as a single-number evaluation. Telephony deployments often fail on behavioral details:
- Partial hypothesis behavior (how quickly and how steadily partial text arrives)
- Endpointing (whether the model finalizes too aggressively and clips short responses)
- Crosstalk handling (what happens when two people overlap)
Those issues drive downstream metrics like containment, average handle time, and escalation rate.
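In a POC harness, these behaviors can be captured from the event stream itself rather than scored by hand. A minimal sketch, assuming a generic event shape (time plus an is-final flag) rather than any vendor's SDK:

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class StreamBehavior:
    """Collects timing behavior from one live STT stream during a POC.

    Call on_event() with the wall-clock time of each transcript event.
    The event shape here is an assumption; adapt it to the SDK under test.
    """
    partial_times: list = field(default_factory=list)
    final_times: list = field(default_factory=list)

    def on_event(self, t: float, is_final: bool) -> None:
        (self.final_times if is_final else self.partial_times).append(t)

    def partial_cadence(self):
        """Median gap between successive partial updates, in seconds."""
        gaps = [b - a for a, b in zip(self.partial_times, self.partial_times[1:])]
        return statistics.median(gaps) if gaps else None
```

Logged per call, the same structure also lets you spot finalization that arrives too early (clipped short responses) or partials that stall during crosstalk.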
Domain-Specific Terminology: Medical, Legal, and Contact Center Jargon
This is where the platforms diverge structurally.
Deepgram offers customization options that can materially reduce Word Error Rate—including transfer learning and enterprise customization services—alongside self-serve adaptation via Keyterm Prompting for Nova-3. It's the difference between "it gets the gist" and "it reliably captures the tokens your automation depends on." ElevenLabs does not offer custom model training for Scribe; the practical tool you get is runtime keyterm prompting and entity detection.
The Five9 case study is a useful reminder that "domain" isn't only medical terms or legal citations. Five9 integrated Deepgram into their Intelligent Virtual Agent platform specifically because generic STT failed on alphanumeric inputs—order numbers, tracking IDs, member IDs, and policy numbers. In real-world tests, Deepgram was 2–4x more accurate than their previous STT option on those inputs, and a major healthcare customer doubled their user authentication rates as a direct result. For many contact centers, that's the critical path: if your automation breaks on alphanumeric tokens, you don't get a slightly worse transcript. You route to a human.
Custom Vocabulary and Keyterm Prompting Compared
Both platforms support keyterm prompting with similar limits (term count and length constraints). Operationally, what you need to test isn't only whether the term is recognized, but how often it creates false insertions and whether it destabilizes nearby words.
Two implementation details matter in practice:
- Prompt scope: If you apply the same keyterm list to every stream, you can increase false positives across unrelated calls. A better pattern is to attach keyterms per tenant, per queue, or per intent.
- Fallback behavior: If keyterms are essential (for example, product SKUs), you often need a validation loop: ask the user to confirm, and re-prompt only on low confidence.
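One way to implement that fallback is a confidence-gated confirmation step. A sketch, where `known_skus`, the threshold value, and the return labels are all illustrative assumptions:

```python
def confirm_if_uncertain(token: str, confidence: float,
                         known_skus: set, threshold: float = 0.85):
    """Decide whether an alphanumeric capture can be accepted as-is,
    read back for confirmation, or needs a re-prompt to the caller.

    `known_skus` and `threshold` are placeholders for your own catalog
    and tuning; the three-way outcome drives the dialog flow.
    """
    if token in known_skus:
        return ("accept", token)    # exact catalog match: trust it
    if confidence >= threshold:
        return ("confirm", token)   # plausible: read back and confirm
    return ("reprompt", None)       # low confidence: ask again
```

The key design choice is that only the low-confidence branch costs an extra conversational turn; exact matches pass through with zero added latency.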
ElevenLabs also offers optional entity detection for PII categories (including payment details). That can be useful, but it changes your architecture: you need to decide whether redaction happens inline before storage, or downstream as a separate processing step.
Latency and Concurrency at Production Scale
Latency and concurrency determine whether STT stays stable under load. This is where "works in a demo" turns into "survives Monday morning."
Latency Profiles: Realtime Streaming vs. Production Latency Budgets
Both platforms target real-time conversation, but engineering teams should separate model latency from system latency. TLS handshakes, reconnect behavior, regional routing, and audio chunk sizes can erase small model-level differences.
In a proof-of-concept, capture metrics that map to user experience and agent turn-taking:
- Time-to-first-token
- Partial update cadence
- End-of-utterance finalization timing
- Tail latency under load (P95/P99)
If you only score transcript accuracy, you miss the failure mode where your agent "hears correctly" but responds too late to feel conversational.
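Tail latency is cheap to compute once you log per-response latencies; a nearest-rank percentile is enough for a POC scorecard:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in (0, 100]) over latency samples in ms."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]
```

Score P95/P99 under representative load, not on an idle connection; the gap between the two conditions is often the real finding.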
Concurrency Limits and What They Mean for Contact Centers
Concurrency is more than a rate limit—it's a failure mode.
If you hit a stream cap during a traffic spike, you don't get degraded transcription. You drop sessions, increase retries, and route more volume to human agents. That's why contact center teams test burst behavior (for example, peak arrival windows), queuing strategy, and backoff logic alongside accuracy.
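The backoff half of that logic is small but easy to get wrong. A sketch with capped exponential delay plus jitter; the attempt count, base delay, and cap are assumptions to tune against your own traffic:

```python
import random
import time

def open_stream_with_backoff(connect, max_attempts=5, base=0.5, cap=8.0):
    """Retry transient connection failures with capped exponential backoff
    plus jitter, so a fleet of clients retrying in lockstep doesn't amplify
    the original spike. `connect` is any callable that opens one STT stream
    and raises ConnectionError on failure (an assumed interface).
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                      # exhausted: surface the failure
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))
```

The jitter term matters more than the curve shape: without it, every dropped session retries at the same instant and re-creates the spike that hit the cap.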
Deepgram discloses concurrency and usage limits in its plan and pricing documentation, and also supports self-hosted scaling patterns for workloads that can't tolerate shared cloud constraints.
Batch Processing for High-Volume Workloads
Batch throughput becomes an infrastructure problem when you're processing thousands of hours per day. Even if the transcription engine is fast, your pipeline can still bottleneck on upload throughput, maximum file duration, reprocessing strategy, and timestamp determinism.
If your product depends on subtitles or downstream alignment, test timestamp stability across re-runs. In production, you'll eventually reprocess audio—model updates, bug fixes, or dispute workflows. If timestamps drift meaningfully, you can break subtitle tracks and any analytics keyed to word times.
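A drift check is cheap to automate. A sketch, assuming you can export word-level (word, start-time) pairs from each run and the word sequences already line up one-to-one:

```python
def max_timestamp_drift(run_a, run_b):
    """Largest absolute difference in word start times (seconds) between
    two transcriptions of the same audio. Each run is a list of
    (word, start_seconds) pairs; mismatched words are skipped, so align
    the sequences first if the transcripts themselves differ.
    """
    return max(
        (abs(ta - tb) for (wa, ta), (wb, tb) in zip(run_a, run_b) if wa == wb),
        default=0.0,
    )
```

Run it as a regression gate whenever the vendor ships a model update: a drift threshold of one subtitle frame is a reasonable starting point.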
Compliance and Deployment Architecture
For regulated teams, compliance is less about badges and more about where audio can flow, what can be retained, and what your auditors will ask for.
HIPAA Compliance Paths: BAA Process and Data Residency
ElevenLabs' HIPAA path is built around Zero Retention Mode—strongly recommended for PHI workflows—alongside BAAs and related controls. That can simplify risk, but it can also collide with real QA workflows where you need to retain a subset of transcripts for adjudication or medical record policies.
Deepgram supports HIPAA-aligned controls through self-hosted and VPC deployment options, which lets your organization configure data storage and retention independently rather than defaulting to a single zero-retention mode.
On-Premises vs. Cloud-Only Deployment Trade-offs
If you need to keep audio inside your own infrastructure (or inside a tightly controlled private environment), ElevenLabs' cloud-only Scribe offering is a blocker. Deepgram offers an on-prem deployment option, plus private routing patterns, for teams with air-gap requirements, strict data locality, or contracts that restrict vendor-operated processing.
Even outside regulated industries, deployment affects reliability. If your audio path originates inside a private network—SIP infrastructure, call recording systems, or a locked-down VPC—every extra hop increases failure surface area.
What Regulated Industries Need Before Technical Evaluation
Sequence matters here. Work through these in order before you run a model test:
- Deployment model: Where does audio processing happen physically?
- Retention controls: What is stored, for how long, and who can access it?
- Security review artifacts: What logs, diagrams, and access controls will your auditors require?
- Accuracy last: a blocked procurement means you never reach the model test.
A practical tip: document your "happy path" and your "incident path." Security reviews often focus on what happens during outages, retries, and support escalations—not only on normal operation.
Pricing Transparency and Cost at Scale
Pricing only matters when it matches how your system behaves under load: retries, bursts, multi-tenant metering, and retention workflows all show up on the bill.
Cost Comparison at Common Production Volumes
ElevenLabs does not publish a granular per-hour rate table for Scribe v2 batch; pricing typically sits behind sign-up or a sales conversation. Deepgram publishes its Nova-3 rates openly: per Deepgram's pricing page, Nova-3 Monolingual runs at $0.0077 per minute on Pay-As-You-Go.
Both platforms offer enterprise discounts that can significantly change effective rates. Run a volume estimate against your actual usage profile before comparing sticker prices.
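Using the published Nova-3 rate above, the back-of-envelope math is straightforward; the 10,000-hour volume is the illustrative figure from this article, and the result excludes discounts, retries, and overages:

```python
NOVA3_PER_MIN = 0.0077  # Deepgram Nova-3 Monolingual, Pay-As-You-Go (USD)

def monthly_cost(hours_per_month: float, rate_per_min: float) -> float:
    """List-price spend before enterprise discounts, retries, or overages."""
    return hours_per_month * 60 * rate_per_min

# At 10,000 hours/month this works out to roughly $4,620 at list price.
```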
Billing Models and Predictability
ElevenLabs bills through subscription tiers with included hour allocations, then overages. Deepgram bills usage-based per audio minute, with volume and plan discounts available.
If you run a multi-tenant platform, the billing model affects your product design:
- Subscription-hour models reward steady utilization, but can punish bursty tenants (where you hit overage rates quickly).
- Per-minute usage billing maps cleanly to metered plans, quota enforcement, and pass-through pricing.
A practical forecasting approach is to model steady-state and peak months separately. Most surprises come from peak traffic plus retry storms, not from normal weeks.
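That separation is easy to encode. A sketch; the retry-overhead fraction is an assumption you should replace with measured "paid vs. useful" data from your own pipeline:

```python
def annual_forecast(steady_hours: float, peak_hours: float, peak_months: int,
                    rate_per_min: float, retry_overhead: float = 0.05) -> float:
    """Model steady-state and peak months separately. `retry_overhead` is
    an assumed fraction of re-billed audio from retries and reconnects,
    applied only to peak months where retry storms concentrate.
    """
    steady = (12 - peak_months) * steady_hours * 60 * rate_per_min
    peak = peak_months * peak_hours * 60 * rate_per_min * (1 + retry_overhead)
    return steady + peak
```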
The Bundled Pricing Advantage for Voice Agents
If you're building full voice agents, total system cost often gets decided by everything around STT: retries, reconnections, partial transcripts you pay for but never use, and whether you run parallel streams for analytics.
Deepgram's Voice Agent API uses bundled pricing (~$4.50/hour covering STT, LLM, and TTS) that avoids LLM pass-through surprises when you're trying to keep unit economics stable across tenants.
Choosing the Right STT API for Your Use Case
The cleanest decision rule is this: ElevenLabs Scribe v2 is strongest for content production and multilingual batch workflows; Deepgram Nova-3 is built for high-volume deployments where concurrency and deployment flexibility are non-negotiable.
Use Case Decision Matrix

| Use case | ElevenLabs Scribe v2 | Deepgram Nova-3 |
| --- | --- | --- |
| Multilingual batch transcription (subtitles, archives, compliance recordings) | Strong fit | Acceptable |
| Live voice agents needing realtime speaker labels | Not recommended | Strong fit |
| On-premises or private-cloud deployment | Not recommended | Strong fit |
| High-concurrency contact center streaming | Acceptable | Strong fit |

"Strong fit" reflects documented feature support and common production constraints; "Acceptable" reflects functional but not ideal; "Not recommended" reflects documented architectural limitations.
Start Testing with Deepgram
If you want to validate this on your own audio, use the Deepgram Console to run a short proof-of-concept and compare accuracy, endpointing behavior, and cost under load. Get $200 in free credits to test Nova-3—no credit card required.
FAQ
How Should You Run a Fair STT Proof-of-Concept?
Use the same audio set, normalize sample rates, and keep post-processing identical (punctuation, formatting, redaction). Score accuracy by segment type (IDs, names, addresses), not only aggregate WER, and record operational metrics like reconnect frequency and end-of-utterance behavior.
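Aggregate WER hides exactly the token classes that matter. A minimal token-level WER you can apply per segment type (IDs, names, addresses); this is the standard edit-distance formulation, not any vendor's scoring tool:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    normalized by reference length. Apply per segment category rather
    than only over the full transcript.
    """
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))            # one rolling DP row
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rw != hw))   # substitution
    return d[len(h)] / max(len(r), 1)
```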
What If You Need Realtime Diarization?
Start by checking whether you can avoid ML diarization entirely: many contact centers can capture dual-channel audio (agent and customer on separate tracks), which gives you speaker labeling in real time with near-zero extra latency.
If you must diarize mixed audio live, plan for added delay. Most pipelines run streaming STT for words, run a rolling diarization window in parallel (for example, with PyAnnote Audio), then join the two using word timestamps plus diarization segments (often exported as RTTM-like intervals). In practice, you'll usually trade 1 to 3 seconds of extra lag for more stable speaker turns, and you'll need smoothing logic to prevent speaker "flip-flops" mid-utterance.
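The join step at the end is mechanical once both tracks emit timestamps. A sketch assuming word tuples of (word, start, end) and diarization segments of (speaker, start, end), all in seconds:

```python
def label_words(words, segments):
    """Assign each word a speaker by checking which diarization segment
    contains the word's temporal midpoint. Words falling outside every
    segment get "unknown" and are candidates for smoothing logic.
    """
    out = []
    for word, start, end in words:
        mid = (start + end) / 2
        speaker = next((spk for spk, s, e in segments if s <= mid < e), "unknown")
        out.append((word, speaker))
    return out
```

Midpoint matching is deliberately simple: it tolerates small boundary disagreements between the STT and diarization tracks, which is where most flip-flops originate.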
What Breaks Multi-Tenant Cost Predictability?
Retries and bursts. Add explicit budget guardrails per tenant (quotas, circuit breakers, and queueing), and measure "paid audio seconds" versus "useful transcript seconds." The delta between those numbers is where surprise margin loss shows up.
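The metric itself is one line; the real work is instrumenting both counters per tenant:

```python
def wasted_spend_fraction(paid_seconds: float, useful_seconds: float) -> float:
    """Fraction of billed audio that produced no usable transcript
    (retries, abandoned sessions, duplicate parallel streams).
    """
    return 0.0 if paid_seconds <= 0 else 1.0 - useful_seconds / paid_seconds
```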
What Do HIPAA Security Reviews Commonly Ask For?
Auditors often focus on retention and support access, not just encryption. Have ready: data-flow diagrams, incident procedures, and proof of how you prevent PHI from landing in logs, debug traces, or downstream analytics stores.
When Does On-Premises STT Beat Private Cloud?
Choose on-prem when contracts or policy require full data locality, when your telephony stack lives inside a restricted network, or when you need deterministic capacity planning without shared-cloud concurrency ceilings. Private cloud is often faster to procure, but still introduces vendor-operated infrastructure risk.

