Yes, ElevenLabs offers speech-to-text through its Scribe model, but it started as a secondary product on a voice synthesis platform, and that origin shapes what Scribe can and can't do in production. An ACM study on commercial ASR services shows real-world accuracy varies meaningfully across providers depending on audio conditions, and that gap has direct cost implications at production scale.
This article helps you evaluate whether Scribe fits your stack or whether a dedicated layer like Deepgram speech-to-text is a better match.
Key Takeaways
Scribe can work well inside an ElevenLabs-first stack, but shared credits, shared concurrency, and sales-gated compliance are the constraints that tend to show up only after you ship.
- Scribe v2 handles batch processing with diarization; Scribe v2 Realtime handles streaming without it.
- Concurrency and credits are shared across all ElevenLabs services, so heavy TTS usage directly reduces your available STT capacity.
- HIPAA compliance requires an Enterprise tier subscription and a sales-negotiated BAA; it isn't self-serve.
- On-premises deployment for STT has no confirmed availability; announced on-prem options reference TTS-specific language.
What ElevenLabs Actually Offers for Speech-to-Text
ElevenLabs speech-to-text is convenient if you already live in their ecosystem, but STT usage shares credits and concurrency with the rest of the platform.
Scribe v2 And Scribe v2 Realtime: What Each Does
ElevenLabs ships two distinct STT products. Scribe docs cover Scribe v2 (batch) and Scribe v2 Realtime (streaming), including limits like max upload sizes, supported languages, and word-level timestamps.
The key trade-off: Scribe v2 Realtime prioritizes low latency by omitting diarization. If your application needs real-time speaker labeling, the streaming product won't meet that requirement. The common workaround is a dual-pass design: streaming for live agent actions and barge-in detection, batch for post-call analytics and QA.
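The dual-pass design described above can be sketched in a few lines. The provider calls here are stand-ins, not real SDK calls: the streaming pass drives live agent behavior without speaker labels, and a batch pass with diarization runs on the saved recording afterward.

```python
# Sketch of a dual-pass STT design: a low-latency streaming pass for live
# agent actions, plus a post-call batch pass with diarization for QA.
# Both "provider" functions are simulated stand-ins for illustration.

def streaming_pass(audio_chunks):
    """Low-latency pass: emit a growing partial transcript per chunk.
    Stand-in for a realtime STT session (no speaker labels)."""
    partials = []
    for chunk in audio_chunks:
        partials.append(chunk)      # pretend each chunk transcribes to itself
        yield " ".join(partials)    # growing partial hypothesis

def batch_pass(recording):
    """Post-call pass: full-file transcription with diarization.
    Stand-in for a batch STT request on the saved recording."""
    return [{"speaker": i % 2, "text": utt} for i, utt in enumerate(recording)]

call_audio = ["hello", "my account", "is 4521"]

# Live path: act on the newest partial (e.g., barge-in detection).
for partial in streaming_pass(call_audio):
    latest = partial

# Offline path: diarized transcript for post-call analytics.
diarized = batch_pass(call_audio)

print(latest)       # hello my account is 4521
print(diarized[0])  # {'speaker': 0, 'text': 'hello'}
```

The point of the split is that each pass can fail, scale, and be billed independently: a backlog in the analytics queue never delays the live agent.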
Where ElevenLabs STT Fits in Its Platform
Scribe sits inside a platform whose primary revenue driver is voice generation. TTS, voice cloning, and AI agents are the core products; STT is the supporting player. Your STT capacity isn't isolated: it competes with every other service on your account, which means you can't reason about STT headroom without also accounting for TTS, voice cloning, and anything else your team ships. If you're running STT and TTS in the same request path, rate limiting can show up as "my agent got quiet" and "my agent stopped listening" at the same time.
Supported Languages, Diarization, And Audio Features
Scribe supports multilingual transcription with automatic language detection. The batch product supports diarization and multichannel audio, but not simultaneously; you have to pick one. If your upstream system produces separate channels per participant (dual-channel call recording, for example), multichannel can be a cleaner alternative to diarization. If you only have a mixed mono track, diarization is usually the better bet. For production evaluation, test beyond basic transcription: numbers and identifiers (account IDs, phone numbers) are where downstream workflows fail first; timestamps need to stay stable when the model revises earlier hypotheses; turn boundaries need to match your silence-handling assumptions.
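For the numbers-and-identifiers check above, a crude positional digit comparison is often enough to flag problems before a full evaluation harness exists. This is a minimal sketch, not a standard metric definition:

```python
import re

def digit_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of reference digits the hypothesis got wrong. A blunt
    positional check, good enough to flag account IDs and phone numbers
    where a single wrong digit breaks the downstream workflow."""
    ref = re.sub(r"\D", "", reference)
    hyp = re.sub(r"\D", "", hypothesis)
    if not ref:
        return 0.0
    # Pad to equal length so dropped or inserted digits count as errors.
    width = max(len(ref), len(hyp))
    errors = sum(r != h for r, h in zip(ref.ljust(width, "#"),
                                        hyp.ljust(width, "#")))
    return errors / len(ref)

print(digit_error_rate("account 4521 9983", "account 4521 9883"))  # 0.125
```

One wrong digit in eight is a 12.5% digit error rate; a word-level WER number can hide exactly this kind of failure.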
How ElevenLabs STT Performs in Production
Scribe can look great on clean audio, but production outcomes depend on your actual conditions and shared-capacity behavior.
Accuracy Claims And What Benchmarks Cover
Vendor benchmarks have real limits: run your own audio before committing. ElevenLabs publishes accuracy claims for Scribe, but those numbers come from the vendor itself, not an independent source. The ACM study linked above evaluated multiple commercial ASR services on challenging audio with background noise and multiple speakers, and found wide spreads in Word Error Rate. Small deltas matter a lot downstream. A few percentage points of WER on alphanumeric data like account numbers or claim IDs can break entire downstream workflows.
Latency: What To Measure in a Live Agent Pipeline
Published latency numbers are a rough benchmark; measure your own pipeline before optimizing. Treat latency as a budget: measure time to first partial transcript, time to stable finalization, and how often partial hypotheses get revised. Revisions are where "fast" STT can still break agent logic.
It's worth breaking "STT latency" into the pieces you can actually tune: audio chunking and buffering (larger chunks reduce overhead but increase perceived delay), endpointing and turn detection (aggressive endpointing speeds up responses but can cut users off), and network jitter handling (reconnect behavior can create transcript bursts that confuse downstream logic). You want end-to-end timing from microphone frame to the moment your app is confident enough to act, not just the provider's published number.
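Those per-utterance metrics are easy to collect with a small bookkeeping object. This sketch simulates the streaming events; in a real pipeline the timestamps would come from your provider's partial and final callbacks:

```python
# Minimal sketch of the latency metrics worth tracking per utterance:
# time to first partial, time to stable finalization, and how many times
# earlier partials were revised. Events here are simulated.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UtteranceMetrics:
    start: float                          # microphone frame timestamp
    first_partial: Optional[float] = None
    finalized: Optional[float] = None
    revisions: int = 0
    _last_partial: str = ""

    def on_partial(self, t: float, text: str):
        if self.first_partial is None:
            self.first_partial = t
        # A revision rewrites earlier words rather than just appending.
        if self._last_partial and not text.startswith(self._last_partial):
            self.revisions += 1
        self._last_partial = text

    def on_final(self, t: float):
        self.finalized = t

m = UtteranceMetrics(start=0.0)
m.on_partial(0.25, "i need")
m.on_partial(0.60, "i need to check")      # append: not a revision
m.on_partial(0.90, "i need to check my")   # append
m.on_partial(1.10, "i meant to check my")  # rewrite: counts as a revision
m.on_final(1.40)

print(m.first_partial - m.start)  # 0.25  time to first partial
print(m.finalized - m.start)      # 1.4   time to stable finalization
print(m.revisions)                # 1
```

A provider with a fast first partial but frequent rewrites can still force your agent to retract actions it already took, which is why the revision count matters as much as the headline latency.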
The Shared Concurrency Problem
Concurrency and rate limiting are shared across services in the ElevenLabs account. If your product runs heavy TTS during peak periods, available STT throughput can drop exactly when you need it most. Plan for queues, backpressure, and graceful fallbacks early. A few patterns that reduce the blast radius: isolate critical STT paths from non-critical transcription jobs (like post-call analytics) with separate queues; enforce your own per-feature rate caps so a TTS experiment can't starve your transcription path; and always test mixed STT-plus-TTS traffic under realistic load, not just STT alone.
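The per-feature rate caps mentioned above can be as simple as partitioning the vendor's shared limit with your own semaphores. The cap values and feature names below are illustrative, not ElevenLabs settings:

```python
# Sketch of per-feature concurrency caps so one workload can't starve
# another when the vendor enforces a single shared account limit.
import threading

SHARED_VENDOR_LIMIT = 10  # total concurrent requests the account allows

# Partition the shared pool yourself: critical live STT gets a reserved
# slice; batch analytics and TTS experiments share what's left.
caps = {
    "stt_live": threading.Semaphore(5),
    "stt_batch": threading.Semaphore(3),
    "tts": threading.Semaphore(2),
}

def call_vendor(feature: str, request_fn, *args):
    """Run a vendor call inside its feature's cap. Blocks locally (rather
    than erroring at the vendor) when that feature's slice is exhausted."""
    with caps[feature]:
        return request_fn(*args)

# Even if a TTS experiment floods the 'tts' slice, 'stt_live' still has
# five reserved slots. The request_fn here is a stand-in for a real call.
result = call_vendor("stt_live", lambda path: f"transcript of {path}", "call.wav")
print(result)  # transcript of call.wav
```

The design choice is to fail (or queue) inside your own process, where you control priorities, instead of letting the vendor's shared limiter decide which feature loses.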
Where ElevenLabs STT Falls Short for Enterprise Deployments
For regulated or high-volume workloads, the biggest risks are compliance onboarding and unclear deployment options.
Unclear On-Premises Coverage
ElevenLabs has discussed on-premises and private-cloud options, but STT coverage isn't consistently documented. If your environment requires audio to stay within a controlled network (common in healthcare, financial services, and some government workloads), confirm in writing whether Scribe specifically is supported in the deployment model you need. Get clarity on the operational details: how updates are delivered, what gets logged and where, and what the incident response process looks like when audio processing is on your infrastructure.
HIPAA Compliance Is Sales-Gated
ElevenLabs supports HIPAA workflows, but the path isn't self-serve. The HIPAA doc describes the Enterprise tier and BAA requirements. That sales-gated step can become your critical path: procurement timelines for healthcare enterprise deals often run 6–12 weeks once compliance reviews start. Ask early about BAA timelines, confirm what data is retained by default during normal operation and for debugging, and verify what configuration changes are required to meet your organization's specific audit requirements.
Choosing the Right STT for Your Production Stack
Scribe is typically best when transcription supports an ElevenLabs-centered voice stack. When transcription is the product, dedicated STT infrastructure tends to be lower risk.
Use Scribe if you're building primarily for voice generation and need transcription as a supporting feature: captioning generated audio, transcribing content for accessibility, or prototyping a voice agent within a single ecosystem. Keeping everything under one vendor reduces integration overhead and simplifies billing during early-stage development. The trade-offs become more relevant as you scale, not on day one.
Use dedicated STT when transcription is core to your product. The constraints of a TTS-first platform (shared capacity, opaque limits, cross-service billing) start to matter more when transcription failures show up as "your product is down" rather than "the transcript is a bit off." The Sharpen case reports over 90% accuracy and an 8x cost reduction after moving to dedicated STT infrastructure. Running Deepgram for STT and ElevenLabs for voice output is a common split that isolates failure modes and eliminates shared-pool contention.
ElevenLabs STT Pricing Compared to Dedicated Providers
ElevenLabs pricing can look inexpensive on paper, but forecasting depends on shared credits and how other services consume the same pool. ElevenLabs lists STT rates on its Pricing page, but billing flows through a shared credit system. A month of heavy voice generation can push STT into overage pricing. Model "credit contention" during your POC: forecast peak-hour TTS usage, reserve a buffer for transcription, and verify the split holds under load before you commit to contractual SLAs.
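Modeling credit contention during a POC can be a back-of-envelope calculation. Every number below is an illustrative placeholder, not an ElevenLabs rate; the point is the shape of the check, not the figures:

```python
# Back-of-envelope model of shared-pool "credit contention": does peak
# TTS usage leave enough credits for transcription? All rates and volumes
# are illustrative assumptions; substitute your plan's real numbers.
monthly_credits = 1_000_000

tts_chars_peak_month = 2_500_000   # forecast characters synthesized
tts_credits_per_char = 0.3         # assumed plan rate

stt_minutes_month = 40_000         # forecast audio minutes transcribed
stt_credits_per_min = 6.0          # assumed plan rate
stt_buffer = 1.25                  # 25% headroom for traffic spikes

tts_spend = tts_chars_peak_month * tts_credits_per_char
stt_spend = stt_minutes_month * stt_credits_per_min * stt_buffer

remaining = monthly_credits - tts_spend - stt_spend
print(f"TTS: {tts_spend:,.0f}  STT (buffered): {stt_spend:,.0f}  "
      f"remaining: {remaining:,.0f}")
if remaining < 0:
    print("Shared pool exhausted: expect overage pricing or throttling.")
```

In this toy scenario the pool comes up 50,000 credits short, which is exactly the failure mode that's invisible if you forecast STT in isolation.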
Deepgram pricing uses transparent per-minute billing isolated to STT. Nova-3 is $0.0043/min for pre-recorded audio. Your transcription bill doesn't fluctuate when another team runs a big TTS experiment.
How to Evaluate STT for Your Production Stack
A solid POC answers five questions before you commit:
- Transcript quality on your audio: test real samples from your worst conditions, including overlaps, speakerphones, background noise, and accented speech.
- Full-pipeline streaming behavior: measure wall-clock latency from microphone frame to token arrival, and how often partial hypotheses get revised.
- Limit handling under load: push concurrency until you see throttling and confirm your backoff strategy holds.
- Shared-credit interference: run TTS spiking during your peak STT window to find out whether you need isolation.
- Data handling and retention: confirm what gets stored, what's logged, and what controls apply to regulated data flows.
A POC that answers these questions gives you the information you need to design retries, size queues, and forecast costs rather than discovering the answers in production.
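For the limit-handling question, the retry behavior worth validating looks roughly like this. The throttled provider is simulated; a real version would catch your SDK's rate-limit exception and sleep between attempts:

```python
# Sketch of the retry/backoff behavior a POC should validate: exponential
# delay with full jitter, capped attempts, and a hard failure once the
# retry budget is spent. The transcribe call is a simulated stand-in.
import random

def transcribe_with_backoff(request_fn, max_attempts=5, base_delay=0.5):
    delays = []
    for attempt in range(max_attempts):
        try:
            return request_fn(), delays
        except RuntimeError:  # stand-in for a 429/throttle error
            # Full jitter avoids synchronized thundering-herd retries.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            delays.append(delay)   # in production: time.sleep(delay)
    raise TimeoutError(f"gave up after {max_attempts} throttled attempts")

# Simulate a provider that throttles twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "transcript"

result, delays = transcribe_with_backoff(flaky)
print(result, len(delays))  # transcript 2
```

Pushing concurrency until this path actually triggers, and confirming the delays stay within your latency budget, is the part most POCs skip.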
Get Started With Deepgram
If you want to test how a dedicated STT provider handles your specific audio conditions, latency requirements, and concurrency needs, use the Deepgram Console to get $200 in free credits, no credit card required. Run your audio through Nova-3, compare accuracy and streaming behavior against your current setup, and see whether isolated STT infrastructure changes the math for your production workload.
FAQ
Is ElevenLabs Speech-to-Text Free?
Free enough for smoke tests, not realistic load. ElevenLabs offers a free tier, but shared credits deplete quickly once you're running both TTS and STT at the same time. Keep evaluations representative: include your worst audio conditions, run concurrent requests at the volume you actually expect, and track how quickly the shared credit pool drains before you start forecasting production costs.
How Accurate Is ElevenLabs Scribe Compared to Deepgram?
Score both on your own audio. Build a small gold set, create human reference transcripts, and compare Word Error Rate, digit error rate, and diarization stability. Test downstream impact: does punctuation or speaker attribution change your analytics or agent decisions?
Does ElevenLabs STT Support Real-Time Transcription, and What Languages Does It Cover?
Yes, but validate streaming behavior beyond "it works." Check how it handles interruptions (barge-in), packet loss, and reconnects, since these are where real-time STT breaks in live agent pipelines. For multilingual calls, test dialect edge cases and consider forcing language hints rather than relying on auto-detect, especially when the first few seconds include greetings, hold music, or IVR beeps that can confuse language detection.
Can ElevenLabs Speech-to-Text Handle HIPAA-Compliant Workloads?
Treat it as a security workflow, not a feature. Ask for a clear data flow covering audio, transcripts, and logs; retention defaults; deletion controls; and support access policies. The Vida Health case is a concrete reference for what production HIPAA STT looks like.
What Audio Formats Does ElevenLabs Scribe Support?
The bigger risk is preprocessing, not file extensions. Standardize your ingest early: consistent sample rates, correct channel mapping, and predictable loudness levels. For telephony audio, verify your μ-law to PCM conversion is correct and avoid double-compression artifacts. For streaming, test different chunk sizes and buffering configurations. "Valid audio" can still produce worse accuracy if frames arrive irregularly or with inconsistent timing.
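For the μ-law verification step, a reference decoder is handy for sanity-checking your telephony conversion against known values. This is a sketch of the standard G.711 μ-law expansion; production code would normally use a vetted audio library rather than hand-rolling it:

```python
# Sketch of G.711 mu-law decoding for sanity-checking a telephony
# conversion step: decode one 8-bit mu-law code to 16-bit linear PCM.
def mulaw_to_pcm16(code: int) -> int:
    code = ~code & 0xFF            # mu-law bytes are stored complemented
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    # Reconstruct magnitude with the G.711 bias (0x84), then remove it.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

print(mulaw_to_pcm16(0x00))  # -32124 (most negative code)
print(mulaw_to_pcm16(0xFF))  # 0 (closest to silence)
```

Decoding a few known codes like these and comparing against your pipeline's output is a quick way to catch the sign-flip and bias mistakes that otherwise surface as mysteriously degraded accuracy.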

