Yes, you can build a real-time ElevenLabs voice agent, and you can probably get a working demo running in a day. The production risk is that small timing problems quickly turn into real cost: using the U.S. Bureau of Labor Statistics median pay for customer service representatives ($19.08/hour), an extra 10 seconds of avoidable handle time across 10 million calls in a year is roughly $530,000 in wages alone (10M calls × 10s = 27,778 hours), before benefits, occupancy, or vendor overhead (BLS OES 43-4051). This article helps you decide whether ElevenLabs' integrated stack will hold up under your audio conditions and call volumes, or whether you should decouple STT and endpointing to protect latency, containment, and user experience.
Key Takeaways
Key evaluation points for engineering teams:
- ElevenLabs' integrated platform handles STT, LLM, and TTS in a single session with telephony, tool calling, and RAG out of the box.
- Scribe v2 Realtime reports top-tier accuracy on independent WER benchmarks, but vendor-neutral streaming latency data is limited.
- Turn detection uses hybrid VAD plus deep learning, but no false interruption rates are published for noisy environments.
- Concurrent calls cap at 30 on Scale and Business tiers; higher requires Enterprise.
- For high-volume contact centers or noisy and bilingual audio, decoupling STT from TTS gives more control.
What ElevenLabs' Voice Agent Platform Actually Does
ElevenLabs' voice agent stack is fastest when you want one provider to own the whole loop: audio in, text and tool calls in the middle, audio out. The real question is whether that chain holds under production conditions where audio quality, concurrency, and conversation dynamics are unpredictable.
The Modular Pipeline Architecture
ElevenLabs Conversational AI connects separate STT, LLM, and TTS components into a single session. The platform handles turn detection, session management, and tool calling natively. For telephony, it supports common providers and SIP-based integrations, so you don't have to stitch the lifecycle together yourself.
What the Conversational AI 2.0 Platform Includes
Conversational AI 2.0 added features that matter for production deployments: a custom turn-taking model that goes beyond simple silence detection, a RAG knowledge base supporting text, URL, and file ingestion, and batch calling APIs for outbound campaigns with per-recipient personalization. The platform also supports OAuth2 and API key authentication for server-side tool calling, letting agents trigger backend functions mid-conversation.
Where the Platform Ends and Your Stack Begins
ElevenLabs doesn't offer model customization for specialized terminology in its STT layer. There's no true on-premises deployment; enterprise options are limited to cloud and private-cloud style deployments. And while the platform's STT works well on clean audio, there are no published performance metrics for its behavior under telephony-grade noise (background chatter, speakerphone echo, cellular compression). That gap matters most for the teams who need this platform the most: contact centers.
The STT Layer: Where Most Voice Agents Break First
If you're debugging a voice agent that "feels slow" or "keeps misunderstanding people," STT is usually where the first cracks show. ElevenLabs' batch transcription model delivers strong accuracy on pre-recorded audio, but live voice agents require a streaming model, specifically Scribe v2 Realtime.
Scribe Is Not a Real-Time Model
ElevenLabs' Scribe batch transcription model is architecturally distinct from Scribe v2 Realtime. The batch model processes pre-recorded audio files and can't be plugged into a live agent session; the Conversational AI platform automatically uses the streaming variant.
Streaming STT for Live Agents: What ElevenLabs Offers
Scribe v2 Realtime ranks at the top of the AA-WER v2 evaluation. That makes it a serious option on accuracy.
But accuracy scores like WER don't tell the full story for voice agents. Vendor-neutral, reproducible streaming latency benchmarks for Scribe v2 Realtime are limited, and streaming latency is what determines whether your agent feels responsive or sluggish.
Endpointing: The Hidden Variable in Agent Latency
Accurate transcription alone doesn't make a responsive agent. The agent also needs to know when the caller has finished speaking. This decision point, called endpointing, determines how quickly the agent can start generating a response.
Even if STT processes speech in under 200ms, a conservative endpointing threshold that waits too long for silence can add hundreds of milliseconds before the LLM even receives the transcript. Endpointing is where transcription accuracy turns into conversational timing, and it's the transition point where most voice agents quietly fall apart.
Endpointing and Turn Detection: The Bottleneck Nobody Demos
Turn detection decides whether your agent feels interruptible and human, or like a voicemail system with a nice voice. Endpointing is the difference between a snappy agent and one that feels like it's always half a beat late.
Why Silence-Based VAD Fails in Production
Traditional voice activity detection (VAD) works by measuring silence: if the caller stops producing sound for a set threshold (say, 500ms), the system assumes they're done. This breaks in two predictable ways. First, callers pause mid-sentence to think, and the system cuts them off. Second, background noise from a drive-thru speaker, an office speakerphone, or cellular compression fills the silence gap, so the system never detects end-of-turn. Both failure modes feel terrible to the caller: either they get interrupted, or they sit waiting while the agent does nothing.
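Both failure modes fall out of the same mechanism, and a toy sketch makes them concrete. The function below is a deliberately naive silence-based endpointer (the frame size, threshold, and energy values are illustrative assumptions, not any vendor's implementation):

```python
def detect_end_of_turn(frame_energies, noise_floor=0.02, silence_ms=500, frame_ms=20):
    """Toy silence-based endpointing: declare end-of-turn after `silence_ms`
    of consecutive frames whose energy falls below `noise_floor`."""
    needed = silence_ms // frame_ms
    quiet = 0
    for i, energy in enumerate(frame_energies):
        quiet = quiet + 1 if energy < noise_floor else 0
        if quiet >= needed:
            return (i + 1) * frame_ms  # ms into the audio when the turn "ended"
    return None  # never detected an end of turn

# Failure mode 1: a 600ms thinking pause mid-sentence trips the detector early.
thinking_pause = [0.5] * 10 + [0.0] * 30 + [0.5] * 10

# Failure mode 2: background noise keeps energy above the floor forever.
noisy_line = [0.5] * 10 + [0.05] * 100  # caller stopped, but noise never drops
```

Run against the first clip, the detector fires at 700ms, in the middle of the caller's pause; against the second, it never fires at all. Those are exactly the interrupt-or-stall outcomes described above.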
ElevenLabs' Turn-Taking Model and Its Tradeoffs
ElevenLabs' Conversational AI 2.0 uses a hybrid system combining VAD with deep learning models that analyze filler words, prosody, speech rhythm, and micro-pauses. You can configure turn eagerness across three settings (Eager, Normal, Patient) and adjust turn timeout between 1 and 30 seconds.
The architecture is more sophisticated than raw silence detection. But ElevenLabs publishes no quantitative metrics on false interruption rates, no precision/recall data for turn-taking decisions, and no performance benchmarks specific to telephony or high-noise environments.
What Tight Coupling Between STT and Endpointing Gains You
When STT and endpointing run as separate pipeline stages, every handoff between them adds latency and information loss. Fusing transcription and end-of-turn detection into a single model lets the system use semantic completeness, not just acoustic silence, to decide when a speaker is done.
Deepgram's Voice Agent API takes this approach, using semantic completeness rather than silence thresholds to detect turn boundaries. The practical result is fewer false interruptions in noisy environments without sacrificing response speed.
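To see why semantic signals help, here is a toy illustration (illustration only: real systems use a trained model, not a word list, and the marker words and thresholds below are invented for the sketch):

```python
# Words that suggest the speaker isn't done, even across a long pause.
TRAILING_INCOMPLETE = {"and", "but", "so", "because", "um", "uh", "the", "to", "my"}

def looks_complete(transcript):
    """Toy semantic-completeness check: a turn ending in a connective or
    filler word is probably mid-thought, however long the silence is."""
    words = transcript.lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in TRAILING_INCOMPLETE

def end_of_turn(transcript, silence_ms, base_threshold_ms=300):
    """Silence still matters, but an incomplete-sounding transcript earns a
    longer grace period before the agent takes the floor."""
    threshold = base_threshold_ms if looks_complete(transcript) else 4 * base_threshold_ms
    return silence_ms >= threshold
```

With this structure, "I want to check my balance" plus 350ms of silence ends the turn quickly, while "I want to check my..." plus the same 350ms does not, which is the tradeoff a fused model makes automatically.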
Total Latency Budget: Adding Up the Real Numbers
If your agent feels "off," the cause is usually cumulative delay across the pipeline, not one slow component. A voice agent's perceptible delay is the sum of every component's delay.
STT Plus LLM Plus TTS: How the Math Compounds
Here's a conservative walkthrough. STT processing takes 150-200ms. Endpointing detection adds its own delay. The LLM needs 200-400ms for time to first token. TTS generation adds another 75-135ms. Even before network overhead, you're looking at roughly 500ms under optimal conditions.
The VAQI benchmark measured ElevenLabs at approximately 530ms in controlled environments, which aligns with that math.
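The arithmetic is worth writing down explicitly. This sketch sums the per-stage ranges above; the endpointing range is an assumption (ElevenLabs doesn't publish one), so treat it as a placeholder to replace with your own measurements:

```python
# Per-stage latency budget in ms (low, high). The endpointing range is an
# assumed placeholder; the other ranges follow the walkthrough above.
PIPELINE = {
    "stt_processing": (150, 200),
    "endpointing_wait": (100, 300),
    "llm_first_token": (200, 400),
    "tts_first_audio": (75, 135),
}

def latency_budget(stages):
    """Return (best_case_ms, worst_case_ms) before any network overhead."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst
```

Summing the low ends gives 525ms and the high ends 1,035ms, which is why a ~530ms controlled-environment measurement is plausible and why real networks can push past the 1-second mark.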
ElevenLabs' Claimed Versus Measured Latency
Measured latency varies widely by geography, network path, and implementation details, so third-party measurements matter more than per-component specs.
If you're trying to predict production feel from a demo, the most dependable move is simple: measure what your users experience from the same regions and networks they use.
When the Latency Budget Runs Out
Conversational design research generally places the threshold for natural-feeling responses under 300ms, with 300–600ms acceptable and anything over 1 second feeling robotic (see VAQI for measurement framing). If your ElevenLabs voice agent is already hitting 500ms+ in controlled conditions, production deployments with real network hops and concurrent load will push past that threshold for many callers.
One practical way to de-risk this is to measure complete pipeline timing the way your users feel it, not how vendors report it. Track (1) last-user-audio to first-agent-audio, (2) barge-in success rate (how often a user can interrupt cleanly), and (3) p95 and p99 tail latency during load tests, not just median. Those three numbers tend to expose endpointing conservatism, queuing behavior, and cross-region routing issues earlier than transcript-level accuracy metrics.
To make those metrics actionable (and debuggable), treat each call like a trace. Assign a correlation ID per call, then log monotonic timestamps for the same lifecycle points on every request: ts_last_user_audio_ingest, ts_stt_final, ts_llm_first_token, ts_tts_first_chunk, and ts_first_agent_audio_sent. When a user barges in, log a separate event_barge_in marker plus whether you canceled playback successfully or kept streaming. That one extra row of structured logs often turns "it feels slow" into a precise diagnosis like "endpointing waited" or "TTS started fast but audio egress queued," which is exactly the difference between a quick fix and a week of guessing.
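A minimal version of that per-call trace can be sketched in a few lines. The class and field names below mirror the lifecycle points named above but are otherwise assumptions, not any vendor's SDK:

```python
import time

class CallTrace:
    """Minimal per-call trace: record monotonic timestamps for the lifecycle
    points above, then derive the latencies you actually debug with."""
    def __init__(self, correlation_id):
        self.correlation_id = correlation_id
        self.ts = {}
        self.events = []

    def mark(self, name):
        self.ts[name] = time.monotonic()

    def barge_in(self, playback_cancelled):
        # Log barge-ins separately, including whether playback stopped cleanly.
        self.events.append(("event_barge_in", playback_cancelled))

    def report(self):
        def delta(a, b):
            if a in self.ts and b in self.ts:
                return (self.ts[b] - self.ts[a]) * 1000
            return None
        return {
            "correlation_id": self.correlation_id,
            "user_to_agent_audio_ms": delta("ts_last_user_audio_ingest", "ts_first_agent_audio_sent"),
            "endpointing_ms": delta("ts_last_user_audio_ingest", "ts_stt_final"),
            "llm_ms": delta("ts_stt_final", "ts_llm_first_token"),
            "tts_ms": delta("ts_llm_first_token", "ts_tts_first_chunk"),
            "barge_ins": len(self.events),
        }
```

Aggregating `report()` rows across calls gives you the p95/p99 tail numbers directly, and the per-stage deltas tell you which stage ate the budget on any individual slow call.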
Production Constraints Worth Knowing Before You Deploy
ElevenLabs is a solid choice for moderate-volume deployments, but the limits that show up at scale are concrete and sometimes surprising. The constraints that matter most are concurrent call limits, language-switching behavior, and the absence of true on-premises deployment.
Concurrency Limits and What They Mean for Scale
ElevenLabs' official concurrency limits list concurrent call limits by plan: Free (4), Starter (6), Creator (10), Pro (20), Scale (30), and Business (30). Exceeding your limit returns HTTP 429 with a concurrent_limit_exceeded error. Teams needing more than 30 simultaneous calls must negotiate an Enterprise contract.
In the real world, concurrency requirements can jump from "a few pilot calls" to "enterprise peaks" quickly. In that world, a 30-call cap is not just a pricing detail; it changes whether you design around throttling and queuing, or whether you pick infrastructure that is already built for high concurrency.
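If you stay under the cap, you still need a plan for the moments you hit it. A common pattern is exponential backoff with jitter around the call-placement path; `place_call` and `ConcurrencyLimitError` below are hypothetical stand-ins for whatever your transport layer raises on an HTTP 429 `concurrent_limit_exceeded` response:

```python
import random
import time

class ConcurrencyLimitError(Exception):
    """Stand-in for the error your transport layer raises on HTTP 429
    concurrent_limit_exceeded responses (hypothetical name)."""

def call_with_backoff(place_call, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry a call that can hit the plan's concurrent-call cap.
    `place_call` is an assumed callable wrapping your vendor SDK."""
    for attempt in range(max_retries):
        try:
            return place_call()
        except ConcurrencyLimitError:
            # Exponential backoff with jitter so queued calls don't retry in lockstep.
            sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    raise ConcurrencyLimitError(f"still throttled after {max_retries} attempts")
```

Note that backoff only smooths short bursts; if your steady-state demand exceeds the cap, retries just become a queue, and the real fix is a higher tier or a different provider.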
For comparison, Deepgram's Speech-to-Text infrastructure is built for high-volume telephony workloads, including 140,000-plus simultaneous calls.
Multilingual and Specialty Terminology Handling
ElevenLabs' TTS fixes language per call. You can't send mixed-language text like "Hello, ¿cómo estás?" and get natural code-switching within a single utterance. Bilingual conversations require your application to detect the language, segment the text, and make separate synthesis calls.
Entity pronunciation also often requires manual text normalization before sending to the API (for example, forcing how you want product SKUs, initials, or addresses spoken).
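That normalization step usually lives in your application, not the vendor API. Here is a minimal sketch of the idea (the rules, regex, and abbreviation list are assumptions to adapt to your own domain, not an ElevenLabs feature):

```python
import re

def normalize_for_tts(text):
    """Toy pre-TTS normalizer: force SKU-like tokens to be spelled out
    letter by letter and expand a small set of abbreviations."""
    def spell(match):
        return " ".join(match.group(0))  # "AB1234" -> "A B 1 2 3 4"

    # SKU-like tokens: 2+ uppercase letters followed by digits.
    text = re.sub(r"\b[A-Z]{2,}\d+\b", spell, text)

    # Expand known abbreviations (extend with your own domain list).
    replacements = {"St.": "Street", "Apt.": "Apartment"}
    for short, full in replacements.items():
        text = text.replace(short, full)
    return text
```

For example, `normalize_for_tts("Ship SKU AB1234 to 5 Main St.")` yields "Ship SKU A B 1 2 3 4 to 5 Main Street", which most TTS voices will read the way a caller expects.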
Compliance and Deployment Options at Enterprise Tier
ElevenLabs offers HIPAA-eligible configurations with BAAs and common security certifications, plus EU data residency endpoints. Zero Retention Mode processes data exclusively in volatile memory and is activated per API call via enable_logging=false.
What's not available is true on-premises hardware deployment. Organizations requiring air-gapped systems need a different provider.
Choosing the Right Architecture for Your Voice Agent
Your architecture choice comes down to one practical question: do you want one provider to own the whole speech loop, or do you need control over STT and endpointing while keeping ElevenLabs for voice quality? That answer is usually dictated by your audio conditions and your concurrency requirements.
When ElevenLabs' Integrated Platform Is the Right Call
ElevenLabs as a complete platform makes sense when you're prioritizing voice expressiveness and fast deployment. If your agent handles standard English in controlled audio conditions, runs within your plan's concurrency limits, and voice quality is a key differentiator, the integrated platform gets you to production quickly. The built-in RAG, tool calling, and telephony integrations mean less glue code.
When to Decouple STT from TTS
A decoupled architecture delivers better results in high-volume contact centers, noisy telephony audio, bilingual or code-switching callers, and compliance scenarios requiring on-premises processing. In these scenarios, pairing a purpose-built STT layer with ElevenLabs' TTS gives you the control you need.
A Production Reference Point: Five9-Style Self-Service Economics
If you want a concrete signal for whether "integrated and convenient" is enough, look at deployments where self-service is the product, not a demo feature. The Five9 case study describes voice workflows where improving recognition quality can move real business outcomes, including doubled user authentication rates and increased self-service success.
That type of workload tends to stress exactly the things that are hard to infer from an ElevenLabs demo:
- Noisy, compressed audio: authentication and intent capture often happen over telephony-grade audio where background noise and packet loss are normal, not edge cases.
- Turn-taking sensitivity: callers hesitate while looking for account info; endpointing that is too eager creates false interruptions, while endpointing that is too patient slows the experience and increases abandonments.
- Scale and variance: peak traffic is unpredictable, and you have to budget for tail latency and throttling behavior, not just median performance.
In other words, once you are in "self-service success" territory, you are not just optimizing for WER. You are optimizing for overall containment and for a turn-by-turn experience that does not force a spillover to a human agent.
Get Started with Deepgram
Deepgram's Voice Agent API is built for production voice agents where STT accuracy, endpointing reliability, and orchestration need to work together under real-world conditions. The API supports bring-your-own-TTS, so you can use ElevenLabs' voices for expressiveness while Deepgram handles transcription and turn detection.
If you want the full stack from Deepgram, you can also use our Text-to-Speech voices for professional clarity.
Ready to test it with your own audio? Try the Deepgram Console with $200 in free credits, no credit card required.
FAQ
These quick answers cover the deployment questions that usually come up after a first demo: telephony quirks, what "realtime" STT means in practice, language switching, mixing providers, and what HIPAA teams usually miss.
Can You Use ElevenLabs Voice Agents for Phone Calls?
Yes, but plan for telephony details you won’t see in a browser demo: 8 kHz audio (often μ-law), packet loss, and talk-over. Also decide how you’ll handle DTMF ("press 1") and early hangups: your app should treat a dropped media stream as a first-class state transition and stop TTS immediately.
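The "first-class state transition" point can be made concrete with a tiny state machine. This is a sketch of the shape of the logic (the class and method names are assumptions, not a telephony SDK):

```python
class CallSession:
    """Tiny call-lifecycle state machine: a dropped media stream is its own
    transition that immediately stops TTS instead of talking into a dead line."""
    def __init__(self):
        self.state = "active"
        self.tts_streaming = False
        self.dtmf_digits = []

    def start_tts(self):
        if self.state == "active":
            self.tts_streaming = True

    def on_dtmf(self, digit):
        # "Press 1" input arrives out-of-band from the speech stream.
        if self.state == "active":
            self.dtmf_digits.append(digit)

    def on_media_stream_dropped(self):
        # Early hangup: stop synthesizing immediately.
        self.state = "ended"
        self.tts_streaming = False
```

The key property is that once `on_media_stream_dropped` fires, no later `start_tts` call can resume playback, so you never pay for synthesis into a hung-up call.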
What Is ElevenLabs Scribe and Can It Be Used for Real-Time Voice Agents?
Scribe (batch) is for files; agents use the realtime streaming model. In implementation terms, what matters is how partial hypotheses behave: you’ll want to ignore unstable interim text, wait for a “final” segment boundary, and keep your tool-calling logic from triggering twice when the transcript revises itself.
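A small gate in your transcript handler covers both concerns. The sketch below assumes a stream that delivers `(text, is_final)` results, which is the common shape for streaming STT APIs; names are illustrative:

```python
class TranscriptGate:
    """Only release text downstream on final segments, and dedupe finals so
    a revised transcript can't trigger the same tool call twice."""
    def __init__(self):
        self.released = []
        self._seen_finals = set()
        self.pending_interim = ""

    def on_result(self, text, is_final):
        if not is_final:
            self.pending_interim = text  # display only; never act on this
            return None
        self.pending_interim = ""
        if text in self._seen_finals:  # duplicate final after a revision
            return None
        self._seen_finals.add(text)
        self.released.append(text)
        return text
```

Only the return value of `on_result` should ever reach your LLM or tool-calling layer; interim text is for UI feedback at most.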
How Does ElevenLabs Handle Callers Who Switch Languages Mid-Conversation?
Expect to build routing logic. A common pattern is: detect language per user turn, store two conversation states (one per language), then synthesize per-language responses as separate TTS calls. Also decide what to do on mixed-language turns: either pick a dominant language or fall back to spelling mode for named entities.
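That routing pattern fits in a few lines. The detector below is a crude marker-word stand-in used only to make the sketch runnable; in production you would swap in a real language-identification model:

```python
def detect_language(turn_text):
    """Stand-in language detector (assumption: replace with a real
    language-ID model). Crude Spanish marker-word check for the sketch."""
    spanish_markers = {"cómo", "está", "gracias", "hola", "qué"}
    return "es" if set(turn_text.lower().split()) & spanish_markers else "en"

def route_turn(turn_text, states):
    """Keep one conversation state per language and pick the TTS language
    per user turn, so each synthesis call stays monolingual."""
    lang = detect_language(turn_text)
    states.setdefault(lang, []).append(turn_text)
    return lang  # downstream: synthesize the reply with this language's voice
```

The per-language `states` dict is what keeps context coherent when a caller flips back to their first language three turns later.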
Can I Use ElevenLabs TTS with a Different STT Provider?
Yes, and it’s often the cleanest architecture. The main extra work is barge-in: you need a “playback cancel” path so that the moment new user audio arrives, you stop streaming agent audio, flush buffers, and restart TTS on the next response to avoid talking over the caller.
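The cancel path is mostly bookkeeping, and getting it right once saves a lot of talk-over bugs. A minimal sketch (class and method names are assumptions for illustration):

```python
import threading

class PlaybackController:
    """Sketch of a barge-in cancel path: TTS chunks stream only while the
    cancel event is clear; new user audio flips the event and flushes buffers."""
    def __init__(self):
        self.cancel = threading.Event()
        self.buffer = []

    def stream_tts_chunk(self, chunk):
        if self.cancel.is_set():
            return False  # playback cancelled; drop the chunk
        self.buffer.append(chunk)
        return True

    def on_user_audio(self):
        # Barge-in: stop sending agent audio and flush anything queued.
        self.cancel.set()
        self.buffer.clear()

    def start_new_response(self):
        # The next agent turn starts with a clean cancel flag.
        self.cancel.clear()
```

Using an `Event` means the audio-egress thread can check cancellation cheaply on every chunk, while the ingest thread that detects user speech only ever sets the flag.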
Can ElevenLabs Voice Agents Be Deployed On-Premises for HIPAA Compliance?
Not on physical on-prem hardware. In practice, HIPAA reviews usually hinge on boundaries: where audio is decrypted, whether transcripts or recordings are stored, and what lands in logs. Even with a BAA and Zero Retention Mode, you should verify your own call recording, observability, and error logging don’t accidentally persist PHI outside your approved system of record.

