Top APIs for Building Programmable Voice Agents: What the Stack Looks Like in Production
Building a production voice agent means choosing providers across four layers: speech-to-text (STT), text-to-speech (TTS), LLM orchestration, and telephony. Providers differ sharply in layer coverage, latency, cost, and compliance terms.
The eight providers below cover different parts of that stack, and most still require external components.
Key Takeaways
Every provider on this list forces tradeoffs between layer coverage, bundling, and compliance. Here's how those choices play out across the eight providers:
- Retell AI and Bland AI are the only providers on this list that bundle all four layers in a managed platform. Every other provider requires at least one external integration.
- Deepgram, Retell AI, Bland AI, and OpenAI bundle multiple layers. Twilio and Pipecat cover one or two. Vapi owns orchestration and telephony, and LiveKit owns transport and orchestration. Both leave STT, TTS, and LLM pluggable.
- HIPAA BAA requires an enterprise agreement with most providers on this list. LiveKit is the exception: its self-serve Scale plan ($500/mo) includes a signed BAA.
- Open-source options like Pipecat and LiveKit give you full pipeline control, but compliance responsibility shifts to your team.
Provider Comparison at a Glance
Eight APIs, eight different coverage footprints. The table below tracks what each provider owns in production and what it passes through to third parties.
What the Table Covers
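Columns cover the flagship service, streaming protocol, concurrency limits, pricing model, HIPAA terms, self-hosted deployment, and best fit. Whether a provider owns a layer or leaves it pluggable, which the layer-by-layer table later in this article spells out, drives your integration complexity and vendor count.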
What to Watch First
Prioritize protocol, pricing model, and deployment control. Those three usually determine how much integration work you'll own later.
| Provider | Flagship model / service | Streaming protocol | Concurrency limits | Pricing model | HIPAA | Self-hosted deployment | Best fit |
|---|---|---|---|---|---|---|---|
| Deepgram | Voice Agent API | Managed real-time voice interaction API | Plan details at Deepgram pricing | See Deepgram pricing | Enterprise only via HIPAA deployments | Yes | If you want bundled speech plus deployment flexibility |
| Twilio | Media Streams | WebSockets for raw call audio | One bidirectional stream per call; unidirectional supports up to four tracks | Per-minute (call rates) | Product-specific per security overview | No | If you want carrier-grade telephony with external STT, TTS, and LLMs |
| Vapi | Managed orchestration platform | Managed provider streaming through Vapi | Check plan terms | Per-minute (component bundle) | Confirm with vendor | No | If you want fast rollout with pluggable providers |
| Pipecat | Open-source Python framework | Depends on your chosen transports and providers | Depends on your infrastructure | Free (OSS) | N/A (deployer owns compliance) | Yes | If you want full pipeline control and self-hosting |
| LiveKit | Agents SDK | WebRTC and SIP | Check plan terms | Usage-based / Free (OSS) | Scale/Enterprise via security page | Yes | If you're building WebRTC-native or multi-participant voice apps |
| Retell AI | Managed voice agent platform | Managed platform pipeline | 20 free slots; above that billed monthly per Retell pricing | Per-minute (modular add-ons) | Enterprise only | Check plan terms | If you want no-code plus API access |
| Bland AI | Managed voice API | Managed platform pipeline | Start: 10; Build: 50; Scale: 100; Enterprise: unlimited per Bland pricing | Per-minute (all-inclusive) | Enterprise only | Enterprise only | If you run high-volume inbound and outbound calling |
| OpenAI Realtime | GPT-Realtime-2 | Realtime speech-to-speech API connection | Check plan terms | Per-token (audio) | Must request via security and privacy | No | If you want browser or app voice experiences with GPT-native reasoning |
How the Programmable Voice Agent Stack Works
Separate the layers before deciding whether to buy bundled or assemble a cascade stack. Getting this wrong early means rearchitecting later.
Each provider in this list covers one or more of those layers, and the gaps determine your integration work.
The Four Layers
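A cascade voice agent pipeline follows the same flow regardless of provider: STT converts speech into text, an LLM processes that text and generates a response, then TTS converts the response back into speech. Speech-to-speech models such as OpenAI Realtime collapse those stages into a single model.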
Telephony handles the phone connection, SIP trunking, and call routing. Some providers also add orchestration to manage handoffs between STT, LLM, and TTS.
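The STT, LLM, and TTS flow described above can be sketched with three stub stages. This is a minimal illustration, not any provider's API: `transcribe`, `generate_reply`, `synthesize`, and `run_turn` are hypothetical names, and each stub stands in for what would be a network call to a real provider.

```python
from typing import Callable

# Hypothetical stand-ins for real providers (assumptions for illustration only).
def transcribe(audio: bytes) -> str:
    """STT stub: pretend the audio bytes are UTF-8 text."""
    return audio.decode("utf-8")

def generate_reply(text: str) -> str:
    """LLM stub: echo the caller's words back in a fixed template."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """TTS stub: pretend UTF-8 bytes are synthesized audio."""
    return text.encode("utf-8")

def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str] = transcribe,
    llm: Callable[[str], str] = generate_reply,
    tts: Callable[[str], bytes] = synthesize,
) -> bytes:
    """One conversational turn: speech -> text -> response text -> speech."""
    return tts(llm(stt(audio_in)))
```

Because each stage is passed in as a plain callable, swapping one provider means replacing one argument. That is the flexibility a cascade stack buys you.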
Bundled APIs vs. Cascade Stacks
You have two architectural choices. Bundled APIs combine multiple layers into a single connection. They simplify integration and reduce failure points, but you lose flexibility when you want to swap one component.
Cascade stacks let you choose the provider for each layer independently. They give you maximum flexibility but also force you to manage more vendor relationships, API keys, and latency budgets.
Where Each Provider in This List Fits
The split between owned ("Yes" or "Bundled"), "Pluggable," and "No" in the table below determines how many vendors you'll manage and how much integration you'll build.
| Provider | STT | TTS | LLM | Orchestration | Telephony | Pricing Model | Open Source | Self-Hosted Option | HIPAA BAA |
|---|---|---|---|---|---|---|---|---|---|
| Deepgram | Yes | Yes | BYO | Yes | No | See Deepgram pricing | No | Yes | Enterprise only |
| Twilio | No | No | No | No | Yes | Per-minute (call rates) | No | No | Product-specific |
| Vapi | Pluggable | Pluggable | Pluggable | Yes | Yes | Per-minute (component bundle) | No | No | Confirm with vendor |
| Pipecat | Pluggable | Pluggable | Pluggable | Yes | Via serializers | Free (OSS) | Yes | Yes | N/A (deployer owns) |
| LiveKit | Pluggable | Pluggable | Pluggable | Yes | Yes (SIP) | Usage-based / Free (OSS) | Yes | Yes | Scale/Enterprise |
| Retell AI | Bundled | Bundled | Bundled | Yes | Yes | Per-minute (modular add-ons) | No | Check plan terms | Enterprise only |
| Bland AI | Bundled | Bundled | Bundled | Yes | Yes | Per-minute (all-inclusive) | No | Enterprise only | Enterprise only |
| OpenAI Realtime | Bundled (S2S) | Bundled (S2S) | Bundled | No | No | Per-token (audio) | No | No | Must request |
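To turn the table into a quick integration checklist, you could encode it as data and list the layers each provider leaves to you. The sketch below copies the cell values from the table. The rule that only cells starting with "Yes" or "Bundled" count as owned is an assumption made for illustration, and `external_layers` is a hypothetical helper name.

```python
LAYERS = ("STT", "TTS", "LLM", "Orchestration", "Telephony")

# Cell values copied from the coverage table above.
COVERAGE = {
    "Deepgram": ("Yes", "Yes", "BYO", "Yes", "No"),
    "Twilio": ("No", "No", "No", "No", "Yes"),
    "Vapi": ("Pluggable", "Pluggable", "Pluggable", "Yes", "Yes"),
    "Pipecat": ("Pluggable", "Pluggable", "Pluggable", "Yes", "Via serializers"),
    "LiveKit": ("Pluggable", "Pluggable", "Pluggable", "Yes", "Yes (SIP)"),
    "Retell AI": ("Bundled", "Bundled", "Bundled", "Yes", "Yes"),
    "Bland AI": ("Bundled", "Bundled", "Bundled", "Yes", "Yes"),
    "OpenAI Realtime": ("Bundled (S2S)", "Bundled (S2S)", "Bundled", "No", "No"),
}

def external_layers(provider: str) -> list[str]:
    """Layers the provider does not own outright, under the assumed rule
    that only 'Yes...' and 'Bundled...' cells count as owned."""
    owned_prefixes = ("Yes", "Bundled")
    return [
        layer
        for layer, value in zip(LAYERS, COVERAGE[provider])
        if not value.startswith(owned_prefixes)
    ]
```

Under that rule, `external_layers("Twilio")` lists every layer except telephony, while `external_layers("Retell AI")` comes back empty.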
Deepgram
Deepgram covers the speech layer and orchestration in one API, with bundled pricing and self-hosted deployment options.
Its Voice Agent API handles STT, TTS, and LLM orchestration through a single WebSocket connection. You can also bring your own LLM, TTS provider, or both.
What It Covers
STT options include Flux (Deepgram's latest streaming model with model-native turn detection, recommended for voice agents) and Nova-3 (highest-accuracy general-purpose ASR). TTS runs on Aura-2. Keyterm Prompting lets you boost recognition of domain-specific terms without retraining, up to a technical limit of 500 tokens across all keyterms.
Pricing and Deployment
Deployment options include managed cloud, VPC, and self-hosted on-premises installations. BAA terms are handled through sales and enterprise agreements.
When to Choose It
This stack makes sense if you want a single vendor for the speech layer and bundled pricing. Self-hosted options also fit when you need tighter control over the data path.
Twilio
Twilio is the telephony layer. If you need carrier-grade calling, you'll bridge audio to external STT, TTS, and LLM providers yourself.
It's the most common telephony foundation in cascade stacks, so if you're already on the platform, the question is which speech providers to bridge in.
What It Covers
The coverage here is strictly telephony: PSTN connectivity, SIP trunking, call control, and audio transport via Media Streams. External providers handle speech processing, response generation, and speech synthesis.
How the Bridge Pattern Works
Media Streams expose raw call audio as mulaw/8000, base64-encoded, over WebSockets to a server you control. Your server bridges the JSON-based protocol to an external STT provider like Deepgram for transcription. If you've worked with WebSocket audio before, the pattern is familiar. For bidirectional voice agents, the <Connect><Stream> TwiML wrapper lets you receive caller audio and send synthesized speech back into the call.
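The message handling inside that bridge server can be sketched as follows. This is a minimal sketch, not a full server: the WebSocket plumbing and the STT client are omitted, and the helper names (`extract_caller_audio`, `build_outbound_media`, `mulaw_to_pcm16`) are hypothetical. It assumes the documented Media Streams message shape, where `media` events carry base64 audio in `media.payload` and outbound messages are addressed by `streamSid`.

```python
import base64
import json
from typing import Optional

def mulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF
    magnitude = ((((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)) - 0x84
    return -magnitude if u & 0x80 else magnitude

def extract_caller_audio(message: str) -> Optional[bytes]:
    """Return raw mu-law bytes from a `media` event, or None for other events."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # e.g. "connected", "start", "stop"
    return base64.b64decode(event["media"]["payload"])

def build_outbound_media(stream_sid: str, mulaw_audio: bytes) -> str:
    """Wrap synthesized mu-law/8000 audio in a message to send back into the call."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })
```

Some STT providers accept mu-law at 8 kHz directly, in which case you can forward the decoded bytes as-is. `mulaw_to_pcm16` is only needed when a downstream component expects linear PCM.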
When to Choose It
Choose Twilio if you already run on its calling stack and want to keep speech providers separate. The tradeoff is operational: you'll build and maintain the bridge server yourself.
Vapi
Vapi manages orchestration and telephony and leaves model choice open. It's the fastest path to a working voice agent when you don't want to wire providers together yourself.
You pick the providers; Vapi handles everything in between.
What It Covers
The platform owns orchestration, scaling, and telephony. Everything else is pluggable: you configure STT, TTS, and LLM providers by API or dashboard JSON using your own API keys.
Supported Providers
STT and TTS options include Deepgram (confirmed via Deepgram partners), plus multiple alternatives for each layer. LLM support covers OpenAI, Anthropic, Google, Groq, and others. Deepgram's Keyterm Prompting may be available as a passthrough configuration.
When to Choose It
This is the right pick if you want fast time-to-production without managing infrastructure. You trade pipeline control for speed, since the platform handles the plumbing.
Pipecat
Pipecat gives you the most control and the most responsibility. This is for teams who want to assemble and host the whole pipeline and are comfortable owning the debugging when something breaks at 2 AM.
Built and maintained by the Pipecat community, this open-source Python framework gives you full pipeline-level control over every component in the voice agent stack.
What It Covers
Orchestration only. You assemble STT, LLM, TTS, and transport providers yourself in Python.
Supported Providers
STT and TTS integrations include AssemblyAI, OpenAI, ElevenLabs, Cartesia, and many others. Transport options include Daily (WebRTC), FastAPI WebSocket, and SmallWebRTCTransport. Telephony serializers cover Twilio, Telnyx, and Plivo.
When to Choose It
This is the right stack if you need low-level pipeline control, custom endpointing, or non-standard component arrangements. Since you self-host everything, compliance certifications are your responsibility.
LiveKit
LiveKit owns transport and orchestration, with STT, TTS, and LLM coming through plugins. It's the strongest option for WebRTC-native apps, multi-participant systems, or self-hosted real-time infrastructure.
The core is open-source WebRTC infrastructure. LiveKit's Agents SDK adds the AI layer for real-time voice and video applications.
What It Covers
LiveKit owns the transport layer, turn detection, noise and echo cancellation, pipeline orchestration, and SIP telephony. STT, TTS, and LLM come through plugins.
Self-Hosted vs. Cloud
The open-source server supports self-hosting on VMs, Kubernetes, or distributed multi-region setups. LiveKit Cloud adds managed agent deployments, global scaling, session observability, and LiveKit Inference. The same agent code runs on self-hosted and Cloud without modification. HIPAA BAA is available on Scale and Enterprise plans.
When to Choose It
Infrastructure-level control is the draw here, especially for WebRTC-native or multi-participant voice apps. The free Build plan also makes it practical for prototyping.
Retell AI
Retell AI handles all four layers in one managed platform, with modular billing that separates infrastructure, TTS, LLM, and telephony costs.
Both no-code and API paths are available, so the platform works for teams with different technical depths.
What It Covers
The platform bundles STT, LLM orchestration, TTS, and telephony. The orchestration layer adds intelligent endpointing, turn-taking, interruption handling, backchanneling, and voicemail detection. Built-in telephony integrates with Twilio and Telnyx. You can also bring your own carrier via SIP trunking at no additional charge.
Pricing Model
According to Retell's pricing page, billing is modular and per-minute. You pay separately for the voice infrastructure base, your chosen TTS provider, your chosen LLM, and telephony.
When to Choose It
A good fit if you want a managed platform with both no-code and API paths. HIPAA BAA is available on enterprise plans.
Bland AI
Bland AI is the bundled option for high call volumes and structured call flows. All-in pricing pairs with detailed flow control.
The API is developer-first, covering both outbound and inbound voice agents from a single integration point.
What It Covers
Bland bundles STT, LLM, TTS, and telephony into a single per-minute rate. Conversational Pathways is a node-based call flow builder. It lets you design branching logic with variable extraction, API calls at specific flow points, and conditional routing.
Pricing Model
All-inclusive per-minute pricing covers the full stack, with plan details and platform fees listed on Bland's pricing page. Enterprise plans offer unlimited concurrent calls and on-premises deployment. HIPAA BAA is available on enterprise plans.
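Concurrency limits matter as much as the per-minute rate when you size a plan. A rough estimate follows from Little's law: average calls in progress equal the call arrival rate times the average call duration. The sketch below applies that to the plan limits listed in the comparison table earlier in this article. The function names are hypothetical, and you should confirm the limits against Bland's pricing page before relying on them.

```python
import math

# Concurrency limits as listed in the comparison table (verify before use).
PLAN_LIMITS = [
    ("Start", 10),
    ("Build", 50),
    ("Scale", 100),
    ("Enterprise", math.inf),
]

def expected_concurrency(calls_per_hour: float, avg_call_seconds: float) -> float:
    """Little's law: average calls in progress = arrival rate x average duration."""
    return calls_per_hour * avg_call_seconds / 3600

def smallest_plan(peak_concurrency: float) -> str:
    """Return the first plan whose concurrency limit covers the given peak."""
    for name, limit in PLAN_LIMITS:
        if peak_concurrency <= limit:
            return name
    return "Enterprise"
```

For example, 1,200 calls per hour at an average of three minutes each works out to 60 calls in progress, which lands on the Scale tier. Little's law gives an average, so leave headroom for bursts above it.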
When to Choose It
Best for high-volume outbound campaigns where you need precise API-level control over call flows.
OpenAI Realtime API
OpenAI Realtime is the speech-to-speech option in this list. It makes sense when you want GPT-native reasoning and can manage token-based audio costs carefully.
It's the only provider here that skips the traditional STT-LLM-TTS cascade, which changes both the latency profile and the cost structure.
What It Covers
The Realtime API bundles STT, reasoning, and TTS into a single connection. GPT-Realtime-2 processes audio input and emits audio output without a separate text stage. For phone calls, telephony comes from Twilio, LiveKit, or another transport provider.
Pricing and Cost Pattern
OpenAI's API pricing page lists token rates and audio billing details. Token-based cost rises with denser, longer speech activity, so intermittent usage tends to be more economical.
When to Choose It
Best for browser-first or app-embedded voice experiences where you want GPT-native reasoning. The per-token model rewards intermittent speech patterns.
How to Pick the Right Stack for Your Use Case
Choose bundled providers when you want fewer integrations, and cascade stacks when you want component control. In practice, most decisions come down to three things: control, compliance, and cost predictability.
Your choice also depends on the deployment model and transport. Phone, browser, and self-hosted use cases push you toward different combinations.
Decision Criteria by Use Case
- Contact center automation: Use Twilio for telephony plus Deepgram Voice Agent API for speech and orchestration. You keep your phone infrastructure and add voice agent capabilities through the WebSocket bridge pattern.
- Healthcare voice agents: Deepgram's self-hosted deployment keeps audio on-premises, and you can pair it with LiveKit in self-hosted mode if you need browser-based patient interactions. Deepgram offers HIPAA BAA through enterprise agreements, and LiveKit includes a signed BAA starting at its Scale plan.
- Outbound sales at scale: Bland AI gives you all-inclusive pricing and structured call flows. If you want modular billing and more flexibility in model selection instead, Retell AI is the better fit.
- Consumer app with GPT reasoning: OpenAI Realtime API handles speech-to-speech, and you can connect it through LiveKit for WebRTC transport. Just budget carefully, because per-token costs rise with higher talk-time density.
When Bundled Beats Cascade
Bundled APIs like Deepgram's Voice Agent API give you cost predictability and fewer failure points, since one connection means one latency budget to manage.
Cascade stacks, on the other hand, give you more provider flexibility: you can swap STT or TTS providers independently. The tradeoff is operational complexity, especially when you're debugging latency spikes across three or four vendor hops.
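One way to keep that debugging tractable is to write the latency budget down per hop. The sketch below uses made-up numbers purely to show the bookkeeping. None of the figures are measurements from any provider, and the hop names are hypothetical. It also assumes hops run strictly in sequence, whereas streaming pipelines overlap stages in practice.

```python
# Hypothetical per-hop latencies in milliseconds for one conversational turn.
CASCADE_HOPS_MS = {
    "telephony_transport": 80.0,
    "stt": 200.0,
    "llm_first_token": 350.0,
    "tts_first_audio": 150.0,
}

def total_latency_ms(hops: dict[str, float]) -> float:
    """Sum of all hops, assuming they run one after another."""
    return sum(hops.values())

def slowest_hop(hops: dict[str, float]) -> str:
    """Name of the hop contributing the most latency."""
    return max(hops, key=hops.get)

def within_budget(hops: dict[str, float], budget_ms: float) -> bool:
    """True when the summed latency fits the response-time target."""
    return total_latency_ms(hops) <= budget_ms
```

Replacing the placeholder values with your own measured percentiles per vendor shows which hop to renegotiate or swap first.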
Next Steps
These APIs solve different architectural problems. The right stack depends on your use case, compliance needs, and cost model preferences.
If you want to test Deepgram's STT, TTS, or Voice Agent API, sign up for free with $200 in credits and run your own benchmarks against production audio.
FAQ
What's the Difference Between a Voice Agent API and a Voice Agent Platform?
A voice agent API gives you direct control over pipeline components. A platform adds managed infrastructure and often no-code tooling. You get speed with a platform, control with an API.
Can You Mix Providers from Different Layers of the Stack?
Yes. Pipecat and LiveKit are built for that pattern. The main constraint is latency across provider hops and the extra operational work.
How Much Does It Cost to Run a Voice Agent at 10,000 Hours per Month?
It depends on your architecture. Bundled per-minute pricing is easier to forecast. Per-token pricing varies more with speech density and session behavior, so model your actual call patterns before committing.
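A rough way to model that difference is sketched below. Every rate you pass in is your own placeholder, not a real price from any provider. The single blended token rate is a simplification, since real per-token pricing may bill input and output audio differently.

```python
HOURS_PER_MONTH = 10_000
MINUTES = HOURS_PER_MONTH * 60  # 600,000 connected minutes

def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Bundled pricing: cost scales with connected time only."""
    return minutes * rate_per_minute

def per_token_cost(
    minutes: float,
    talk_fraction: float,            # share of connected time with audio flowing
    tokens_per_audio_minute: float,  # assumed tokenization rate
    rate_per_million_tokens: float,  # assumed blended input/output rate
) -> float:
    """Token pricing: cost scales with how much of the session is speech."""
    tokens = minutes * talk_fraction * tokens_per_audio_minute
    return tokens / 1_000_000 * rate_per_million_tokens
```

With the other inputs held fixed, doubling `talk_fraction` doubles the per-token bill while the per-minute bill stays flat. That is why dense conversations favor per-minute pricing and intermittent ones favor per-token pricing.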
Which Voice Agent APIs Support HIPAA Compliance?
Deepgram, Retell AI, and Bland AI offer HIPAA BAA through enterprise agreements. LiveKit includes a signed BAA on its self-serve Scale plan as well as on Enterprise.
Twilio's coverage is product-specific, and OpenAI's BAA must be requested separately. With Pipecat, your infrastructure owns compliance.
Do You Need a Separate Telephony Provider with Every Voice Agent API?
Not always. Vapi, Retell AI, Bland AI, and LiveKit include telephony options. Deepgram Voice Agent API and OpenAI Realtime require a separate phone or transport layer.









