Voice Agent API Architecture: Bundled vs Assembled

Key Takeaways
How Voice Agent Architectures Actually Work in Production
Cascaded pipeline
Speech-to-speech models
Bundled APIs
Why Pricing Model Is an Architectural Decision
Mixed billing units
Cost variability
Design pressure
What Bundled Pricing Changes in Your Architecture
Less middleware
Different scaling model
Customization limits
Scaling Math: Assembled vs. Bundled at Production Volume
Cost shape at volume
Production examples
Latency budget
Choosing Your Voice Agent Architecture
Start bundled first
Use three filters
Test the Full Loop Yourself
FAQ
What is a bundled voice agent API?
How does bundled pricing compare with per-component billing at scale?
Can you use your own LLM or TTS with a bundled voice agent API?
What latency should you expect from a bundled voice agent architecture?
When should you build your own STT, LLM, and TTS pipeline instead of using a bundled API?

If you're choosing between an assembled stack and a bundled voice agent API, pricing changes more than your bill. It changes your architecture, your failure modes, and how much integration work your team owns.

An assembled DIY stack can run at a relatively low per-minute cost, according to published rate-card analysis of a best-of-breed stack that combines separate STT, LLM, and TTS providers with a telephony provider such as Twilio.

Route those same components through a managed orchestration layer, though, and the all-in rate can rise materially. The gap between those two rates is an architectural choice, not a billing footnote.

Key Takeaways

Here's what you need to know about voice agent pricing and architecture:

Assembled stacks range from relatively low per-minute cost at raw component rates to expensive production deployments once orchestration and infrastructure overhead are added.
Bundled APIs reduce vendor sprawl by hiding more of the voice loop behind one interface.
Streaming STT is the biggest latency decision: streaming pipelines hit sub-second time-to-first-audio, while batch transcription runs multiple seconds slower.
Assembled stacks win when you need custom ASR, self-hosted models, or strict component control.
Speech-to-speech models can be fast, but they still trade off against tracing and tool control.

How Voice Agent Architectures Actually Work in Production

Every voice agent runs the same loop. Speech-to-text captures what the caller says, an LLM generates a response, and speech synthesis speaks it back. How you connect those parts determines cost, latency, and operational drag.

Cascaded pipeline

The most common production setup chains three services. Audio flows from a microphone to an STT provider over WebSocket. The transcript hits an LLM API. The LLM response then routes to a TTS provider, which streams audio back.

You pick each provider independently, and you manage each one independently too. In practice that's three API keys, three billing dashboards, three rate-limit policies, and three error-handling paths to babysit.

A peer-reviewed benchmark measured this setup at 755ms time-to-first-audio, but only when all three layers stream. Streaming end to end is what pulls the pipeline into conversational territory.
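
The shape of a cascade that streams end to end can be sketched with chained async generators. This is a minimal illustration, not any provider's SDK: stt_stream, llm_stream, tts_stream, and microphone are hypothetical stand-ins that only pass strings and bytes along, so the structure of the chain is visible without network calls.

```python
import asyncio
from typing import AsyncIterator, List

# Hypothetical stand-ins for three streaming providers. No real SDK is called.

async def microphone() -> AsyncIterator[bytes]:
    """Pretend audio source: one chunk per spoken word."""
    for word in ["book", "a", "table"]:
        yield word.encode("utf-8")

async def stt_stream(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Pretend STT: emit a transcript word as each audio chunk arrives."""
    async for chunk in audio:
        yield chunk.decode("utf-8")

async def llm_stream(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Pretend LLM: wait for the end of the utterance, then stream tokens."""
    transcript = [w async for w in words]
    for token in ["You", "said:", *transcript]:
        yield token

async def tts_stream(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Pretend TTS: synthesize each token as soon as it arrives."""
    async for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")

async def run_turn() -> List[bytes]:
    """Chain the three stages; audio is produced before the LLM finishes."""
    out = []
    async for audio in tts_stream(llm_stream(stt_stream(microphone()))):
        out.append(audio)
    return out

first_turn = asyncio.run(run_turn())
```

Because each stage yields as soon as it has something, the first audio chunk is ready after the first LLM token rather than after the full response. In a real assembled stack, each of these generators would wrap a different vendor connection with its own credentials and error handling.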

Speech-to-speech models

Speech-to-speech models collapse the pipeline. They take audio input and return audio output from one model, without a text intermediate stage.

Measured task-completion latency for these models runs from roughly 4 to 8 seconds in recent benchmarking. The result depends on benchmark design and task complexity. Speech-to-speech models also preserve tone and prosody that text intermediates discard.

The trade-off is control. Tool-calling reliability remains unsettled in single-stage models. For production agents that need strong function calling or self-hosted deployment, cascaded pipelines still look more proven.

Bundled APIs

A bundled Voice Agent API packages STT, LLM orchestration, and TTS behind one WebSocket endpoint. You send audio in and get audio back. The provider handles more of the interaction layer instead of asking you to wire every component yourself.

Deepgram publishes current tiers, BYO options, and billing details on its pricing page. BYO LLM and BYO TTS tiers let you swap some components while keeping the orchestration layer. That's a trade: less low-level control, less integration pain.

Why Pricing Model Is an Architectural Decision

Pricing changes what you build. It determines how many services you manage, where latency piles up, and how much forecasting work lands on your team.

Mixed billing units

When you assemble providers, you inherit billing in different units. STT bills per audio minute. LLMs bill per token, with input and output priced separately. TTS bills per character.

Turning that into one cost per conversation minute takes real modeling work. You have to account for turn depth, token volume, and how much your agent actually talks.
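
As a rough sketch of that modeling work, the function below folds the three billing units into one cost per conversation minute. Every rate and usage figure here is a made-up placeholder, not a quote from any provider, and the model assumes the STT stream stays open (and billed) for the whole minute.

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """Hypothetical rate card in USD. Replace with your providers' real rates."""
    stt_per_audio_min: float
    llm_in_per_1k_tokens: float
    llm_out_per_1k_tokens: float
    tts_per_1k_chars: float

@dataclass
class Usage:
    """Assumed conversation shape for one minute of talk time."""
    turns_per_min: float
    input_tokens_per_turn: float
    output_tokens_per_turn: float
    agent_chars_per_turn: float

def cost_per_conversation_minute(r: Rates, u: Usage) -> float:
    # STT: billed per audio minute, assumed open for the full minute.
    stt = r.stt_per_audio_min
    # LLM: billed per token, input and output priced separately.
    llm = u.turns_per_min * (
        u.input_tokens_per_turn / 1000 * r.llm_in_per_1k_tokens
        + u.output_tokens_per_turn / 1000 * r.llm_out_per_1k_tokens
    )
    # TTS: billed per character the agent speaks.
    tts = u.turns_per_min * u.agent_chars_per_turn / 1000 * r.tts_per_1k_chars
    return stt + llm + tts

rates = Rates(0.01, 0.001, 0.002, 0.02)   # placeholder numbers only
usage = Usage(4, 1500, 100, 200)          # placeholder numbers only
per_minute = cost_per_conversation_minute(rates, usage)
```

With these placeholder inputs the LLM and TTS terms depend entirely on turn depth and agent verbosity, which is why the same rate card can produce very different per-minute costs for different agents.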

Cost variability

Production measurements show how wide the range can be. A simple agent using GPT-4.1 mini can keep LLM fees low. A complex agent that chains many tool calls across a turn can drive inference costs much higher, since each call carries its own accumulating context.

The pipeline can stay the same while the bill changes dramatically. So pricing isn't a finance question you settle after the build. It's an architecture question you settle before it.
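
To see why chained tool calls inflate the bill, here is a small sketch of how input tokens accumulate within a turn. It assumes a simplified model in which every LLM invocation re-sends the base context plus the results of all earlier calls; the token counts are illustrative placeholders.

```python
def turn_input_tokens(base_context: int, tokens_added_per_call: int, llm_calls: int) -> int:
    """Total input tokens billed for one turn.

    Invocation k (starting at 0) re-sends the base context plus the
    output of the k earlier calls, so the total grows quadratically
    with the number of invocations.
    """
    return sum(base_context + k * tokens_added_per_call for k in range(llm_calls))

simple_turn = turn_input_tokens(base_context=1000, tokens_added_per_call=500, llm_calls=1)
complex_turn = turn_input_tokens(base_context=1000, tokens_added_per_call=500, llm_calls=6)
```

Under these assumptions, six invocations cost 13,500 input tokens against 1,000 for a single one, so a 6x increase in calls produces a 13.5x increase in billed input.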

Design pressure

Token-based LLM pricing scales with conversation complexity, not duration. TTS character billing scales with how much your agent talks. Bundled time-based billing is easier to forecast because you don't need to model token distributions every month.

That predictability changes design behavior. With per-token billing, you'll usually feel pressure to keep prompts, tool calls, and outputs lean. Inside a bundled API, you give up some tuning freedom in exchange for simpler operations.

What Bundled Pricing Changes in Your Architecture

Bundled pricing takes a lot of engineering chores off your plate, but it does so by narrowing how much you can customize. You're trading low-level control for less integration work.

Less middleware

Assembled stacks need middleware for each failure mode. Deepgram's workflow guidance describes the kinds of logic you'd otherwise write yourself: retries, timeout padding, partial transcript fallbacks, and STT backoff queues.

A bundled API reduces how much of that coordination sits in your app layer. Your code gets smaller. Your debugging tree usually does too.
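
For a sense of what that middleware looks like, here is a minimal sketch of one piece of it: a per-attempt timeout with exponential backoff around a provider call. The helper name call_with_retries and the flaky_tts stand-in are hypothetical, and the delays are deliberately tiny so the example runs quickly.

```python
import asyncio

async def call_with_retries(make_call, *, attempts=3, timeout_s=0.5, base_delay_s=0.01):
    """Await make_call() with a timeout per attempt and exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off 1x, 2x, 4x ... the base delay before retrying.
                await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error

# Hypothetical provider call that fails twice, then succeeds.
state = {"calls": 0}

async def flaky_tts():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated provider hiccup")
    return b"audio-bytes"

result = asyncio.run(call_with_retries(flaky_tts))
```

In an assembled stack you would typically need a variant of this for each provider, tuned to that provider's timeouts and error types.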

Different scaling model

Assembled stacks force you to manage separate rate-limit namespaces. STT cares about WebSocket connections. LLMs care about tokens per minute. TTS may care about concurrent generations.

Those limits don't map cleanly. A traffic spike can stress all three at once. In contrast, bundled APIs hide more of that internal coordination behind one surface.

With Deepgram, that coordination collapses into one project-level concurrency pool rather than three separate ones, so you're tracking a single set of limits instead of reconciling three.
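
The reconciliation problem in an assembled stack can be sketched as finding which of the three limits binds first. The function below is an illustration under simplified assumptions: one STT WebSocket per call, a steady LLM token draw per call, and a TTS slot that is busy only for the fraction of time the agent is speaking. All quota numbers are hypothetical.

```python
def binding_limit(
    stt_ws_connections: int,
    llm_tokens_per_min: int,
    tokens_per_call_per_min: int,
    tts_concurrent_generations: int,
    tts_busy_fraction: float,
):
    """Return (provider, max_concurrent_calls) for the tightest limit."""
    ceilings = {
        # One WebSocket per live call.
        "stt": stt_ws_connections,
        # Token budget divided by what each live call consumes per minute.
        "llm": llm_tokens_per_min // tokens_per_call_per_min,
        # Slots are shared because the agent only speaks part of the time.
        "tts": int(tts_concurrent_generations / tts_busy_fraction),
    }
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]

provider, capacity = binding_limit(
    stt_ws_connections=100,
    llm_tokens_per_min=300_000,
    tokens_per_call_per_min=6_000,
    tts_concurrent_generations=20,
    tts_busy_fraction=0.5,
)
```

Here the TTS quota caps the system at 40 concurrent calls even though the STT plan allows 100, which is the kind of mismatch a single concurrency pool removes from your capacity planning.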

Customization limits

Bundled APIs constrain what you can swap. Deepgram's feature set includes BYO LLM and BYO TTS options. You can keep Deepgram's STT and orchestration layer. You can also route to your preferred language model or speech synthesis provider.

What you can't swap as easily is the STT layer. If your word error rate target depends on a domain-specific ASR model trained on your data, an assembled stack still makes more sense.

Scaling Math: Assembled vs. Bundled at Production Volume

At low volume, assembled and bundled stacks can look closer than expected. At production volume, the differences show up in infrastructure and latency budgets.

Cost shape at volume

At modest volume, a DIY stack can cost materially less than some fully managed platforms, since you're paying raw component rates without an orchestration markup.

At larger volume, the math can diverge further. Assembled stacks can add infrastructure and subscription overhead as you scale.

Production examples

Five9 handles billions of call minutes annually across 2,000+ customers. It integrated Deepgram's ASR for real-time transcription in its IVA platform. The deployment achieved 2–4x higher accuracy than alternatives for alphanumeric inputs. One healthcare customer also doubled user authentication rates.

That example doesn't prove one architecture wins everywhere. It does show that speech infrastructure decisions affect downstream business outcomes.

Latency budget

Whether your STT streams or runs in batch is still the biggest latency decision you'll make. Cascaded pipelines that stream end to end achieved sub-second time-to-first-audio in cited research. Waiting for a full batch transcription instead pushes first-word delay into multi-second territory that breaks the feel of a live conversation.

Bundled APIs that stream internally can target similar latency profiles without asking you to tune every part yourself.
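
The budget arithmetic is simple enough to write down. In the sketch below, the per-stage numbers are illustrative placeholders rather than measurements; the only thing that changes between the two cases is how long the pipeline waits for a usable transcript.

```python
def time_to_first_audio_ms(transcript_ready_ms: int, llm_first_token_ms: int, tts_first_audio_ms: int) -> int:
    """Sum the stage delays that sit between end of speech and first audio out."""
    return transcript_ready_ms + llm_first_token_ms + tts_first_audio_ms

def within_budget(total_ms: int, budget_ms: int = 1000) -> bool:
    """Check a total against an assumed one-second conversational budget."""
    return total_ms <= budget_ms

# Placeholder stage delays, not benchmark results.
streaming_total = time_to_first_audio_ms(transcript_ready_ms=200, llm_first_token_ms=350, tts_first_audio_ms=150)
batch_total = time_to_first_audio_ms(transcript_ready_ms=2500, llm_first_token_ms=350, tts_first_audio_ms=150)
```

With identical LLM and TTS delays, the streaming case lands at 700 ms and the batch case at 3,000 ms, so the transcript wait alone decides whether the turn fits a sub-second budget.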

Choosing Your Voice Agent Architecture

For most teams, bundled is the faster starting point. You should assemble components only when your requirements demand deeper control.

Start bundled first

If you're below 10,000 minutes per month and don't need custom ASR, a bundled API can save weeks of orchestration work. If you're above that threshold and already have infrastructure depth, assembled components give you more cost levers.

Deepgram's BYO LLM and BYO TTS tiers create a middle path. You keep orchestration and STT while swapping the model layers you care about most.

Use three filters

Three questions usually cut through the noise. Do you need a custom-trained STT model for domain vocabulary? Do compliance requirements force on-premises deployment for every component? Do you have the engineering capacity to maintain cross-provider error handling?

If the answer is no across the board, bundled is usually the simpler starting move.
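
The three filters can be written as a small decision helper. This is a sketch of the rule of thumb above, not a formal sizing tool; the Requirements fields and the returned labels are names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    needs_custom_stt: bool                 # domain vocabulary needs a custom-trained model
    needs_on_prem_every_component: bool    # compliance forces on-premises deployment
    can_maintain_cross_provider_ops: bool  # capacity for cross-provider error handling

def starting_architecture(req: Requirements) -> str:
    """Return 'bundled' when every filter is answered no, else 'evaluate assembled'."""
    answers = (
        req.needs_custom_stt,
        req.needs_on_prem_every_component,
        req.can_maintain_cross_provider_ops,
    )
    return "evaluate assembled" if any(answers) else "bundled"

default_choice = starting_architecture(Requirements(False, False, False))
custom_asr_choice = starting_architecture(Requirements(True, False, False))
```

A yes on any filter does not settle the question by itself; it only signals that an assembled stack deserves a closer look before you commit.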

Test the Full Loop Yourself

If you want to test the full STT, LLM, and TTS loop yourself, start in the Deepgram Console. Try it now and confirm the current new-account offer at signup, including any available $200 free credits.

FAQ

What is a bundled voice agent API?

It's one endpoint that packages STT, LLM orchestration, and TTS. You send audio in and receive speech back, without wiring three separate services yourself.

How does bundled pricing compare with per-component billing at scale?

Bundled pricing is easier to forecast because it reduces token and character variability. At high volume, assembled stacks can be cheaper, but they need more infrastructure work.

Can you use your own LLM or TTS with a bundled voice agent API?

Yes. Deepgram's Voice Agent API supports BYO LLM and BYO TTS options while keeping Deepgram's STT and orchestration layer.

What latency should you expect from a bundled voice agent architecture?

Expect performance near well-tuned streaming pipelines. The bigger variable is streaming versus batch STT, not bundled versus assembled.

When should you build your own STT, LLM, and TTS pipeline instead of using a bundled API?

Build your own when you need custom ASR, a self-hosted LLM, on-premises deployment, or proprietary TTS control. It also fits teams that already run WebSocket infrastructure and cross-provider monitoring.
