Speech-to-Speech vs Cascade: Voice Agent Architecture

Key Takeaways
Provider Comparison at a Glance
How to Read This Table
Comparison Methodology
Decision Point Summary
How Cascade and Speech-to-Speech Architectures Work
What Happens Inside a Cascade Pipeline
What Happens Inside a Speech-to-Speech Model
Where the Architectures Overlap
Where Each Architecture Breaks in Production
Cascade Failure Modes at Scale
S2S Failure Modes at Scale
The Observability Gap
Cost, Compliance, and Observability Tradeoffs
Pricing Models and Cost Predictability
Compliance and Audit Trail Requirements
When Observability Is a Regulatory Requirement
How to Choose the Right Architecture for Your Workload
Workloads That Favor Cascade
Workloads That Favor S2S
The Case for Bundled Cascade APIs
Building Voice Agents That Ship and Stay Running
Start With Your Constraints
Try It Yourself
FAQ
What Is the Main Difference Between S2S and Cascade Voice Agent Architectures?
Can Speech-to-Speech Models Handle Function Calling and Tool Use?
Which Architecture Has Lower Latency for Voice Agents?
Do Speech-to-Speech Models Work for HIPAA-Compliant Voice Applications?
Can You Switch From Cascade to Speech-to-Speech Without Rebuilding Your Voice Agent?

The voice agent market has split into two architectural camps. Cascade pipelines chain separate STT, LLM, and TTS components through a text layer. Speech-to-speech (S2S) models skip text entirely: they map audio in to audio out in a single model.

Most comparisons frame this as a latency question, but cost, debuggability, and compliance readiness matter more at production scale. The architecture you pick determines your cost trajectory and your debugging options. It also affects whether you can pass a compliance audit. Get it wrong before your first deployment, and you'll spend months rebuilding it: what should have been a configuration choice turns into an architectural rewrite.

As of 2026, vendors including OpenAI, Google Cloud, and Hume AI offer S2S options such as Realtime API, Gemini Live, and EVI 3. Cascade remains the production default for most voice agent workloads. This guide breaks down when each architecture earns that role.

Key Takeaways

Here's what you need to know before choosing a voice agent architecture:

Cascade pipelines produce text at every stage. That gives you component-level debugging and compliance-ready audit trails across all three pipeline boundaries.
S2S models can match cascade on latency. They also introduce opaque failure modes that are harder to trace.
Token-based S2S pricing grows non-linearly with conversation length. Observed costs can reach 4x theoretical minimums.
Regulated workloads such as HIPAA and SOC 2 structurally favor cascade. Text intermediates simplify auditability.
Bundled cascade APIs combine the debuggability of cascade with single-API integration simplicity.

Provider Comparison at a Glance

Cascade is the safer production default because it gives you text boundaries for debugging, audit trails, and component swaps. S2S is simpler on paper, but it removes the text layer many production workflows depend on.

How to Read This Table

Each row isolates a structural property that affects debugging, cost, or compliance workflows. The comparisons reflect architecture-level differences, not vendor-specific implementations.

Comparison Methodology

This comparison focuses on production concerns you can inspect directly: auditability, failure isolation, cost behavior, and component flexibility. It compares architectural defaults, not polished demo behavior.

Decision Point Summary

This table highlights the tradeoffs that usually decide your architecture. Use it to compare auditability, failure isolation, and cost behavior before you tune for raw latency.

How Cascade and Speech-to-Speech Architectures Work

The core difference is simple. Cascade exposes text at each handoff, while S2S keeps everything inside one speech model.

What Happens Inside a Cascade Pipeline

A cascade voice agent processes audio through three discrete stages. First, speech-to-text converts the caller's audio into a transcript. That transcript passes to an LLM for intent resolution and response generation. Finally, text-to-speech converts the LLM's text output into audio the caller hears.

At every boundary, you get a readable text artifact. You can log it, filter it, redact it, or route it to a different component.
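
The three stages can be sketched as one function per caller turn that keeps the text artifact from each boundary. This is a minimal illustration, not any vendor's API: `run_cascade_turn`, `TurnTrace`, and the `fake_*` stand-ins are hypothetical names, and a real pipeline would stream audio rather than pass whole byte strings.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TurnTrace:
    """Text artifacts captured at each cascade boundary for one caller turn."""
    transcript: str = ""  # STT output
    llm_reply: str = ""   # LLM output
    tts_input: str = ""   # text actually handed to TTS (after filtering)


def run_cascade_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    text_filter: Callable[[str], str] = lambda text: text,
) -> Tuple[bytes, TurnTrace]:
    """Run STT -> LLM -> TTS and keep the text seen at every handoff."""
    trace = TurnTrace()
    trace.transcript = stt(audio)
    trace.llm_reply = llm(trace.transcript)
    # The text boundary is where logging, filtering, or redaction can happen.
    trace.tts_input = text_filter(trace.llm_reply)
    return tts(trace.tts_input), trace


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Because each stage is passed in as a plain callable, swapping one component means passing a different function; the trace shape stays the same.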

What Happens Inside a Speech-to-Speech Model

A speech-to-speech model takes raw audio as input and produces raw audio as output. There's no text stage in between. The OpenAI Realtime API documentation describes this as voice-to-voice interaction without an intermediate STT or TTS step, which reduces latency for voice interfaces.

In practice, the model handles transcription, reasoning, and synthesis internally, with no handoff between separate components. You get audio out. But you don't get a text record of what the model "thought" or "said" unless you add a parallel transcription layer.

Where the Architectures Overlap

Both architectures support function calling, session management, and streaming audio output. The overlap is real, but the internal mechanics differ in ways that matter for production. Cascade exposes text at each stage by default. S2S models can produce transcripts, but those transcripts come from a parallel process, and that process sometimes diverges from the audio.

A stream mismatch reported on Azure shows audio and transcript streams producing different content. Documented cases include language bleed-through, where the audio briefly switches languages while the transcript stays correct.
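
One way to catch this kind of divergence is to transcribe the output audio independently and compare it with the transcript the model reported. The sketch below assumes you already have both strings; `transcript_agreement`, `diverged`, and the 0.8 threshold are hypothetical choices for illustration, not values from any provider.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    cleaned = "".join(
        ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text
    )
    return cleaned.split()


def transcript_agreement(model_transcript: str, independent_transcript: str) -> float:
    """Word-level similarity in [0, 1] between the two transcripts."""
    a = normalize(model_transcript)
    b = normalize(independent_transcript)
    if not a and not b:
        return 1.0  # two empty transcripts agree trivially
    return SequenceMatcher(None, a, b).ratio()


def diverged(
    model_transcript: str,
    independent_transcript: str,
    threshold: float = 0.8,  # assumed cutoff; tune against your own recordings
) -> bool:
    """Flag a turn whose reported transcript does not match the audio."""
    return transcript_agreement(model_transcript, independent_transcript) < threshold
```

A language bleed-through like the one described above would show up as low agreement, because the independent transcript contains words the reported transcript never had.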

Where Each Architecture Breaks in Production

Both architectures fail in production, but they fail differently. Cascade makes failures easier to isolate. S2S often hides the root cause inside one model.

Cascade Failure Modes at Scale

Cascade latency is additive: each component contributes its own processing time, and the LLM stage is typically the bottleneck. Domain-specific vocabulary also creates compound failures. Heavy use of industry-specific terms combined with degraded telephony audio quality increases error rates.

Tracking word error rate at the STT stage lets you isolate transcription degradation before it compounds downstream. The upside is that you can pinpoint which component failed. A transcription error looks different from an LLM hallucination.
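
Word error rate is the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and case-insensitive matching; production scoring usually applies heavier text normalization first, and the example phrases are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "patient takes metro roll daily" against the reference "patient takes metoprolol daily" counts one substitution and one insertion over four reference words, a WER of 0.5. That is the kind of domain-term failure described above.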

S2S Failure Modes at Scale

S2S failures tend to be silent and hard to attribute. Pipecat's issue tracker shows race conditions in interruption handling that leak stale audio. It also documents VAD misconfigurations that silently break turn detection on 8 kHz telephony audio.

Gemini Live message parsing bugs have crashed workers by returning malformed responses that the SDK didn't catch. In one case from LiveKit Agents, a fallback adapter added for reliability silently disabled adaptive interruption detection. The reliability pattern itself introduced the failure.

The Observability Gap

Cascade gives you text logs at every stage. S2S gives you audio in and audio out, plus an optional transcript that may not match. When a cascade agent says something wrong, you check the transcript, the LLM output, and the TTS input. That narrows the root cause fast.

When an S2S agent says something wrong, you listen to recordings and hope the transcript is accurate. If you've ever scrubbed through a 20-minute call recording looking for a single wrong utterance, you know how that scales. Engineers who recommend cascade for production specifically cite monitoring, debugging, and instruction-following as weaker in S2S APIs today.
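
The per-stage text logs that cascade provides can be as simple as one structured record per stage, keyed by a turn identifier. The sketch below is one possible shape under stated assumptions: `traced_stage`, `find_stage_records`, and the record fields are hypothetical, and an in-memory list stands in for a real log sink.

```python
import json
import time
from typing import Any, Callable, Dict, List


def traced_stage(
    turn_id: str,
    stage: str,
    fn: Callable[[Any], Any],
    payload: Any,
    sink: List[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Any:
    """Run one pipeline stage and append a JSON log line describing it."""
    start = clock()
    result = fn(payload)
    record: Dict[str, Any] = {
        "turn_id": turn_id,
        "stage": stage,
        "latency_ms": round((clock() - start) * 1000.0, 3),
    }
    # Only text is logged; audio payloads are skipped in this sketch.
    if isinstance(payload, str):
        record["input_text"] = payload
    if isinstance(result, str):
        record["output_text"] = result
    sink.append(json.dumps(record))
    return result


def find_stage_records(sink: List[str], turn_id: str) -> List[Dict[str, Any]]:
    """Return every logged stage record for one turn, in pipeline order."""
    return [rec for rec in map(json.loads, sink) if rec["turn_id"] == turn_id]
```

With records like these, finding a wrong utterance becomes a lookup by `turn_id` instead of a pass through the recording, and the `latency_ms` fields show which stage dominates the additive latency.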

Cost, Compliance, and Observability Tradeoffs

If you care about predictable costs and auditability, cascade has the structural advantage. S2S can work, but you'll often add extra layers to recover what cascade already gives you.

Pricing Models and Cost Predictability

Token-based S2S pricing looks cheap at first glance, but the list rates are theoretical minimums. The OpenAI Realtime API re-sends the full conversation history each turn, so billed tokens grow faster than the conversation itself. Observed production costs can climb well above those minimums on longer sessions.

Budget meetings don't go well when your per-minute cost depends on how chatty your callers are. Deepgram's Voice Agent API uses connection-time billing instead. You pay per minute of WebSocket session time regardless of token volume. Check current rates on the Deepgram pricing page.
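
The effect of re-sending history can be shown with simple arithmetic. The sketch below uses a deliberately simplified model in which every turn adds the same number of tokens and each turn is billed for the whole history so far. All function names and figures are hypothetical, and caching, truncation, and audio versus text token rates are not modeled.

```python
def naive_tokens(turns: int, tokens_per_turn: int) -> int:
    """Tokens billed if each turn were billed once (the theoretical minimum)."""
    return turns * tokens_per_turn


def billed_tokens_with_history(turns: int, tokens_per_turn: int) -> int:
    """Tokens billed when turn t is charged for all t turns of history."""
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


def history_multiplier(turns: int, tokens_per_turn: int) -> float:
    """How far billed tokens exceed the naive estimate under this model."""
    return billed_tokens_with_history(turns, tokens_per_turn) / naive_tokens(
        turns, tokens_per_turn
    )


def connection_time_cost(minutes: float, rate_per_minute: float) -> float:
    """Connection-time billing: cost depends only on session length."""
    return minutes * rate_per_minute
```

Under this model the multiplier is (turns + 1) / 2, so it keeps climbing as the conversation gets longer: 7 turns bill 4 times the naive estimate and 10 turns bill 5.5 times. Connection-time cost, by contrast, is linear in minutes no matter how many turns fit into them.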

Compliance and Audit Trail Requirements

HIPAA's audit controls standard requires mechanisms that record and examine activity in systems containing electronic protected health information. The regulation doesn't specify text transcripts by name. In practice, compliance teams treat transcripts as required audit artifacts.

HHS telehealth guidance treats session recordings and transcripts as separate categories of ePHI requiring Security Rule safeguards. Cascade pipelines generate these text artifacts as a natural byproduct. S2S pipelines need a parallel transcription layer to produce equivalent documentation.

When Observability Is a Regulatory Requirement

For voice AI, SOC 2 Processing Integrity is usually about proving processing is complete, valid, accurate, timely, and authorized. Compliance practitioners treat AI logging and monitoring as central to that work, especially where output accuracy matters.

To prove those properties, you need to inspect inputs and outputs at each stage. Cascade gives you that inspection point at every text boundary. S2S requires you to reconstruct those inspection points through additional transcription, which adds both latency and architectural complexity. The NIST AI RMF 1.0 also treats explainability and monitoring as support for documentation, audit, and governance.

How to Choose the Right Architecture for Your Workload

Your architecture should follow your constraints, not the demo that sounds best. In most production workloads, transcript needs, compliance requirements, and pricing behavior matter more than novelty.

Workloads That Favor Cascade

Choose cascade when you need component-level control. Healthcare voice agents that must redact PHI before it reaches an LLM need a text boundary. That's where redaction happens deterministically.
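
Deterministic redaction at the text boundary can be sketched with a few regular expressions applied to the STT transcript before it reaches the LLM. The patterns below are illustrative assumptions only: they cover a handful of US-style identifiers, assume the STT stage emits formatted numerals, and are nowhere near a complete PHI detector.

```python
import re

# Hypothetical pattern set; a real deployment needs a reviewed, much broader list.
REDACTION_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def redact(transcript: str) -> str:
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in REDACTION_PATTERNS:
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript
```

The same input always produces the same output, which is what makes the step auditable. An audio-only path has no equivalent place to apply it.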

Contact centers need text at the STT output stage for audio intelligence analysis on transcripts. Any workload where you swap LLM providers quarterly, or A/B test TTS voices, benefits from components you can replace independently.

Workloads That Favor S2S

S2S fits when natural conversational dynamics matter more than auditability. Creative applications, demo environments, or consumer-facing voice experiences without compliance requirements can benefit from S2S's unified model. Mixed-language conversations where your callers switch languages mid-sentence are another strong fit. If you don't need reliable function calling and don't need to inspect the reasoning layer, S2S reduces pipeline complexity.

The Case for Bundled Cascade APIs

You don't have to choose between cascade's debuggability and S2S's simplicity. Bundled cascade APIs give you text intermediates with a single-API integration experience. Deepgram's Voice Agent API combines STT, LLM orchestration, and TTS over a single WebSocket connection. You also get BYO LLM and BYO TTS options for component flexibility. Check the Voice Agent API docs for current capabilities.

Building Voice Agents That Ship and Stay Running

Start with your constraints, then choose the simplest architecture that still meets them. If you pick the wrong foundation early, you'll pay for it in cost, debugging time, and rework.

Start With Your Constraints

Map your requirements before you pick a model. If you need HIPAA audit trails, cascade is the structural path. Deepgram maintains HIPAA-aligned deployments. BAA availability is handled via sales and enterprise agreements; see compliance documentation.

If you need predictable per-session costs across thousands of concurrent sessions, connection-time billing beats per-token billing. If you need to swap your LLM provider in six months, cascade keeps that option open. If you need mixed-language support without compliance constraints, S2S handles code-switching more naturally than chained components.
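
The rules in this section can be written down as a small checklist function. This sketch only encodes the four conditions named above; `Constraints` and `recommend` are hypothetical names, and a real decision would weigh more factors than four booleans.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Constraints:
    needs_audit_trail: bool = False
    needs_predictable_session_cost: bool = False
    needs_llm_swap: bool = False
    needs_code_switching: bool = False


def recommend(c: Constraints) -> Tuple[str, List[str]]:
    """Return ("cascade" | "s2s" | "either", reasons) for the given constraints."""
    notes: List[str] = []
    if c.needs_predictable_session_cost:
        # A billing-model preference, independent of the architecture choice.
        notes.append("prefer connection-time billing over per-token billing")

    cascade_reasons: List[str] = []
    if c.needs_audit_trail:
        cascade_reasons.append("audit trails need text intermediates")
    if c.needs_llm_swap:
        cascade_reasons.append("swapping the LLM needs replaceable components")

    if cascade_reasons:
        return "cascade", cascade_reasons + notes
    if c.needs_code_switching:
        return "s2s", ["mixed-language callers without compliance constraints"] + notes
    return "either", notes
```

Compliance and component-swap needs are checked first, so a workload that needs both an audit trail and mixed-language support still lands on cascade, matching the ordering of the advice above.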

Try It Yourself

Deepgram's Voice Agent API gives you a bundled cascade pipeline with predictable billing, plus BYO LLM and BYO TTS options. Sign up free with $200 in credits and test how your audio performs under production conditions.

FAQ

What Is the Main Difference Between S2S and Cascade Voice Agent Architectures?

Cascade gives you text between STT, LLM, and TTS. S2S keeps that work inside one model. If your logging, filtering, or analytics depend on text, migration gets harder fast.

Can Speech-to-Speech Models Handle Function Calling and Tool Use?

Yes, but support isn't the same as reliability. The main production question is whether audio input and output reduce tool-use consistency enough to break your workflow.

Which Architecture Has Lower Latency for Voice Agents?

Neither wins by default. WebSocket setup, VAD behavior, network hops, codec choice, and sample rate all affect response time. Benchmark your own stack.

Do Speech-to-Speech Models Work for HIPAA-Compliant Voice Applications?

They can, but you'll usually need parallel transcription to create audit artifacts. That adds cost, complexity, and another failure point.

Can You Switch From Cascade to Speech-to-Speech Without Rebuilding Your Voice Agent?

Usually not. Text-based QA, filtering, and compliance logging don't transfer cleanly to audio-only workflows, so you'll likely rebuild those layers.
