
By Adam Sypniewski, CTO
Real-time text-to-speech (TTS) is deceptively hard to scale. Generating natural speech is one challenge; generating it in under 200 milliseconds for hundreds or thousands of users at once is another. At that scale, model size, efficiency, and infrastructure cost collide.
For consumer use cases like audiobooks or pre-recorded content, latency barely matters. But for live voice agents, contact centers, and conversational AI, responsiveness defines the user experience. If speech does not begin within a fraction of a second after someone finishes speaking, the interaction feels broken.
The industry’s usual fix is to throw hardware at the problem. That may work for small tests, but at enterprise scale it increases cost, complexity, and inefficiency, especially for customers deploying in VPC or bare-metal environments. Aura-2 took a different path. Instead of scaling hardware, we reengineered the runtime for parallelism and orchestration, achieving stable sub-200ms latency and, in steady-state conditions, around 90ms.
The Engineering Challenge: TTFB and Concurrency
Two factors determine whether real-time TTS feels conversational: time to first byte (TTFB) and concurrency. TTFB governs responsiveness; concurrency governs efficiency and cost. The tension is that improving one often worsens the other.
As concurrency rises, so does processing overhead. The bottleneck is not synthesis itself but what happens before synthesis begins. Each incoming request must be parsed, staged, and prepared for inference, a process we call prompt processing. At high concurrency, that stage sets the floor for latency: every stream competes for the same compute and memory, and contention over shared hardware such as GPU memory, CPU cache, and PCIe bandwidth can quickly turn into bottlenecks or thrashing.
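To make that floor concrete, here is a minimal Rust sketch with toy numbers (it is not Aura-2's pipeline): a burst of requests shares a single prompt-processing worker, and each request's TTFB grows with its position in the queue.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Toy model: every request passes through a shared prompt-processing stage
// before synthesis can emit its first audio byte. With one processor,
// queueing in that stage sets the floor for time-to-first-byte (TTFB).
fn main() {
    const CONCURRENT_REQUESTS: usize = 8;
    const PROMPT_PROCESSING: Duration = Duration::from_millis(20);

    let (tx, rx) = mpsc::channel::<(usize, Instant)>();

    // One shared prompt processor: requests queue here under load.
    let processor = thread::spawn(move || {
        for (id, arrived) in rx {
            thread::sleep(PROMPT_PROCESSING); // parse, stage, prepare for inference
            // The "first byte" can only be produced after prompt processing finishes.
            println!("request {id}: TTFB ~ {:?}", arrived.elapsed());
        }
    });

    // All requests arrive at once (worst-case burst).
    for id in 0..CONCURRENT_REQUESTS {
        tx.send((id, Instant::now())).unwrap();
    }
    drop(tx);
    processor.join().unwrap();
}
```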
When Aura-2 launched, it already achieved sub-200ms latency at scale. But we wanted to push the boundary further. By rearchitecting how the runtime distributes and schedules work across GPUs, and optimizing execution paths for in-place operations, we consistently reached around 90ms in steady-state performance without compromising quality or concurrency.
Aura-2 at Launch: Establishing the Baseline
Aura-2’s public debut already put it ahead of most systems in production. At launch, it delivered consistent sub-200ms latency, an enterprise-grade voice catalog with domain-specific precision, and deployment flexibility through the Deepgram Enterprise Runtime. It was designed not just for speech quality but for enterprise production environments where accuracy, responsiveness, and compliance all matter.
But launch was only the beginning. Internally, we viewed Aura-2 as a foundation to build on, not a finished product. The key questions quickly became: could we push concurrency further, reduce latency variability under load, and redesign runtime orchestration to keep GPUs busy without adding hardware?
Engineering Advances: Runtime-Oriented Design
Where others add compute, we chose to re-architect the runtime. Adding more GPUs can temporarily reduce latency, but it does not fix the underlying coordination problem. Each additional device increases synchronization overhead and resource contention, driving up cost and power without improving true throughput. Our focus was on orchestration, ensuring that every GPU in the system was fully utilized without introducing new bottlenecks.
One major advance was workload partitioning. In earlier designs, prompt handling and audio synthesis competed for the same GPU resources. This meant that under heavy load, prompt processing could throttle the entire system. In the new runtime, we separated these concerns. Prompt handling was isolated so it no longer dragged down concurrency, while synthesis could proceed smoothly in parallel. The effect was immediate: more streams could run concurrently without raising TTFB.
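A minimal sketch of the separation, using standard-library threads and a bounded channel (the Request and SynthJob types are hypothetical stand-ins, not Aura-2 interfaces): prompt handling feeds synthesis through a queue, so a burst of new prompts never steals cycles from streams that are already generating audio.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Hypothetical request/job types for illustration only.
struct Request { text: String }
struct SynthJob { tokens: Vec<u32> }

fn main() {
    // A bounded channel decouples the two stages: prompt handling can queue
    // work without ever blocking synthesis workers that are mid-stream.
    let (job_tx, job_rx) = sync_channel::<SynthJob>(64);

    // Stage 1: prompt handling on its own worker, away from synthesis resources.
    let prompt_worker = thread::spawn(move || {
        for req in incoming_requests() {
            let tokens = req.text.bytes().map(u32::from).collect(); // stand-in for tokenization
            job_tx.send(SynthJob { tokens }).unwrap();
        }
    });

    // Stage 2: synthesis on a separate worker, streaming audio as jobs arrive.
    let synth_worker = thread::spawn(move || {
        for job in job_rx {
            thread::sleep(Duration::from_millis(5)); // stand-in for GPU inference
            println!("synthesized {} frames", job.tokens.len());
        }
    });

    prompt_worker.join().unwrap();
    synth_worker.join().unwrap();
}

fn incoming_requests() -> impl Iterator<Item = Request> {
    (0..4).map(|i| Request { text: format!("utterance {i}") })
}
```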
We also revisited GPU scheduling and memory management. Static GPU memory allocation strategies, common in industry baselines, were leaving performance on the table. Instead, we moved to dynamic orchestration that adjusts resource allocation on the fly. This reduced contention and kept latency steady even when traffic spiked unpredictably.
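The sketch below shows the shape of that idea under assumed thresholds (the 150ms trigger and stream counts are illustrative, not Aura-2's actual policy): an admitted-stream budget that backs off when tail latency creeps up and grows again when the queue drains. In the real runtime the signals and policy are far richer, but the principle is the same: allocation follows observed load rather than a static split.

```rust
// Minimal sketch of a dynamic budget: rather than a fixed per-GPU allocation,
// the orchestrator grows or shrinks the number of admitted streams based on
// observed queue depth and tail latency. All numbers here are illustrative.
struct StreamBudget {
    admitted: usize,
    max: usize,
}

impl StreamBudget {
    fn adjust(&mut self, queue_depth: usize, p95_ttfb_ms: f64) {
        if p95_ttfb_ms > 150.0 && self.admitted > 1 {
            self.admitted -= 1; // back off before latency degrades further
        } else if queue_depth == 0 && self.admitted < self.max {
            self.admitted += 1; // headroom available: pack in another stream
        }
    }
}

fn main() {
    let mut budget = StreamBudget { admitted: 16, max: 64 };
    // Simulated observations: (queue depth, p95 TTFB in ms).
    for (depth, ttfb) in [(0, 92.0), (0, 95.0), (3, 160.0), (0, 110.0)] {
        budget.adjust(depth, ttfb);
        println!("queue={depth} p95={ttfb}ms -> admitted streams: {}", budget.admitted);
    }
}
```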
Execution paths were another area of improvement. Tools like CUDA graphs and Triton kernels are widely available, but using them effectively requires careful integration. We combined them with refined batch management and maximized in-place operations, particularly in the text-to-audio generation stage. These optimizations shaved off overhead and increased throughput. The result was not marginal gains but a step-change in how efficiently GPUs could be driven.
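As a simplified illustration of the in-place philosophy, here is a hypothetical buffer pool in Rust that a generation stage could recycle instead of allocating a fresh buffer per audio chunk; the pool, buffer size, and sample rate are assumptions for the example, not Aura-2 internals.

```rust
// Sketch of in-place buffer reuse in the generation stage: a small pool of
// preallocated audio buffers is recycled instead of allocating per chunk.
use std::collections::VecDeque;

struct BufferPool {
    free: VecDeque<Vec<f32>>,
    frames: usize,
}

impl BufferPool {
    fn new(count: usize, frames: usize) -> Self {
        Self {
            free: (0..count).map(|_| vec![0.0; frames]).collect(),
            frames,
        }
    }

    fn acquire(&mut self) -> Vec<f32> {
        // Reuse an existing buffer when possible; writes then happen in place.
        let frames = self.frames;
        self.free.pop_front().unwrap_or_else(|| vec![0.0; frames])
    }

    fn release(&mut self, buf: Vec<f32>) {
        self.free.push_back(buf); // return without deallocating
    }
}

fn main() {
    let mut pool = BufferPool::new(4, 2400); // 2400 samples ~ 100ms at 24kHz (assumed rate)
    for chunk in 0..8 {
        let mut buf = pool.acquire();
        buf.iter_mut().for_each(|s| *s = 0.1); // stand-in for writing synthesized audio in place
        println!("chunk {chunk}: wrote {} samples without a fresh allocation", buf.len());
        pool.release(buf);
    }
}
```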
Crucially, all of this was made possible by choices made years earlier. Building our runtime in Rust rather than Python gave us thread safety and low overhead, enabling the fine-grained orchestration that these optimizations demanded. In many ways, Rust was the conductor that allowed the orchestra of GPUs, kernels, and memory managers to play in time. Without that system foundation, the improvements above would have been nearly impossible.
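A toy example of what that foundation buys (the GpuQueue type is purely illustrative): shared scheduler state handed to several worker threads, with the compiler refusing to build any version that is not thread-safe.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative shared state: a queue of pending jobs for a GPU worker pool.
#[derive(Default)]
struct GpuQueue { pending: Vec<u64> }

fn main() {
    let queue = Arc::new(Mutex::new(GpuQueue::default()));

    let workers: Vec<_> = (0..4)
        .map(|worker_id| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                // The Mutex guarantees exclusive access at runtime; the compiler
                // rejects sharing any state across threads that is not Send/Sync.
                queue.lock().unwrap().pending.push(worker_id);
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    println!("scheduled {} jobs safely", queue.lock().unwrap().pending.len());
}
```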
I want to pause here to recognize the team behind these advances. Brent George, Andrew Lehmer, and Joshua Gevirtz all played critical roles in pushing Aura-2 forward. Their work demonstrates what happens when rigorous research, creative engineering, and relentless optimization come together.
Benchmarks and Evidence
The results speak for themselves. On enterprise-grade GPUs, Aura-2 sustained significantly higher concurrency than comparable TTS systems while keeping p95 TTFB under 200 milliseconds, with steady-state performance often around 90 milliseconds. Across both H200 and L40S deployments, Aura-2 maintained this latency profile even as load scaled substantially.
Efficiency was equally important. Rather than scaling costs linearly with demand, Aura-2 packs more streams into each GPU while maintaining conversational responsiveness. Competing architectures often require significantly more resources to achieve similar performance. Aura-2’s runtime proves that smarter engineering can deliver both lower latency and higher concurrency without pushing cost and complexity onto customers.
Table: Core Benchmarks for Aura-2 Runtime
| Metric | H200 (Enterprise Runtime) | L40S (Enterprise Runtime) |
|---|---|---|
| Relative Concurrency Supported | High concurrency utilization (≈30–40% higher stream density per GPU vs. previous generation) | Comparable or greater concurrency scaling when paired (2×L40S GPUs) |
| p95 TTFB (steady-state) | ~90–200ms | ~90–200ms |
For customers, these benchmarks translate into tangible outcomes. Contact center agents respond conversationally fast. Voice assistants scale to handle thousands of simultaneous users without latency spikes. Infrastructure teams see lower bills and simpler scaling decisions because concurrency is engineered into the runtime itself.
In practice, this means scaling up no longer requires spinning up additional machines. Enterprises can serve more users with the same infrastructure, while maintaining natural-sounding responses and predictable performance under load.
Continuous Optimization
We emphasize that these benchmarks are not absolutes. Results vary depending on workload type, prompt length, and traffic shape. Real-world enterprise environments are dynamic, and optimization is never finished.
Our culture at Deepgram is to treat this as an ongoing process. Every deployment generates traces, edge cases, and ideas for improvement. Engineers debate them, test hypotheses, and push refinements into production. That cycle (trace, debate, refine, validate) is how Aura-2 achieved its gains, and it is how we will continue to build.
Engineering as Deepgram’s Superpower
Aura-2’s performance breakthroughs are not the result of buying more GPUs or flipping a switch. They are the product of careful orchestration: partitioning workloads, rethinking GPU scheduling, optimizing execution paths, and building on a systems foundation that prioritizes efficiency.
Where others rely on brute force, we treat engineering as the differentiator. The result is a model that responds faster, runs leaner, and scales more predictably in enterprise production environments.
This is not a one-off achievement. It reflects how we build every model: engineering-first, transparent in benchmarks, and relentless in optimizing for real-world scale. For enterprises that need TTS to be fast, efficient, and dependable, Aura-2 is proof that engineering at the runtime level makes all the difference. And by solving efficiency at the engineering level, we unlock more freedom to invest compute and research toward the next generation of speech models, rather than compensating for architectural inefficiencies.


