Article·Jul 21, 2025
10 min read

Designing Voice AI Workflows Using STT + NLP + TTS

Voice AI architecture built on STT → NLP → TTS still delivers the lowest latency and the greatest flexibility for customer-facing apps. If you want to build such a pipeline yourself from the ground up, check out this tutorial!
By Stephen Oladele

⏩ TL;DR

  • Voice AI architecture built on STT → NLP → TTS still delivers the lowest latency and the greatest flexibility for customer-facing apps.

  • IMPORTANT NOTE: If you want to develop this pipeline yourself, this article functions as a tutorial. However, if you’d like everything streamlined for you out-of-the-box, check out our Voice Agent API!

  • Deepgram’s streaming STT offers sub-300 ms transcription with >90 % accuracy out-of-the-box.

  • The new Aura-2 TTS model introduces enterprise-grade voices and sub-200 ms time-to-first-byte (TTFB) — 2–4× faster than many incumbents.

  • Pair both with an LLM layer (GPT-4o, Claude 3, Gemma 2, etc.) to build a conversational loop that feels “instant” to users.

Follow the step-by-step tutorial and repo below, then claim $200 in free Deepgram credits to test your own STT → NLP → TTS stack.


“Alexa, what’s the weather?” 

That four-word request triggers over two dozen microservices—yet the entire round trip still feels instant. At its core, building a real-time voice interaction involves a straightforward flow: speech-to-text → language reasoning → text-to-speech.

Master that cascade, and you can build anything from hands-free CRMs to real-time game narrators.

Why does this matter in 2025?
Voice users now expect sub-second back-and-forth. Miss that mark, and they tap the screen instead. The surest way to stay under the human turn-taking threshold (≈800 ms) is the proven STT → NLP → TTS pipeline.

What you’ll learn and build:

  • Transcribe with Deepgram Nova-3—sub-300 ms word finalization and industry-leading WER.

  • Reason with GPT-4o-mini (or a drop-in open-source model) for natural, tool-aware replies.

  • Speak with Deepgram Aura-2—< 200 ms TTFB (Time to First Byte) and 40-plus professional voices.

By the end, you’ll have:

  • A runnable repo and sub-2,500 ms round-trip latency benchmark (and a roadmap to < 800 ms with edge optimization).

  • A production-ready blueprint you can drop into call centres, IVRs, in-game companions, or any workflow that needs a voice.

🎙️ Stage 1: Choosing the Right STT (Why Deepgram Nova-3)

The first step in any voice agent pipeline is transcription: converting speech to text. This needs to be fast, accurate, and streaming-capable.

We chose Deepgram Nova-3 for its real-time performance:

Source: internal Deepgram benchmark, updated Apr 2025.

Features that matter for our stack:

  • Integrates via REST API and WebSocket-based streaming (no SDK lock-in).

  • < 300 ms streaming latency—maps neatly to our 300 ms STT budget and gives us headroom for GPT + TTS.

  • High accuracy on noisy audio, multi-speaker calls, and domain-specific data.

  • A single API key works for both STT and TTS.

You also get features like domain keyword biasing: inject key terms so the model snaps to domain jargon without retraining:
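For example, here's a sketch of passing key terms when opening the streaming connection (the keyterm query parameter is Deepgram's Nova-3 keyterm-prompting feature; earlier models use keywords instead):

```python
from urllib.parse import urlencode

KEYTERMS = ["Aura-2", "Nova-3", "TTFB", "IVR"]   # your domain jargon

query = urlencode(
    {"model": "nova-3", "punctuate": "true", "keyterm": KEYTERMS},
    doseq=True,  # repeats keyterm=<term> once per entry
)
DG_URL = f"wss://api.deepgram.com/v1/listen?{query}"
```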

💻 Quick Start: 20-line Nova-3 Streamer

Here are the prerequisites to build the voice AI agent in our demo (see the repo for the complete code):
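At a minimum you'll want something like this (package names and environment variables are a sketch; the repo's README has the definitive list):

```bash
# Python 3.10+ recommended; PyAudio needs the PortAudio system library
pip install websockets openai pyaudio

export DEEPGRAM_API_KEY="dg_..."   # one key covers both Nova-3 (STT) and Aura-2 (TTS)
export OPENAI_API_KEY="sk-..."
```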

Let’s get started! 🚀
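Below is a minimal sketch of what the streamer can look like over the raw WebSocket API (endpoint and message fields follow Deepgram's streaming docs; treat the details as illustrative rather than the repo's exact code):

```python
import asyncio, json, os
import pyaudio, websockets

DG_URL = ("wss://api.deepgram.com/v1/listen"
          "?model=nova-3&punctuate=true&encoding=linear16&sample_rate=16000")
RATE, CHUNK = 16_000, 4_000          # 16-bit mono frames, ~250 ms of audio per send

async def main():
    auth = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    mic = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1,
                                 rate=RATE, input=True, frames_per_buffer=CHUNK)
    # On older websockets releases, pass extra_headers= instead of additional_headers=
    async with websockets.connect(DG_URL, additional_headers=auth) as ws:
        loop = asyncio.get_running_loop()

        async def send_audio():                  # mic -> Deepgram
            while True:
                pcm = await loop.run_in_executor(
                    None, lambda: mic.read(CHUNK, exception_on_overflow=False))
                await ws.send(pcm)

        async def print_transcripts():           # Deepgram -> console
            async for msg in ws:
                alt = json.loads(msg).get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    print("📝", alt["transcript"])

        await asyncio.gather(send_audio(), print_transcripts())

asyncio.run(main())
```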

Run with:
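Assuming you saved the sketch above as nova3_streamer.py (the filename is illustrative):

```bash
DEEPGRAM_API_KEY="dg_..." python nova3_streamer.py
```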

In the full agent (see realtime_voice_agent.py) the same call happens inside run_stt():
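The repo's implementation differs in detail, but the shape is the same connection wrapped in asyncio queues, roughly:

```python
import asyncio, json, os
import websockets

async def run_stt(audio_q: asyncio.Queue, utter_q: asyncio.Queue):
    """Sketch: raw mic bytes in via audio_q, finalized utterances out via utter_q."""
    auth = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    async with websockets.connect(DG_URL, additional_headers=auth) as ws:  # DG_URL as above

        async def pump_audio():
            while True:
                await ws.send(await audio_q.get())

        async def pump_text():
            async for msg in ws:
                data = json.loads(msg)
                alt = data.get("channel", {}).get("alternatives", [{}])[0]
                if data.get("is_final") and alt.get("transcript"):
                    await utter_q.put(alt["transcript"])   # hand off to the LLM stage

        await asyncio.gather(pump_audio(), pump_text())
```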

The snippet above shows how quickly Deepgram Nova-3 integrates via WebSocket, capturing mic audio (audio_q) and streaming accurate transcriptions (utter_q) straight to the next stage.

This ensures clean, punctuated input for the next step—reasoning.

🧠 Stage 2: Thinking with an LLM (Why We Pick GPT-4o)

The large language model (LLM) is the AI’s “brain,” responsible for understanding context and responding naturally. We chose OpenAI’s GPT-4o because of its:

  • Streaming capability and fast token-by-token response (key for overlap with TTS)

  • Low latency first-token generation (~200–400 ms)

  • Powerful reasoning and natural conversation capabilities

  • Multimodal input/output flexibility

  • Optimal for voice assistant tasks, Q&A, and dynamic IVR systems

GPT-4o delivers both:

RTF = real-time factor; lower is faster, and ≤ 1.0 means real time or faster.

⚡ LLM Prompting Pattern (from our voice agent script)

We stream user utterances from STT directly into GPT-4o and handle the streamed tokens for real-time responsiveness, like this:
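Here's a sketch of that call with the official openai Python client (v1+); the SYS_PROMPT shown is illustrative rather than the repo's exact prompt:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
SYS_PROMPT = "You are a concise voice assistant. Reply in six words or fewer."  # illustrative

def stream_reply(user_utt: str):
    """Yield GPT-4o-mini tokens as they arrive instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": SYS_PROMPT},
                  {"role": "user", "content": user_utt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```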

Note: The SYS_PROMPT ensures GPT-4o-mini keeps responses succinct (≤6 words), making it perfect for voice interactions.

Security tip: 🔒 Strip or hash PII in user_utt before logging or analytics.

You handle token streaming in real-time:
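Roughly like this, with token_q feeding the TTS stage (stream_reply() is the generator sketched above, and the ⚡ timing print mirrors the log line discussed below):

```python
import asyncio, time

async def run_llm(utter_q: asyncio.Queue, token_q: asyncio.Queue):
    # Note: a production agent would use the async OpenAI client or a worker
    # thread so the blocking generator doesn't stall the event loop.
    while True:
        user_utt = await utter_q.get()
        t0 = time.perf_counter()
        for i, token in enumerate(stream_reply(user_utt)):
            if i == 0:
                print(f"⚡ GPT first token {(time.perf_counter() - t0) * 1000:.0f} ms")
            await token_q.put(token)   # TTS can start speaking before the reply is complete
        await token_q.put(None)        # sentinel: this turn's reply is done
```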

The loop writes each streamed token into token_q so the next stage can start synthesizing before the LLM is done.

The GPT stage is tightly latency-bound, so we also log how fast the first token comes back (the ⚡ timing line in the sketch above).

This gives transparency into how the model performs turn-by-turn.

🔄 Alternative: Llama-3 (Open-source Option)

For security-critical deployments or offline operation, Meta’s Llama-3 70B offers open weights, local inference, and competitive latency:

  • Cost-effective (~$0.001 per 1K tokens on A100 GPU)

  • Fast inference (0.6–0.9× real-time factor)

  • Strong offline performance for privacy-sensitive applications

Always remember to redact PII from transcripts before LLM invocation.

🗣️ Stage 3: Speaking with Deepgram Aura-2

With text in hand, the voice response must sound natural and arrive quickly to maintain conversational flow. We chose Deepgram Aura-2 for this stack because:

  • Sub-200 ms Time-to-First-Byte (TTFB)—critical for staying < 3 s round-trip time (RTT).

  • 40+ high-quality voices, multilingual support.

  • Cost-effective streaming ($0.030 per 1,000 characters).

  • Unified authentication with Nova-3 (STT).

For more information, check out this article.

🎧 Aura-2 Streaming Setup (from our voice agent script)

Aura-2 handles real-time audio synthesis over WebSocket. The key lines in tts_sender() and tts_receiver() micro-batch tokens into roughly sentence-sized chunks, then push them to the socket.

Here's the core pattern we use:
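A sketch of the sender side (the Speak/Flush message types and the wss://api.deepgram.com/v1/speak endpoint follow Deepgram's streaming TTS docs; the voice name is illustrative):

```python
import asyncio, json

AURA_URL = ("wss://api.deepgram.com/v1/speak"
            "?model=aura-2-thalia-en&encoding=linear16&sample_rate=24000")

async def tts_sender(token_q: asyncio.Queue, ws):
    """Micro-batch LLM tokens into ~sentence-sized chunks and push them to Aura-2.

    `ws` is an open WebSocket to AURA_URL, authenticated like the STT socket.
    """
    buf = []
    while True:
        token = await token_q.get()
        if token is None:                                   # end-of-turn sentinel
            if buf:
                await ws.send(json.dumps({"type": "Speak", "text": "".join(buf)}))
            await ws.send(json.dumps({"type": "Flush"}))    # force synthesis of the tail
            buf.clear()
            continue
        buf.append(token)
        if token.rstrip().endswith((".", "!", "?")):        # roughly one sentence buffered
            await ws.send(json.dumps({"type": "Speak", "text": "".join(buf)}))
            buf.clear()
```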

…and the watchdog that decides when playback is really finished:
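A sketch of that receiver, with queue names and thresholds matching the three-pronged logic described next (the repo's exact handling may differ):

```python
import asyncio, json, time

async def tts_receiver(ws, pcm_q: asyncio.Queue,
                       queue_empty_wait: float = 0.25,
                       silence_timeout_max: float = 3.0):
    """Collect Aura-2 audio and decide when this turn's playback is really over."""
    last_audio = time.monotonic()
    while True:
        try:
            msg = await asyncio.wait_for(ws.recv(), timeout=queue_empty_wait)
        except asyncio.TimeoutError:
            if pcm_q.empty():                               # 2) no PCM left for >= 250 ms
                break
            if time.monotonic() - last_audio > silence_timeout_max:
                break                                       # 3) hard stop after 3 s of silence
            continue
        if isinstance(msg, (bytes, bytearray)):             # raw PCM frames -> playback queue
            await pcm_q.put(msg)
            last_audio = time.monotonic()
        elif json.loads(msg).get("type") == "PlaybackFinished":
            break                                           # 1) explicit control frame
```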

Playback detection lives in tts_receiver() with a three-pronged watchdog:

  1. A PlaybackFinished control frame (handled in finished_playback())

  2. PCM queue emptied for ≥ 250 ms

  3. Hard stop after silence_timeout_max (3 s default)

With this loop, the user hears the assistant ~600–1,000 ms after finishing a sentence (there may be a cold start on your first call, which can increase TTFB), then sees:

… and the loop repeats. Exactly the UX we want. ✅

📜 Full Working Script

The complete file lives at realtime_voice_agent/realtime_voice_agent.py (see repo). 

Clone → install deps → run → speak, hear the reply, then speak again when you see “🎤 You can speak now …”:
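In shell terms (substitute your clone URL; the requirements file and entry point are assumptions based on the repo layout above):

```bash
git clone <repo-url> && cd realtime_voice_agent
pip install -r requirements.txt                 # assumes the repo ships a requirements file
export DEEPGRAM_API_KEY="dg_..." OPENAI_API_KEY="sk-..."
python realtime_voice_agent.py                  # speak when you see "🎤 You can speak now …"
```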

Here is what happens:

  • STT chunks (8 k) stream out every ~167 ms.

  • GPT-4o first token typically arrives 100-200 ms later (logged as ⚡ GPT first token … ms).

  • Aura-2 starts audio in another < 200 ms.

  • The watchdog finishes and prints the RTT log lines with timestamps like:

Set ALLOW_INTERRUPT=True if you want to talk over the voice (use headphones!).

End-to-End Latency Targets

*MacBook M1 Pro 2021, 48 kHz mic, 51 Mbps Wi-Fi.

Next Steps to Reduce RTT below 1000 ms

In production, you can bring this RTT below 1,000 ms with the following tweaks (a config sketch follows the list):

  • Bump RATE down to 16 kHz (saves ~6 KB/s over the wire)

  • Replace SYS_PROMPT with a domain-specific policy or function-calling schema.

  • Adjust SEND_EVERY and the streaming buffer size.

  • Lower queue_empty_wait to 0.15 s and run the agent and LLM in the same region.

  • Move inference closer to the edge (via Whisper + local LLM [e.g., Llama-3 via vLLM] + open-source TTS).
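As a sketch, most of these knobs are plain constants near the top of the script (the names come from this article; the values shown are illustrative, not the repo's defaults):

```python
# Tunables referenced above; tweak and re-benchmark one at a time.
RATE = 16_000              # mic sample rate; a lower rate means fewer bytes over the wire
SEND_EVERY = 0.125         # seconds of audio buffered per WebSocket send
queue_empty_wait = 0.15    # how long the TTS watchdog waits on an empty PCM queue
ALLOW_INTERRUPT = False    # True lets you talk over the agent (use headphones)
SYS_PROMPT = "You are a billing-support agent. Reply in six words or fewer."  # domain policy
```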

🎯 Production Readiness: Scaling and Observability

Once your real-time Voice AI agent works locally, the next challenge is production readiness—making it reliable, observable, and scalable.

Deployment Models

  • Edge deployments: Great for low-latency, geographically dispersed user bases (ideal for consumer-facing apps). Edge-based apps use WebRTC/JS audio capture in the browser → WebSocket relay to STT. Keep each WebSocket < 30 s to avoid edge idle limits.

  • Container deployments: Easier to manage and autoscale, suitable for enterprise or internal-use workloads. Container-based apps run everything server-side (Python + FastAPI/Flask, plus WebSockets), which makes scaling and observability easier. Pin each user to a pod for simpler connection reuse and GPU LLM colocation.

For production, containerize your app and autoscale based on concurrent TTS requests using Cloud Run, ECS Fargate, or K8s with a Horizontal Pod Autoscaler.

Observability and Metrics

Treat the voice stack like any other stateless microservice: autoscale horizontally, emit structured metrics, and alert on 95th-percentile latency. 

Use standard metrics to monitor your system:
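For instance, the key per-turn series can be defined with the prometheus_client library (the metric names here are illustrative):

```python
from prometheus_client import Counter, Histogram, start_http_server

STT_LATENCY   = Histogram("stt_final_latency_seconds", "Audio end to final transcript")
LLM_FIRST_TOK = Histogram("llm_first_token_seconds", "Prompt sent to first token")
TTS_TTFB      = Histogram("tts_ttfb_seconds", "Text sent to first audio byte")
TURN_RTT      = Histogram("turn_rtt_seconds", "User stops speaking to first audio heard")
WS_ERRORS     = Counter("ws_errors_total", "WebSocket errors/disconnects", ["stage"])

start_http_server(9100)   # exposes /metrics for Prometheus to scrape

# Example usage: TURN_RTT.observe(rtt_ms / 1000); WS_ERRORS.labels(stage="stt").inc()
```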

Log these and expose them to Prometheus (see this guide). If you are running a self-hosted or containerized deployment, add the target to your prometheus.yml configuration file.

Locate the scrape_configs section and add a new job with the Engine container instance as a target (excerpt from the Deepgram docs):
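A minimal version of that job looks like this (the job name and target are placeholders for your Engine container instance):

```yaml
scrape_configs:
  - job_name: "deepgram-engine"
    scrape_interval: 15s
    static_configs:
      - targets: ["<engine-host>:<metrics-port>"]   # your Engine container's metrics endpoint
```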

Add 95th-percentile latency spike alerts in Prometheus Alertmanager to proactively catch slowdowns. 

With the integration complete, you can query the collected metrics using the Prometheus web interface or API. Consider a tool like Grafana for visualization and alerting.

Rate-limit and Token Management

  • Deepgram workspace cap: 120 req/s / key.

  • Multiplex 4 WebSockets per user (STT, TTS, heartbeat, analytics) → keep bursts ≤ 20 req/s.

  • Rotate keys weekly via CI secret store; invalidate on user logout.

💰 Cost Optimization Tips

Cost scales with usage. Here’s how to optimize:

Enable Opus in Deepgram via &encoding=opus – still < 300 ms decode.

⚠️ Common Pitfalls, Symptoms, and Fixes

Common deployment issues, the symptoms, and immediate fixes:

🏁 Wrap-Up and Next Steps

You now have 🎉

  • A production-grade blueprint: Nova-3 (STT) ➜ GPT-4o (LLM) ➜ Aura-2 (TTS) with sub-3 s round-trip.

  • Drop-in, commented code (see realtime_voice_agent.py) you can fork and run today.

  • Benchmarks showing industry-leading latency, accuracy, and cost-efficiency.

Whether you're building an in-game NPC, a voice-based assistant, or a smart IVR agent, this blueprint gives you a reliable, scalable, and fast starting point.

🎯 What to do next?

🙌 Thanks for building with us

We created this guide to give you real-world code, production insights, and performance confidence.
