Article·Jul 21, 2025

Designing Voice AI Workflows Using STT + NLP + TTS

Voice AI architecture built on STT → NLP → TTS still delivers the lowest latency and the greatest flexibility for customer-facing apps. If you want to build such a pipeline yourself from the ground up, check out this tutorial!

10 min read

By Stephen Oladele


⏩ TL;DR

  • Voice AI architecture built on STT → NLP → TTS still delivers the lowest latency and the greatest flexibility for customer-facing apps.
  • IMPORTANT NOTE: If you want to develop this pipeline yourself, this article functions as a tutorial. However, if you’d like everything streamlined for you out-of-the-box, check out our Voice Agent API!
  • Deepgram’s streaming STT offers sub-300 ms transcription with >90 % accuracy out-of-the-box.
  • The new Aura-2 TTS model introduces enterprise-grade voices and sub-200 ms time-to-first-byte (TTFB) — 2–4× faster than many incumbents.
  • Pair both with an LLM layer (GPT-4o, Claude 3, Gemma 2, etc.) to build a conversational loop that feels “instant” to users.

Follow the step-by-step tutorial and repo below, then claim $200 in free Deepgram credits to test your own STT + NLP + TTS stack.


“Alexa, what’s the weather?” 

That short request triggers over two dozen microservices—yet the entire round trip still feels instant. At its core, building a real-time voice interaction involves a straightforward flow: speech-to-text → language reasoning → text-to-speech.

Master that cascade, and you can build anything from hands-free CRMs to real-time game narrators.

Why does this matter in 2025?
Voice users now expect sub-second back-and-forth. Miss that mark, and they tap the screen instead. The surest way to stay under the human turn-taking threshold (≈800 ms) is the proven STT → NLP → TTS pipeline.

What you’ll learn and build:

  • Transcribe with Deepgram Nova-3—sub-300 ms word finalization and industry-leading WER.
  • Reason with GPT-4o-mini (or a drop-in open-source model) for natural, tool-aware replies.
  • Speak with Deepgram Aura-2—< 200 ms TTFB (Time to First Byte) and 40-plus professional voices.

By the end, you’ll have:

  • A runnable repo and sub-2,500 ms round-trip latency benchmark (and a roadmap to < 800 ms with edge optimization).
  • A production-ready blueprint you can drop into call centres, IVRs, in-game companions, or any workflow that needs a voice.

Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.

🎙️ Stage 1: Choosing the Right STT (Why Deepgram Nova-3)

The first step in any voice agent pipeline is transcription: converting speech to text. This needs to be fast, accurate, and streaming-capable.

We chose Deepgram Nova-3 for its real-time performance:

| Metric | Nova-3 | Typical Cloud ASR |
| --- | --- | --- |
| Median WER (streaming) | 6.8% | 14–18% |
| Median final-word latency | < 300 ms | 350–500 ms |
| Cost/hour (English, PAYG) | $0.0077 | $0.012–0.020 |

Source: internal Deepgram benchmark, updated Apr 2025.

Features that matter for our stack:

  • Supports API integration via REST API and WebSocket-based streaming (no lock-in SDK).
  • < 300 ms streaming latency—maps neatly to our 300 ms STT budget and gives us headroom for GPT + TTS.
  • High accuracy on noisy, multi-speaker calls and domain-specific audio.
  • One API token works for both STT and TTS.

You also get features like domain keyword biasing, which injects key terms so the model snaps to domain jargon without retraining:

{ "keywords": ["angioplasty", "myo-inversion", "anastomosis"] }

💻 Quick Start: 20-line Nova-3 Streamer

Here are the prerequisites to build the voice AI agent in our demo (see the repo for the complete code): a recent Python 3 install, a Deepgram API key (plus an OpenAI key for Stage 2), and the packages in requirements.txt (pyaudio, websockets, openai, deepgram-sdk, python-dotenv).

Let’s get started! 🚀

# stt_stream.py  (⚡ ultra-minimal)

import asyncio, websockets, json, pyaudio, os
DG_KEY = os.getenv("DEEPGRAM_API_KEY")  # ensure DEEPGRAM_API_KEY is set in your environment
URL    = "wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000"

async def main():
    mic = pyaudio.PyAudio().open(format=pyaudio.paInt16,
                                 channels=1, rate=16000,
                                 input=True, frames_per_buffer=4096)
    async with websockets.connect(URL,
          extra_headers={"Authorization": f"Token {DG_KEY}"}) as ws:

        async def sender():           # 🎤 (your microphone) → Deepgram
            while True:
                await ws.send(mic.read(4096, exception_on_overflow=False))

        async def receiver():         # Deepgram → 📄 (real-time transcript)
            async for msg in ws:
                data = json.loads(msg)
                if "channel" not in data:   # skip metadata frames
                    continue
                t = data["channel"]["alternatives"][0]["transcript"]
                if t: print("👂", t)

        await asyncio.gather(sender(), receiver())

asyncio.run(main())

Run with:

export DEEPGRAM_API_KEY=...
python stt_stream.py

# Speak 🎤 to get the real-time transcription back 📃 ... ✨

In the full agent (see realtime_voice_agent.py) the same call happens inside run_stt():

async def run_stt():
    url = (f"wss://api.deepgram.com/v1/listen?"
           f"model={STT_MODEL}&encoding=linear16&sample_rate={RATE}"
           f"&punctuate=true&interim_results=false")        # STT_MODEL = 'nova-3'
    async with websockets.connect(url,
                extra_headers={"Authorization": f"Token {DG_API}"}) as ws:
        log("🟢 STT WebSocket open")
        await asyncio.gather(stt_sender(ws), stt_receiver(ws))

The snippet above shows how quickly Deepgram Nova-3 integrates via WebSocket, capturing mic audio (audio_q) and streaming accurate transcriptions (utter_q) straight to the next stage.

This ensures clean, punctuated input for the next step—reasoning.
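For reference, here is a minimal sketch of what stt_sender() and stt_receiver() might look like, assuming audio_q and utter_q are asyncio.Queue instances as in the repo (the real script's field handling may differ):

import asyncio, json

# Assumptions: audio_q holds raw PCM chunks from the mic callback,
# utter_q feeds finalized utterances to the LLM stage.
audio_q: asyncio.Queue = asyncio.Queue()
utter_q: asyncio.Queue = asyncio.Queue()

async def stt_sender(ws):
    """Pump mic audio from audio_q into the Deepgram socket."""
    while True:
        chunk = await audio_q.get()
        await ws.send(chunk)

async def stt_receiver(ws):
    """Push finalized transcripts from Deepgram into utter_q."""
    async for msg in ws:
        data = json.loads(msg)
        if "channel" not in data:              # skip metadata frames
            continue
        text = data["channel"]["alternatives"][0]["transcript"]
        if text and data.get("is_final", True):
            await utter_q.put(text)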

Repeating the Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.

🧠 Stage 2: Thinking with an LLM (Why We Pick GPT-4o)

The large language model (LLM) is the AI’s “brain,” responsible for understanding context and responding naturally. We chose OpenAI’s GPT-4o because of its:

  • Streaming capability and fast token-by-token response (key for overlap with TTS)
  • Low latency first-token generation (~200–400 ms)
  • Powerful reasoning and natural conversation capabilities
  • Multimodal input/output flexibility
  • Optimal for voice assistant tasks, Q&A, and dynamic IVR systems

GPT-4o delivers both speed and reasoning quality. Here is how it compares with the alternatives:

| Model | Avg RTF ↓ | Self-host | $/1K tok | Best-fit |
| --- | --- | --- | --- | --- |
| GPT-4o | 1.0–1.3× | ✖️ | $0.005–0.01 | Open-ended reasoning |
| Claude 3 Sonnet | 0.8–1.1× | ✖️ | $0.003 | Task agents |
| Llama-3 70B | 0.6–0.9× | ✔️ (Ollama/vLLM) | ≈ $0.001 (GPU) | On-prem/offline/edge |

RTF ≈ “real-time factor” – lower = faster; ≤ 1.0 means real time.

⚡ LLM Prompting Pattern (from our voice agent script)

We stream user utterances from STT directly into GPT-4o, handling streaming tokens for real-time responsiveness in your voice agent like this:

# excerpt from realtime_voice_agent.py
SYS_PROMPT = (
  "You are a succinct, helpful assistant. Respond in ≤6 words."
  "Keep answers short, direct, friendly."
)

stream = oa_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": user_utt}
    ],
    stream=True,                 # <-- our script streams
    temperature=0.4              # keeps answers tight
)

Note: The SYS_PROMPT ensures GPT-4o-mini keeps responses succinct (≤6 words), making it perfect for voice interactions.

Security tip: 🔒 Strip or hash PII in user_utt before logging or analytics.
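One lightweight way to do that is a couple of regexes over the utterance before it reaches your logs. This is a hypothetical redact_pii helper, not part of the repo; it reuses the script's log and user_utt names:

import re

# Hypothetical helper — masks obvious emails and phone-like numbers before logging.
# Real deployments should use a proper PII/PHI detection service.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

log(f"📝 User: {redact_pii(user_utt)}")   # log only the redacted form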

You handle token streaming in real-time:

# We micro-batch the token stream straight into the TTS stage (see token_q).

for chunk in stream:
    tok = chunk.choices[0].delta.content
    if tok:
        await token_q.put(tok)

The loop writes each streamed token into token_q so the next stage can start synthesizing before the LLM is done.

The GPT stage is tightly bound to latency. We even log how fast the first token comes back:

if first_tok:
    log(f"⚡ GPT first token {int((time.perf_counter()-t0)*1000)} ms")
    first_tok = False

This gives transparency into how the model performs turn-by-turn.

🔄 Alternative: Llama-3 (Open-source Option)

For security-critical deployments or offline operation, Meta’s Llama-3 70B offers open weights, local inference, and competitive latency:

  • Cost-effective (~$0.001 per 1K tokens on A100 GPU)
  • Fast inference (0.6–0.9× real-time factor)
  • Strong offline performance for privacy-sensitive applications

Always remember to redact PII from transcripts before LLM invocation.
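If you do swap in Llama-3, the GPT-4o call above barely changes. Here is a sketch assuming an OpenAI-compatible local server such as Ollama or vLLM; the base URL, placeholder key, and model tag are assumptions to adjust for your setup:

from openai import OpenAI

# Point the same client at a local OpenAI-compatible endpoint (Ollama shown).
# base_url, api_key placeholder, and model tag are assumptions.
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = local_client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "system", "content": SYS_PROMPT},
              {"role": "user", "content": user_utt}],
    stream=True,
    temperature=0.4,
)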

🗣️ Stage 3: Speaking with Deepgram Aura-2

With text in hand, the voice response must sound natural and arrive quickly to maintain conversational flow. We chose Deepgram Aura-2 for this stack because:

  • Sub-200 ms Time-to-First-Byte (TTFB)—critical for staying < 3 s round-trip time (RTT).
  • 40+ high-quality voices, multilingual support.
  • Cost-effective streaming ($0.030 per 1,000 characters).
  • Unified authentication with Nova-3 (STT).

For more information, check out this article.
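Before wiring Aura-2 into the streaming loop, you can sanity-check a voice with a single REST call. This is a sketch; the voice/model name is an assumption, so pick one from the Deepgram voice list:

import os, requests

DG_KEY = os.getenv("DEEPGRAM_API_KEY")

resp = requests.post(
    "https://api.deepgram.com/v1/speak?model=aura-2-thalia-en",  # voice name is an assumption
    headers={"Authorization": f"Token {DG_KEY}",
             "Content-Type": "application/json"},
    json={"text": "Hello! Your voice agent is alive."},
    timeout=30,
)
resp.raise_for_status()
with open("hello.mp3", "wb") as f:   # default container is MP3
    f.write(resp.content)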

🎧 Aura-2 Streaming Setup (from our voice agent script)

Aura-2 handles real-time audio synthesis over a WebSocket. The key lines in tts_sender() and tts_receiver() micro-batch tokens into roughly sentence-length chunks before pushing them to Aura-2.

Here's the core pattern we use:

# -- sender: micro-batch tokens every ~180 chars
if tok == "[[FLUSH]]":
    await ws.send(json.dumps({"type": "Flush"}))
    speaking.set()               # mic pauses while the agent talks
else:
    buffer.append(tok)
    if char_len(buffer) >= 180:
        await ws.send(json.dumps({"type": "Speak", "text": "".join(buffer)}))
        buffer.clear()

# -- receiver: first PCM byte
if isinstance(msg, bytes):          # PCM chunk
    if not first_audio:
        log("🎧 Aura audio started")  # first byte -> latency probe
        first_audio = True
        speaking.set()
    spk.play(msg)                   # -> PyAudio out

…and the watchdog that decides when playback is really finished:

# triggers after queue empty 250 ms OR hard ceiling 3 s
if empty_for(queue_empty_wait) or now-last_audio > 3.0:
    finished_playback()          # clears `speaking`, prints RTT

Playback detection lives in tts_receiver() with a three-pronged watchdog:

  1. A PlaybackFinished control frame from Aura, handled in finished_playback()
  2. PCM queue emptied for ≥ 250 ms
  3. Hard stop after silence_timeout_max (3 s default)
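Prong 1 is handled directly by the receiver when the control frame arrives; prongs 2 and 3 can be implemented as a small asyncio task. A minimal sketch, assuming the speaking event, finished_playback(), and timing constants mirror the script (the repo's bookkeeping may differ):

import asyncio, time

queue_empty_wait    = 0.25   # seconds the PCM queue must stay empty
silence_timeout_max = 3.0    # hard stop since the last audio chunk

async def playback_watchdog(pcm_q: asyncio.Queue, speaking: asyncio.Event,
                            finished_playback, get_last_audio_ts):
    """get_last_audio_ts() returns the time of the most recent PCM chunk (kept by tts_receiver)."""
    empty_since = None
    while speaking.is_set():
        now = time.perf_counter()
        if pcm_q.empty():
            empty_since = empty_since or now     # remember when the queue drained
        else:
            empty_since = None
        queue_drained = empty_since and (now - empty_since) >= queue_empty_wait
        hard_timeout  = (now - get_last_audio_ts()) >= silence_timeout_max
        if queue_drained or hard_timeout:
            finished_playback()                  # clears `speaking`, prints the RTT line
            return
        await asyncio.sleep(0.05)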

With this loop, the user hears the assistant ~600-1,000 ms after finishing a sentence, then sees the log output below (the first call may hit a cold start, which increases TTFB):

🌊 Aura finishing playback...
⏱ End-to-end RTT: 812 ms
🎤 You can speak now …

… and the loop repeats. Exactly the UX we want. ✅

A few cost-saving tactics worth noting before we look at the full script:

| Tactic | How | Savings |
| --- | --- | --- |
| Cache TTS for static prompts (FAQs, greetings) | Redis key = MD5(text) → WAV blob | ≈ 80% char cost |
| Batch STT (Nova-2) for voicemail-style jobs/non-live tasks | /v1/listen file API | 35% cheaper on bulk |
| Opus instead of PCM to client (STT WS) | 48 kHz → 24 kHz Opus | ↓ 4× bandwidth (lower egress) |
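The first row is easy to prototype. Here is a minimal sketch of an MD5-keyed Redis cache in front of synthesis; the Redis host and the synth_tts() helper are assumptions, not from the repo:

import hashlib
import redis

# Assumptions: a local Redis and a synth_tts(text) -> bytes helper that calls Aura-2.
r = redis.Redis(host="localhost", port=6379)
TTL = 7 * 24 * 3600                      # keep cached audio for a week

def tts_cached(text: str) -> bytes:
    key = "tts:" + hashlib.md5(text.encode()).hexdigest()
    audio = r.get(key)
    if audio is None:                    # cache miss → synthesize once, store the blob
        audio = synth_tts(text)
        r.set(key, audio, ex=TTL)
    return audio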

📜 Full Working Script

The complete file lives at realtime_voice_agent/realtime_voice_agent.py (see repo). 

Clone → install deps → run → speak, hear the reply, then speak again when you see “🎤 You can speak now …”:

git clone https://github.com/<you>/realtime_voice_agent.git
cd realtime_voice_agent
python -m venv .venv && source .venv/bin/activate

pip install -r requirements.txt     # pyaudio, websockets, openai, deepgram-sdk, python-dotenv

cp .env.example .env                # add your API keys

python realtime_voice_agent.py

Here is what happens:

  • STT chunks (8 k samples at 48 kHz) stream out every ~167 ms.
  • GPT-4o-mini’s first token typically arrives ~700–1,600 ms later (logged as ⚡ GPT first token … ms).
  • Aura-2 starts audio in another < 200 ms.
  • The watchdog finishes and prints the RTT log lines with timestamps like:
[ 4.11s] 📝 User: Hi there
[ 5.90s] ⚡ GPT first token 1748 ms
[ 6.51s] 🎧 Aura audio started
[ 7.42s] 🌊 Aura finishing playback...
[ 7.42s] ⏱ End-to-end RTT: 1310 ms

Tweak ALLOW_INTERRUPT=True if you want to talk over the voice (use headphones!).

End-to-End Latency Targets

| Stage | Budget | What we observed* |
| --- | --- | --- |
| STT final-word | ≤ 300 ms | 180–282 ms |
| GPT first token | ≤ 2,000 ms | 719–1,608 ms |
| TTS TTFB | ≤ 250 ms | 140–237 ms |
| RTT (P95) | ≤ 3,000 ms | 812–2,854 ms |

*MacBook M1 Pro 2021, 48 kHz mic, 51 Mbps Wi-Fi.

Next Steps to Reduce RTT below 1000 ms

In production, you can bring this RTT below 1,000 ms by:

  • Bumping RATE down to 16 kHz (saves roughly 64 KB/s of raw 16-bit PCM over the wire).
  • Replacing SYS_PROMPT with a domain-specific policy or function-calling schema.
  • Adjusting SEND_EVERY and the streaming buffer size.
  • Lowering queue_empty_wait to 0.15 s and running the agent and LLM in the same region.
  • Moving inference closer to the edge (Whisper + local LLM [e.g., Llama-3 via vLLM] + open-source TTS).
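For orientation, these knobs live as constants near the top of the script. A sketch of the kind of values you would tune; the defaults shown are illustrative assumptions, not the repo's exact settings:

# Tunables referenced above — values are illustrative assumptions, not the repo defaults.
RATE             = 16_000   # mic sample rate in Hz (the benchmark run used 48 kHz)
SEND_EVERY       = 0.167    # seconds of audio per WebSocket send
queue_empty_wait = 0.15     # seconds of output silence before playback counts as done
STT_MODEL        = "nova-3"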

🎯 Production Readiness: Scaling and Observability

Once your real-time Voice AI agent works locally, the next challenge is production readiness—making it reliable, observable, and scalable.

Deployment Models

  • Edge deployments: Great for low-latency, geographically dispersed user bases (ideal for consumer-facing apps). Edge-based apps use WebRTC/JS audio capture in the browser → WebSocket relay to STT. Keep each WebSocket < 30 s to avoid edge idle limits.
  • Container deployments: Easier to manage and autoscale, suitable for enterprise or internal-use workloads. Container-based apps run everything server-side (Python + FastAPI/Flask, plus WebSockets), which makes scaling and observability easier. Pin each user to a pod for easier connection reuse and GPU LLM colocation.

For production, containerize your app and autoscale based on concurrent TTS requests using Cloud Run, ECS Fargate, or K8s with a Horizontal Pod Autoscaler.

Observability and Metrics

Treat the voice stack like any other stateless microservice: autoscale horizontally, emit structured metrics, and alert on 95th-percentile latency. 

Use standard metrics to monitor your system:

Set of standard metrics for a modern production-ready Voice AI stack

Log these and expose them to Prometheus (see this guide). If you are running a self-hosted or containerized deployment, add the target to your prometheus.yml configuration file.

Locate the scrape_configs section and add a new job with the Engine container instance as a target (excerpt adapted from the Deepgram docs):

scrape_configs:
  - job_name: voice_agent
    static_configs:
      - targets: ["<ENGINE_INSTANCE_URI>:<HOST_PORT>"]
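On the agent side, a minimal sketch of emitting these timings with the prometheus_client library; the port, buckets, and helper wiring are assumptions, while the round_trip_ms histogram name matches the alert rule below:

from prometheus_client import Histogram, start_http_server

# Histograms for each stage; round_trip_ms produces the round_trip_ms_bucket
# series used in the alert rule below. Buckets and port are illustrative.
stt_ms = Histogram("stt_final_word_ms", "STT final-word latency (ms)",
                   buckets=[100, 200, 300, 500, 1000])
gpt_ms = Histogram("gpt_first_token_ms", "LLM first-token latency (ms)",
                   buckets=[200, 500, 1000, 2000, 4000])
rtt_ms = Histogram("round_trip_ms", "End-to-end round trip (ms)",
                   buckets=[500, 1000, 2000, 3000, 5000])

start_http_server(9100)        # Prometheus scrapes this port

# e.g. right after the RTT log line in the agent:
# rtt_ms.observe(rtt_milliseconds)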

Define a 95th-percentile latency alert rule in Prometheus (routed through Alertmanager) to proactively catch slowdowns.

# Prometheus alerting rule: fire when P95 RTT exceeds 1 s for 3 minutes
- alert: RTT_95thHigh
  expr: histogram_quantile(0.95, sum(rate(round_trip_ms_bucket[5m])) by (le)) > 1000
  for: 3m
  labels: { severity: "page" }

With the integration complete, you can now query the collected metrics using the Prometheus web interface or API. You may consider using a tool like Grafana for handling visualization and alerting.

AI-generated Grafana dashboard monitoring STT, GPT, and TTS latencies.

Rate-limit and Token Management

  • Deepgram workspace cap: 120 req/s per key.
  • Multiplex four WebSockets per user (STT, TTS, heartbeat, analytics) and keep bursts ≤ 20 req/s.
  • Rotate keys weekly via CI secret store; invalidate on user logout.

| Failure | Auto-Recovery in realtime_voice_agent.py |
| --- | --- |
| GPT 5XX | Return a short “Give me a second…” reply and retry the last user turn asynchronously (max 3 attempts) |
| TTS timeout > 3 s | Close the WS; send 50 ms of zero-audio every 8 s (already in mic_cb) |
| Partial transcript missing | Fall back to the last “interim” text or send a one-word ack (“Sure!”) |
| STT 4XX | Exponential back-off + resume the audio queue (code hook below) |

# If STT socket drops, flush mic buffer and resume – code hook ↓

async def stt_sender(ws):
    while True:
        try:
            await ws.send(await audio_q.get())
        except websockets.ConnectionClosed:
            log("🔄  STT reconnect…")
            await asyncio.sleep(0.8)
            return   # enclosing task restarts socket
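The “enclosing task restarts socket” part can be as simple as a supervisor loop around run_stt(). A sketch, with backoff values as assumptions:

async def stt_supervisor():
    """Keep the STT socket alive: restart run_stt() whenever it exits."""
    backoff = 0.8
    while True:
        try:
            await run_stt()                     # returns when the socket drops
        except Exception as exc:                # 4XX / handshake errors land here
            log(f"🔄 STT restart after error: {exc}")
        await asyncio.sleep(backoff)
        backoff = min(backoff * 2, 10)          # exponential back-off, capped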

💰 Cost Optimization Tips

Cost scales with usage. Here’s how to optimize (see also the cost-saving tactics table above):

Repeating the Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.

Enable Opus in Deepgram via &encoding=opus – still < 300 ms decode.

⚠️ Common Pitfalls, Symptoms, and Fixes

Common deployment issues, the symptoms, and immediate fixes:

A list of common pitfalls, symptoms, and fixes for building production-grade voice AI applications.

🏁 Wrap-Up and Next Steps

You now have 🎉

  • A production-grade blueprint: Nova-3 (STT) ➜ GPT-4o (LLM) ➜ Aura-2 (TTS) with a sub-3 s round trip.
  • Drop-in, commented code (see realtime_voice_agent.py) you can fork and run today.
  • Benchmarks showing industry-leading latency, accuracy, and cost-efficiency.

Whether you're building an in-game NPC, a voice-based assistant, or a smart IVR agent, this blueprint gives you a reliable, scalable, and fast starting point.

🎯 What to do next?

Claim your $200 in free Deepgram credits, fork the repo, and start benchmarking your own STT + NLP + TTS stack.

🙌 Thanks for building with us

We created this guide to give you real-world code, production insights, and performance confidence.
