By Stephen Oladele
Last Updated
⏩ TL;DR
- Voice AI architecture built on STT → NLP → TTS still delivers the lowest latency and the greatest flexibility for customer-facing apps.
- IMPORTANT NOTE: If you want to develop this pipeline yourself, this article functions as a tutorial. However, if you’d like everything streamlined for you out-of-the-box, check out our Voice Agent API!
- Deepgram’s streaming STT offers sub-300 ms transcription with >90 % accuracy out-of-the-box.
- The new Aura-2 TTS model introduces enterprise-grade voices and sub-200 ms time-to-first-byte (TTFB) — 2–4× faster than many incumbents.
- Pair both with an LLM layer (GPT-4o, Claude 3, Gemma 2, etc.) to build a conversational loop that feels “instant” to users.
Follow the step-by-step tutorial and repo below, then claim $200 in free Deepgram credits to test your own STT → NLP → TTS stack.
“Alexa, what’s the weather?”
That short request triggers over two dozen microservices—yet the entire round trip still feels instant. At its core, building a real-time voice interaction involves a straightforward flow: speech-to-text → language reasoning → text-to-speech.
Master that cascade, and you can build anything from hands-free CRMs to real-time game narrators.
Why does this matter in 2025?
Voice users now expect sub-second back-and-forth. Miss that mark, and they tap the screen instead. The surest way to stay under the human turn-taking threshold (≈800 ms) is the proven STT → NLP → TTS pipeline.
What you’ll learn and build:
- Transcribe with Deepgram Nova-3—sub-300 ms word finalization and industry-leading WER.
- Reason with GPT-4o-mini (or a drop-in open-source model) for natural, tool-aware replies.
- Speak with Deepgram Aura-2—< 200 ms TTFB (Time to First Byte) and 40-plus professional voices.
By the end, you’ll have:
- A runnable repo and sub-2,500 ms round-trip latency benchmark (and a roadmap to < 800 ms with edge optimization).
- A production-ready blueprint you can drop into call centres, IVRs, in-game companions, or any workflow that needs a voice.
Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.
🎙️ Stage 1: Choosing the Right STT (Why Deepgram Nova-3)
The first step in any voice agent pipeline is transcription: converting speech to text. This needs to be fast, accurate, and streaming-capable.
We chose Deepgram Nova-3 for its real-time performance:
| Metric | Nova-3 | Typical Cloud ASR |
| --- | --- | --- |
| Median WER (streaming) | 6.8% | 14–18% |
| Median final-word latency | < 300 ms | 350–500 ms |
| Cost/hour (English, PAYG) | $0.0077 | $0.012–0.020 |
Source: internal Deepgram benchmark, updated Apr 2025.
Features that matter for our stack:
- Integrates via REST and WebSocket-based streaming (no SDK lock-in).
- < 300 ms streaming latency—maps neatly to our 300 ms STT budget and gives us headroom for GPT + TTS.
- High accuracy on noisy, multi-speaker calls, domain-specific data.
- One token works for STT and TTS.
You also get features like domain keyword biasing: inject key terms so the model snaps to domain jargon without retraining:

{ "keywords": ["angioplasty", "myo-inversion", "anastomosis"] }

💻 Quick Start: 20-line Nova-3 Streamer
Here are the prerequisites to build the voice AI agent in our demo (see the repo for the complete code):
- Python ≥ 3.9
- PortAudio/PyAudio (for mic and playback)
- A Deepgram API key with Nova-3 and Aura-2 access (sign up for free 200 USD credits; should be more than enough for a starter app).
- An OpenAI API key (GPT-4o-mini).
Let’s get started! 🚀
# stt_stream.py (⚡ ultra-minimal)
import asyncio, websockets, json, pyaudio, os

DG_KEY = os.getenv("DEEPGRAM_API_KEY")  # ensure this env var is set before running
URL = "wss://api.deepgram.com/v1/listen?model=nova-3"

async def main():
    mic = pyaudio.PyAudio().open(format=pyaudio.paInt16,
                                 channels=1, rate=16000,
                                 input=True, frames_per_buffer=4096)
    async with websockets.connect(URL,
            extra_headers={"Authorization": f"Token {DG_KEY}"}) as ws:

        async def sender():    # 🎤 (your microphone) → Deepgram
            while True:
                await ws.send(mic.read(4096, exception_on_overflow=False))

        async def receiver():  # Deepgram → 📄 (real-time transcript)
            async for msg in ws:
                t = json.loads(msg)["channel"]["alternatives"][0]["transcript"]
                if t: print("👂", t)

        await asyncio.gather(sender(), receiver())

asyncio.run(main())

Run with:
export DEEPGRAM_API_KEY=...
python stt_stream.py
# Speak 🎤 to get the real-time transcription back 📃 ... ✨

In the full agent (see realtime_voice_agent.py) the same call happens inside run_stt():
async def run_stt():
    url = (f"wss://api.deepgram.com/v1/listen?"
           f"model={STT_MODEL}&encoding=linear16&sample_rate={RATE}"
           f"&punctuate=true&interim_results=false")  # STT_MODEL = 'nova-3'
    async with websockets.connect(url,
            extra_headers={"Authorization": f"Token {DG_API}"}) as ws:
        log("🟢 STT WebSocket open")
        await asyncio.gather(stt_sender(ws), stt_receiver(ws))

The snippet above shows how quickly Deepgram Nova-3 integrates via WebSocket: it captures mic audio (audio_q) and streams accurate transcriptions (utter_q) straight to the next stage.
This ensures clean, punctuated input for the next step—reasoning.
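To make the plumbing concrete, here is a hedged sketch of how those two queues could be wired up. The queue names (audio_q, utter_q) and mic_cb come from the article; the exact implementation in realtime_voice_agent.py may differ.

```python
# Hedged sketch of the queue plumbing described above. audio_q carries raw mic
# PCM into stt_sender(); utter_q carries finalized transcripts to the LLM stage.
import asyncio, json
import pyaudio

audio_q: asyncio.Queue = asyncio.Queue()
utter_q: asyncio.Queue = asyncio.Queue()

def make_mic_cb(loop: asyncio.AbstractEventLoop):
    """Build a PyAudio callback that hands raw PCM to the asyncio side."""
    def mic_cb(in_data, frame_count, time_info, status):
        loop.call_soon_threadsafe(audio_q.put_nowait, in_data)
        return (None, pyaudio.paContinue)
    return mic_cb

async def stt_receiver(ws):
    """Forward only finalized transcripts to the LLM stage via utter_q."""
    async for msg in ws:
        data = json.loads(msg)
        alt = data.get("channel", {}).get("alternatives", [{}])[0]
        text = alt.get("transcript", "")
        if text and data.get("is_final"):
            await utter_q.put(text)
```

With interim_results disabled (as in run_stt() above), only finalized, punctuated utterances reach utter_q, so the LLM stage never sees half-formed text.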
Repeating the Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.
🧠 Stage 2: Thinking with an LLM (Why We Pick GPT-4o)
The Language Model (LLM) is the AI’s “brain,” responsible for understanding context and responding naturally. We chose OpenAI’s GPT-4o because of its:
- Streaming capability and fast token-by-token response (key for overlap with TTS)
- Low latency first-token generation (~200–400 ms)
- Powerful reasoning and natural conversation capabilities
- Multimodal input/output flexibility
- Optimal for voice assistant tasks, Q&A, and dynamic IVR systems
GPT-4o delivers both low latency and strong reasoning. Here’s how it compares with common alternatives:
| Model | Avg RTF ↓ | Self-host | $/1K tok | Best fit |
| --- | --- | --- | --- | --- |
| GPT-4o | 1.0–1.3× | ✖️ | $0.005–0.01 | Open-ended reasoning |
| Claude 3 Sonnet | 0.8–1.1× | ✖️ | $0.003 | Task agents |
| Llama-3 70B | 0.6–0.9× | ✔️ (Ollama/vLLM) | ≈ $0.001 (GPU) | On-prem/offline/edge |
RTF ≈ “real-time factor” – lower = faster. ≤1.0 means real time.
⚡ LLM Prompting Pattern (from our voice agent script)
We stream user utterances from STT directly into GPT-4o, handling streaming tokens for real-time responsiveness in your voice agent like this:
# excerpt from realtime_voice_agent.py
SYS_PROMPT = (
    "You are a succinct, helpful assistant. Respond in ≤6 words. "
    "Keep answers short, direct, friendly."
)

stream = oa_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": user_utt}
    ],
    stream=True,      # <-- our script streams
    temperature=0.4   # keeps answers tight
)

Note: The SYS_PROMPT ensures GPT-4o-mini keeps responses succinct (≤6 words), making it perfect for voice interactions.
Security tip: 🔒 Strip or hash PII in user_utt before logging or analytics.
You handle token streaming in real-time:
# We micro-batch the token stream straight into the TTS stage (see token_q).
for chunk in stream:
    tok = chunk.choices[0].delta.content
    if tok:
        await token_q.put(tok)

The loop writes each streamed token into token_q so the next stage can start synthesizing before the LLM is done.
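The "[[FLUSH]]" sentinel that the TTS sender reacts to in Stage 3 has to be emitted somewhere in this loop. Here is a hedged sketch of one way to do it, flushing at sentence boundaries; the heuristics in realtime_voice_agent.py may differ.

```python
# Hedged sketch: push a "[[FLUSH]]" sentinel at sentence boundaries so the TTS
# stage can start speaking complete sentences early.
SENTENCE_END = (".", "!", "?")

async def stream_llm_to_tts(stream, token_q):
    for chunk in stream:
        tok = chunk.choices[0].delta.content
        if not tok:
            continue
        await token_q.put(tok)
        if tok.rstrip().endswith(SENTENCE_END):
            await token_q.put("[[FLUSH]]")  # tell the TTS sender to synthesize now
    await token_q.put("[[FLUSH]]")          # flush whatever remains at the end
```

Flushing per sentence keeps each Speak payload small, which is what lets audio begin before the LLM has finished its full reply.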
The GPT stage is tightly bound to latency. We even log how fast the first token comes back:
if first_tok:
    log(f"⚡ GPT first token {int((time.perf_counter()-t0)*1000)} ms")
    first_tok = False

This gives transparency into how the model performs turn-by-turn.
🔄 Alternative: Llama-3 (Open-source Option)
For security-critical deployments or offline operation, Meta’s Llama-3 70B offers open weights, local inference, and competitive latency:
- Cost-effective (~$0.001 per 1K tokens on A100 GPU)
- Fast inference (0.6–0.9× real-time factor)
- Strong offline performance for privacy-sensitive applications
Always remember to redact PII from transcripts before LLM invocation.
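A hedged sketch of both points follows: a naive PII scrub plus a drop-in local model. vLLM and Ollama expose OpenAI-compatible endpoints, so pointing the existing client at a local URL is usually enough; the regexes and the localhost URL below are illustrative, not exhaustive.

```python
# Hedged sketch: naive PII redaction plus a local Llama-3 drop-in.
import re
from openai import OpenAI

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Mask obvious emails/phone numbers before logging or LLM calls."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

# vLLM/Ollama serve an OpenAI-compatible API, so only base_url changes.
local_client = OpenAI(base_url="http://localhost:8000/v1",
                      api_key="not-needed-locally")

# Usage sketch (model name depends on what you serve locally):
# resp = local_client.chat.completions.create(
#     model="llama-3-70b-instruct",
#     messages=[{"role": "user", "content": redact_pii(user_utt)}])
```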
🗣️ Stage 3: Speaking with Deepgram Aura-2
With text in hand, the voice response must sound natural and arrive quickly to maintain conversational flow. We chose Deepgram Aura-2 for this stack because:
- Sub-200 ms Time-to-First-Byte (TTFB)—critical for staying < 3 s round-trip time (RTT).
- 40+ high-quality voices, multilingual support.
- Cost-effective streaming ($0.030 per 1,000 characters).
- Unified authentication with Nova-3 (STT).
For more information, check out this article.
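Before wiring up streaming, a quick sanity check against the REST /v1/speak endpoint is an easy way to confirm your key has Aura-2 access. The voice/model name and the default output container below are assumptions; check the TTS docs for the current list.

```python
# Hedged sketch: one-off REST call to test Aura-2 before streaming.
import os, requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak",
    params={"model": "aura-2-thalia-en"},  # pick any Aura-2 voice from the docs
    headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
             "Content-Type": "application/json"},
    json={"text": "Hello from Aura-2!"},
)
resp.raise_for_status()
with open("hello_aura2.mp3", "wb") as f:  # default container may differ; verify in docs
    f.write(resp.content)
```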
🎧 Aura-2 Streaming Setup (from our voice agent script)
Aura-2 handles real-time audio synthesis over a WebSocket. The key lines in tts_sender() and tts_receiver() micro-batch tokens into roughly sentence-length chunks and push them to the synthesis socket.
Here's the core pattern we use:
# -- sender: micro-batch tokens every ~180 chars
if tok == "[[FLUSH]]":
    await ws.send(json.dumps({"type": "Flush"}))
    speaking.set()                      # mic pauses
else:
    buffer.append(tok)
    if char_len(buffer) >= 180:
        await ws.send(json.dumps({"type": "Speak", "text": "".join(buffer)}))
        buffer.clear()                  # start the next micro-batch fresh

# -- receiver: first PCM byte
if isinstance(msg, bytes):              # PCM chunk
    if not first_audio:
        log("🎧 Aura audio started")    # first byte → latency probe
        speaking.set()
    spk.play(msg)                       # -> PyAudio out

…and the watchdog that decides when playback is really finished:
# triggers after queue empty 250 ms OR hard ceiling 3 s
if empty_for(queue_empty_wait) or now - last_audio > 3.0:
    finished_playback()  # clears `speaking`, prints RTT

Playback detection lives in tts_receiver() with a three-pronged watchdog (sketched after the list):
- A PlaybackFinished control frame (handled by finished_playback())
- PCM queue emptied for ≥ 250 ms
- Hard stop after silence_timeout_max (3 s default)
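Here is a hedged sketch of what that watchdog could look like as a standalone coroutine. Names and thresholds are illustrative; the real logic lives inside tts_receiver() in the repo.

```python
# Hedged sketch of the playback watchdog: clear `speaking` once the PCM queue
# has been empty for ~250 ms, or after a hard silence ceiling of 3 s.
import asyncio, time

async def playback_watchdog(pcm_q: asyncio.Queue, speaking: asyncio.Event,
                            queue_empty_wait: float = 0.25,
                            silence_timeout_max: float = 3.0):
    empty_since = None
    last_audio = time.monotonic()
    while speaking.is_set():
        now = time.monotonic()
        if pcm_q.empty():
            empty_since = empty_since or now       # start the "drained" clock
        else:
            empty_since = None                     # new audio arrived
            last_audio = now
        drained = empty_since and (now - empty_since) >= queue_empty_wait
        ceiling = (now - last_audio) > silence_timeout_max
        if drained or ceiling:
            speaking.clear()                       # hand the mic back to the user
            break
        await asyncio.sleep(0.05)
```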
With this loop, the user hears the assistant ~600–1,000 ms after finishing a sentence, then sees the following (your first call may hit a cold start, which increases TTFB):
🌊 Aura finishing playback...
⏱ End-to-end RTT: 812 ms
🎤 You can speak now …

… and the loop repeats. Exactly the UX we want. ✅
A few cost-saving tactics worth noting before we run the full script:

| Tactic | How | Savings |
| --- | --- | --- |
| Cache TTS for static prompts (FAQs, greetings) | Redis key = MD5(text) → WAV blob | ↓ ≈ 80% char cost |
| Batch STT (Nova-2) for voicemail-style jobs/non-live tasks | /v1/listen file API | 35% cheaper on bulk |
| Opus instead of PCM to client (STT WS) | 48 kHz → 24 kHz Opus | ↓ 4× bandwidth (lower egress) |
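The first row is trivial to implement. Here is a hedged sketch of the Redis-backed TTS cache (a synchronous `synthesize(text) -> bytes` helper and a local Redis are assumptions for illustration):

```python
# Hedged sketch of the TTS cache from the table above: key = MD5(text) → audio blob.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 7 * 24 * 3600           # static prompts rarely change; tune to taste

def cached_tts(text: str, synthesize) -> bytes:
    key = "tts:" + hashlib.md5(text.encode("utf-8")).hexdigest()
    blob = r.get(key)
    if blob is None:                  # cache miss: pay for synthesis once
        blob = synthesize(text)
        r.set(key, blob, ex=TTL_SECONDS)
    return blob
```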
📜 Full Working Script
The complete file lives at realtime_voice_agent/realtime_voice_agent.py (see repo).
Clone → install deps → run → speak, hear the reply, then speak again when you see “🎤 You can speak now …”:
git clone https://github.com/<you>/realtime_voice_agent.git
cd realtime_voice_agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # pyaudio, websockets, openai, deepgram-sdk, python-dotenv
cp .env.example .env # add your API keys
python realtime_voice_agent.py

Here is what happens:
- STT chunks (~8 KB) stream out every ~167 ms.
- GPT-4o first token typically arrives 100-200 ms later (logged as ⚡ GPT first token … ms).
- Aura-2 starts audio in another < 200 ms.
- The watchdog finishes and prints the RTT log lines with timestamps like:
[ 4.11s] 📝 User: Hi there
[ 5.90s] ⚡ GPT first token 1748 ms
[ 6.51s] 🎧 Aura audio started
[ 7.42s] 🌊 Aura finishing playback...
[ 7.42s] ⏱ End-to-end RTT: 1310 ms

Tweak ALLOW_INTERRUPT=True if you want to talk over the voice (use headphones!).
End-to-End Latency Targets
| Stage | Budget | What we observed* |
| --- | --- | --- |
| STT final-word | ≤ 300 ms | 180–282 ms |
| GPT first token | ≤ 2,000 ms | 719–1,608 ms |
| TTS TTFB | ≤ 250 ms | 140–237 ms |
| RTT (P95) | ≤ 3,000 ms | 812–2,854 ms |
*MacBook M1 Pro 2021, 48 kHz mic, 51 Mbps Wi-Fi.
Next Steps to Reduce RTT Below 1,000 ms
In production, you can bring this RTT below 1,000 ms with the following tweaks (a sketch of the relevant knobs follows the list):
- Drop RATE to 16 kHz (saves ~6 KB/s over the wire).
- Replace SYS_PROMPT with a domain-specific policy or function-calling schema.
- Tune SEND_EVERY and the streaming buffer size.
- Lower queue_empty_wait to 0.15 s and run the agent and LLM in the same region.
- Move inference closer to the edge (Whisper + a local LLM [e.g., Llama-3 via vLLM] + open-source TTS).
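For reference, here is a hedged sketch of those tunables gathered in one place. The names mirror the article; the defaults below are illustrative, not the repo's actual values.

```python
# Hedged sketch of the latency knobs mentioned above (values illustrative).
RATE = 16_000               # mic sample rate in Hz; lower rate = less data on the wire
SEND_EVERY = 0.10           # seconds of audio per WebSocket frame sent to STT
QUEUE_EMPTY_WAIT = 0.15     # seconds the PCM queue must stay empty before ending playback
SILENCE_TIMEOUT_MAX = 3.0   # hard ceiling on waiting for TTS audio

SYS_PROMPT = (
    "You are a billing-support agent for AcmeCo. "   # domain-specific policy (example)
    "Answer in one short sentence; escalate refunds over $50."
)
```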
🎯 Production Readiness: Scaling and Observability
Once your real-time Voice AI agent works locally, the next challenge is production readiness—making it reliable, observable, and scalable.
Deployment Models
- Edge deployments: Great for low-latency, geographically dispersed user bases (ideal for consumer-facing apps). Edge-based apps use WebRTC/JS audio capture in the browser → WebSocket relay to STT. Keep each WebSocket < 30 s to avoid edge idle limits.
- Container deployments: Easier to manage and autoscale, suitable for enterprise or internal-use workloads. Container-based apps run everything server-side (Python + FastAPI/Flask, plus WebSockets), which makes scaling and observability easier. Pin each user to a pod; easier connection reuse, easier GPU LLM colocation too.
For production, containerize your app and autoscale based on concurrent TTS requests using Cloud Run, ECS Fargate, or K8s with a Horizontal Pod Autoscaler.
Observability and Metrics
Treat the voice stack like any other stateless microservice: autoscale horizontally, emit structured metrics, and alert on 95th-percentile latency.
Use standard metrics to monitor your system:
Set of standard metrics for a modern production-ready Voice AI stack
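For example, here is a hedged sketch of exposing per-stage latency histograms from the agent with the prometheus_client library (metric names and buckets are illustrative, not taken from the repo):

```python
# Hedged sketch: expose per-stage latency histograms for Prometheus to scrape.
from prometheus_client import Histogram, start_http_server

STT_LATENCY = Histogram("stt_final_word_ms", "STT final-word latency (ms)",
                        buckets=(100, 200, 300, 500, 1000))
GPT_FIRST_TOKEN = Histogram("gpt_first_token_ms", "LLM first-token latency (ms)",
                            buckets=(250, 500, 1000, 2000, 4000))
ROUND_TRIP = Histogram("round_trip_ms", "End-to-end RTT (ms)",
                       buckets=(500, 1000, 2000, 3000, 5000))

start_http_server(9100)  # Prometheus scrapes http://<host>:9100/metrics

# In the agent, observe a value wherever you already log it, e.g.:
# ROUND_TRIP.observe(rtt_ms)
```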
Log these and expose them to Prometheus (see this guide). If you are running a self-hosted or containerized deployment, add the agent as a scrape target in prometheus.yml: locate the scrape_configs section and add a new job with the Engine container instance as a target (excerpt adapted from the Deepgram docs):
scrape_configs:
  - job_name: voice_agent
    static_configs:
      - targets: ["<ENGINE_INSTANCE_URI>:<HOST_PORT>"]

Add 95th-percentile latency spike alerts in Prometheus Alertmanager to proactively catch slowdowns.
# Prometheus alerting rule (routed via Alertmanager)
- alert: RTT_95thHigh
  expr: histogram_quantile(0.95, sum(rate(round_trip_ms_bucket[5m])) by (le)) > 1000
  for: 3m
  labels: { severity: "page" }

With the integration complete, you can now query the collected metrics using the Prometheus web interface or API. Consider a tool like Grafana for visualization and alerting.
AI-generated Grafana dashboard monitoring STT, GPT, and TTS latencies.
Rate-limit and Token Management
- Deepgram workspace cap: 120 req/s per key.
- Multiplex four WebSockets per user (STT, TTS, heartbeat, analytics) and keep bursts ≤ 20 req/s (a minimal limiter sketch follows the list).
- Rotate keys weekly via CI secret store; invalidate on user logout.
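Here is a hedged sketch of a tiny asyncio rate limiter you could put in front of outbound requests to stay under that ~20 req/s burst; the class and threshold are illustrative.

```python
# Hedged sketch: keep per-user bursts under ~20 req/s across the multiplexed sockets.
import asyncio, time

class RateLimiter:
    def __init__(self, max_per_sec: int = 20):
        self.max_per_sec = max_per_sec
        self.calls: list[float] = []

    async def acquire(self):
        while True:
            now = time.monotonic()
            self.calls = [t for t in self.calls if now - t < 1.0]  # keep last 1 s window
            if len(self.calls) < self.max_per_sec:
                self.calls.append(now)
                return
            await asyncio.sleep(0.02)

# Usage: limiter = RateLimiter(); await limiter.acquire() before each ws.send()/request.
```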
| Failure | Auto-Recovery in realtime_voice_agent.py |
| --- | --- |
| GPT 5XX | Return a short "Give me a second…" reply and retry the last user turn asynchronously (max 3) |
| TTS timeout > 3 s | Close the WS; send 50 ms of zero-audio every 8 s (already in mic_cb) |
| Partial transcript missing | Fall back to the last "interim" text or send a one-word ack ("Sure!") |
| STT 4XX | Exponential back-off + resume the audio queue – code hook ↓ |
# If the STT socket drops, flush the mic buffer and resume – code hook ↓
async def stt_sender(ws):
    while True:
        try:
            await ws.send(await audio_q.get())
        except websockets.ConnectionClosed:
            log("🔄 STT reconnect…")
            await asyncio.sleep(0.8)
            return  # enclosing task restarts the socket

💰 Cost Optimization Tips
Cost scales with usage. Here’s how to optimize (the cost-tactics table above has the numbers):
Repeating the Author’s Note: Remember, even though you can build this STT → NLP → TTS Pipeline yourself using this tutorial, there already exists an API that does all of this work for you out-of-the-box! So if you don’t feel like building something from scratch, click here.
- Enable Opus in Deepgram via &encoding=opus – still < 300 ms decode (see the sketch below).
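As a hedged sketch, the switch is mostly a URL change; note that your client must then actually send Opus-encoded frames (e.g., via an Opus encoder library) instead of raw PCM, and the sample rate below is just the 24 kHz value from the table above.

```python
# Hedged sketch: switching the STT stream to Opus to cut upstream bandwidth.
STT_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-3"
    "&encoding=opus"        # from the tip above; client must send Opus frames
    "&sample_rate=24000"    # downsampled from 48 kHz, per the cost-tactics table
)
```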
⚠️ Common Pitfalls, Symptoms, and Fixes
Common deployment issues, the symptoms, and immediate fixes:
A list of common pitfalls, symptoms, and fixes for building production-grade voice AI applications.
🏁 Wrap-Up and Next Steps
You now have 🎉
- A production-grade blueprint: Nova-3 (STT) ➜ GPT-4o (LLM) ➜ Aura-2 (TTS) with sub-3 s round-trip.
- Drop-in, commented code (see realtime_voice_agent.py) you can fork and run today.
- ✅ Benchmarks showing industry-leading latency, accuracy, and cost-efficiency
Whether you're building an in-game NPC, a voice-based assistant, or a smart IVR agent, this blueprint gives you a reliable, scalable, and fast starting point.
🎯 What to do next?
- 🔐 Sign up at Deepgram → Get $200 in free credits to jumpstart your voice AI experiments.
- 📘 Read the API docs → Deepgram STT & TTS endpoints.
- 🛠️ Fork the full demo repo → Ready-to-run voice AI agent with full code.
- 📞 Book a 30-min consult → Need help scaling? Our engineers will review your stack and share best practices.
- Check out the Deepgram community.
🙌 Thanks for building with us
We created this guide to give you real-world code, production insights, and performance confidence.