By Stephen Oladele
The fastest path from ‘hello’ to a handled ticket: AI voice agents that listen, think, and act, all on your stack.
Customers don’t want to press 3 for billing. Field techs can’t tap through menus with a wrench in hand. And teams don’t have time to glue together brittle speech-to-text/LLM/text-to-speech chains that fall apart under noise, accents, or interruptions.
The gap has finally closed: modern speech recognition, real-time reasoning, and natural speech synthesis now work together in one streaming loop, so voice agents can actually finish tasks, not just chat.
AI voice agents are now good enough to handle real customer conversations live, hands-free, and interruption-friendly. This is because speech-to-text, tool-using reasoning, and natural speech synthesis operate as a single, low-latency pipeline.
This guide gives you 10 production-grade use cases and patterns you can ship today with a voice-agent stack (STT ↔ LLM/tool calls ↔ TTS), each tied to concrete business outcomes and a 3-step “how to build it” recipe.
A multimodal AI voice agent workflow. Speech from the user is streamed to the Multimodal Model, which then performs the user’s task and responds with speech directly.
Here’s what you’ll learn in this guide for each use case:
- Why voice (not just chat) is the right interface: hands-free workflows, barge-in, noisy channels, faster task completion, and multilingual reach.
- Minimal architecture for each pattern: real-time STT, LLM/tool calls, and TTS (bring your own LLM if you like) plus simple hooks into CRMs, calendars, POS, ticketing, EMR, and billing.
- Latency and UX guardrails that feel human: partials, quick confirms, read-backs, clarifying questions, and graceful barge-ins.
- Safety and reliability basics: redaction where appropriate, human handoff, retries/backoff, observability, and KPIs that prove ROI.
- Exactly how to try one now: open a Playground preset (or curl), place a live call, add one tool (e.g., CRM/Calendar/POS), and measure.
How to use this guide:
- Skim the 10 use-case cards and pick the one that matches your highest-volume call type.
- Copy the 3-step build and the Try it now snippet.
- Integrate one tool function (calendar, CRM, POS, ticketing).
- Track the KPIs we list for that pattern; iterate on prompts and vocab.
How We Chose These Use Cases
We prioritized use cases that create immediate, measurable impact and showcase what modern voice agents do best: complete tasks in real time under messy, human conditions (noise, accents, interruptions) while calling your tools (CRM, POS, ticketing, calendars).
Methodology: Reach × Urgency × Differentiation × Fit
We scored each use case on four criteria:
- Reach
- What it Means: Common in real-world ops: high call volumes, common workflows, or routine tasks.
- Why it Matters: Voice agents should handle the most calls, not the rare edge cases.
- Urgency
- What it Means: Business pressure to solve today: missed calls, long hold times, poor CX, lost revenue.
- Why it Matters: If it’s costing you now, it’s worth automating now.
- Differentiation
- What it Means: Voice ≫ chat or UI: fast hands-free actions, interruptibility, multilingual, noisy channel.
- Why it Matters: Voice makes sense when tapping or typing fails.
- Stack Fit
- What it Means: Clean tool/API surface, low data risk, straightforward compliance/handoff.
- Why it Matters: These use cases shine with the right infra and STT/LLM/TTS orchestration.
What to Expect from Each Use Case
Every use case in this guide follows a consistent, engineering-focused format so you can understand, build, and measure it in under 5 minutes.
Here’s the structure we use:
1. Example Scenario
A concrete user moment you’ve likely seen before, e.g., “Reset my router,” or “I want to book a follow-up visit.”
2. Why It’s Useful
The business outcome: shorter wait times, increased order volume, better customer retention, reduced support load, or pipeline acceleration.
3. Why Voice AI Is Needed
What makes this use case better with voice than chat or app UI:
- Real-time natural language understanding (NLU)
- Interruptions and corrections (barge-in, repair)
- Accents and code-switching
- Contextual reasoning + tool use
- Multilingual support
4. How to Build It (3 Steps)
Minimal architecture using a Voice Agent API stack, such as:
- Stream audio to Nova-3 for STT (with barge-in and partials).
- Use an LLM (Deepgram default or BYO) to reason, call tools, and output actions.
- Send responses to Aura-2 for low-latency TTS delivery.
We’ll also highlight where you can plug in your own models, CRMs, POS, or calendar systems.
5. Try It Now
A link to a Playground preset, curl command, or downloadable repo that shows the use case in action with real or mock data.
With the patterns in place, let’s jump right into the first use case: customer support triage!
Use Case 1: Customer Support Triage & Self-Service
Example Scenario
A customer calls in and says, “My internet is down.” The agent verifies the account, runs scripted diagnostics, attempts a modem reset, confirms link health, and offers human escalation with the ticket prefilled.
Why it’s Useful (Business Outcome)
The agent deflects Tier-1 tickets and repetitive flows, reduces average handle time (AHT) via faster verification and guided steps, and improves first contact resolution (FCR)/customer satisfaction score (CSAT) by confirming actions and outcomes before the call ends. This gives consistent troubleshooting every time.
Why Voice AI Interface is Needed
- Barge-in and turn-taking: Real callers interrupt (“I already tried that”) without breaking flow.
- Real-time NLU: Quickly extracts account, device, and error cues from messy speech.
- Tool use: The agent invokes getAccount, runDiagnostics, openTicket via function calling to close the loop.
- Robustness: Handles accents/noise better than Dual-Tone Multi-Frequency signaling (DTMF) trees, and most agents support multilingual follow-ups.
How to Build (3 Steps)
1. Open a session and configure:
Connect to the WebSocket and immediately send a Settings message to define audio in/out, agent behavior, and providers.
wss://agent.deepgram.com/v1/agent/converse
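For reference, here’s a minimal Python sketch of that handshake using the websockets library. The audio formats and Settings fields mirror the config examples later in this guide and are assumptions to verify against the Voice Agent docs:

import asyncio, json, os
import websockets  # pip install websockets

AGENT_URL = "wss://agent.deepgram.com/v1/agent/converse"

async def open_session():
    # Deepgram auth uses a "Token <api key>" Authorization header.
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # On older websockets versions the keyword is extra_headers instead.
    async with websockets.connect(AGENT_URL, additional_headers=headers) as ws:
        # First message: Settings. Defines audio in/out, providers, and the prompt.
        settings = {
            "type": "Settings",
            "audio": {  # example formats; match your telephony/WebRTC source
                "input": {"encoding": "linear16", "sample_rate": 16000},
                "output": {"encoding": "linear16", "sample_rate": 24000},
            },
            "agent": {
                "listen": {"provider": {"type": "deepgram", "model": "nova-3"}},
                "think": {
                    "provider": {"type": "open_ai", "model": "gpt-4o-mini"},
                    "prompt": ("You are a Tier-1 support agent. Verify the account, "
                               "run scripted diagnostics, and escalate with a prefilled ticket."),
                },
                "speak": {"provider": {"type": "deepgram", "model": "aura-2-thalia-en"}},
            },
        }
        await ws.send(json.dumps(settings))
        # From here: stream caller audio as binary frames; read agent events and TTS audio back.

asyncio.run(open_session())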
2. Add function calls:
Build function calls like getAccount, runDiagnostics, and openTicket so the agent can fetch customer data, follow KB steps, and file or escalate tickets.
Use the Voice Agent FunctionCallRequest/Response flow.
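As an illustration, these functions can be declared alongside the prompt in the think block. The JSON-Schema-style definitions below are a sketch; field names are assumptions, so check the Voice Agent function-calling docs for the exact format:

# Hypothetical function definitions passed with the prompt (sketch only).
support_functions = [
    {
        "name": "getAccount",
        "description": "Look up a customer account by phone number or account ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "phone": {"type": "string", "description": "Caller's phone number"},
                "accountId": {"type": "string"},
            },
        },
    },
    {
        "name": "runDiagnostics",
        "description": "Run scripted modem/line diagnostics for an account.",
        "parameters": {
            "type": "object",
            "properties": {"accountId": {"type": "string"}},
            "required": ["accountId"],
        },
    },
    {
        "name": "openTicket",
        "description": "Open or escalate a support ticket with a prefilled summary.",
        "parameters": {
            "type": "object",
            "properties": {
                "accountId": {"type": "string"},
                "summary": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "normal", "high"]},
            },
            "required": ["accountId", "summary"],
        },
    },
]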
3. Pick the Model Stack:
Use a Deepgram-managed LLM or bring your own (OpenAI/Azure/Bedrock-style providers are supported in settings). Respond with Aura-2 TTS.
(▶️ Try the Customer Support use case in Playground.)
Scott Chancellor,
CEO of Aircall.
"We believe the future of customer communication is intelligent, seamless, and deeply human—and that’s the vision behind Aircall’s AI Voice Agent. To bring it to life, we needed a partner who could match our ambition, and Deepgram delivered. Their advanced Voice Agent API enabled us to build fast without compromising accuracy or reliability. From managing mid-sentence interruptions to enabling natural, human-like conversations, their service performed with precision. Just as importantly, their collaborative approach helped us iterate quickly and push the boundaries of what voice intelligence can deliver in modern business communications."
Use Case 2: Drive-Thru Ordering (QSR)
Example Scenario
A customer pulls up to the speaker and says, “Can I get a double cheeseburger… actually make it a meal.” The agent parses the change mid-utterance, confirms options (size, drink), suggests an upsell (“Would you like apple pie?”), reads back the total, and pushes the ticket to POS.
Why it’s Useful (Business Outcome)
Drive-thru lanes rely heavily on throughput, accuracy, and upsell discipline. A voice agent delivers consistent scripts during surges, protects margins with timely cross-sells, and covers late-night shifts without staffing gaps, raising cars/hour, improving order accuracy, and lifting average order value.
Why Voice AI Interface is Needed
- Low-latency call-and-response so customers never wait in silence; the system answers within a few hundred milliseconds.
- Barge-in/turn-taking to handle mid-sentence changes and noise at the lane.
- Tool use (Function Calling) to add/modify items and send to POS reliably.
- Major QSR pilots have built industry momentum and hard-won lessons, showing clear benefits when latency and accuracy are handled well.
How to Build (3 Steps)
1. Prompt with menu schema and upsell rules
Initialize the session over WebSocket and send a Settings message that loads a compact menu JSON (SKUs, options, pricing) and simple upsell policy (“If entree=X and no dessert, suggest Y”).
wss://agent.deepgram.com/v1/agent/converse
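For instance, the prompt might embed a compact menu and upsell policy like this (a sketch; SKUs, prices, and rule phrasing are made up):

import json

# Hypothetical compact menu the agent reasons over.
MENU = {
    "items": [
        {"sku": "DBL-CHZ", "name": "Double Cheeseburger", "priceUSD": 5.49,
         "mealOption": {"sku": "DBL-CHZ-MEAL", "priceUSD": 8.49, "options": ["fries_size", "drink"]}},
        {"sku": "APPLE-PIE", "name": "Apple Pie", "priceUSD": 1.99},
    ],
    "upsellRules": [
        {"if": "order has an entree and no dessert", "suggest": "APPLE-PIE"}
    ],
}

PROMPT = (
    "You are a drive-thru order taker. Only sell items from this menu:\n"
    + json.dumps(MENU)
    + "\nConfirm size and drink for meals, suggest at most one upsell, "
      "read back the total, then call sendToPOS."
)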
2. Expose POS tools
Register functions the agent can call during the conversation: addLineItem(sku, qty, options), finalizeOrder(), sendToPOS(orderId). Implement via the FunctionCallRequest/Response flow.
Here’s what a Python skeleton handler for the function call looks like:
# Skeleton handler: `ws` is the open agent WebSocket and `json` is imported.
# For each FunctionCallRequest, run the named function and reply with a
# FunctionCallResponse (results are mocked here; wire these to your POS).
async for raw in ws:
    msg = json.loads(raw)
    if msg.get("type") == "FunctionCallRequest":
        for fn in msg.get("functions", []):
            name = fn["name"]
            args = json.loads(fn.get("arguments", "{}"))
            if name == "addLineItem":
                result = {"ok": True, "lineCount": 1, "cartTotalUSD": "11.49"}
            elif name == "finalizeOrder":
                result = {"ok": True, "orderId": "ORD-AB12CD", "totalUSD": "11.49"}
            elif name == "sendToPOS":
                result = {"ok": True, "orderId": args.get("orderId"), "posStatus": "ACCEPTED"}
            else:
                result = {"ok": False, "error": "unknown function"}
            await ws.send(json.dumps({
                "type": "FunctionCallResponse",
                "id": fn.get("id"),
                "name": name,
                "content": json.dumps(result)
            }))
The content is usually a JSON string result that the agent can reason over.
3. Choose model stack and speaking voice
You can keep Deepgram-managed OpenAI for “think” and Aura-2 for “speak,” or bring your own LLM/TTS by changing agent.think.provider/agent.speak.provider in the Settings message. For example:
(A) Default: Deepgram LLM (OpenAI) + Aura-2
agent: {
listen: { provider: { type: "deepgram", model: "nova-3" } },
think: { provider: { type: "open_ai", model: "gpt-4o-mini", temperature: 0.3 }, prompt, functions: [...] },
speak: { provider: { type: "deepgram", model: "aura-2-thalia-en" } }
}
Aura-2 models use the form aura-2-<voice>-<lang>, e.g., aura-2-thalia-en.
(B) BYO LLM (custom endpoint)
agent: {
listen: { provider: { type: "deepgram", model: "nova-3" } },
think: {
provider: { type: "google", model: "gemini-1.5-pro" }, // or groq / aws_bedrock / anthropic
endpoint: { url: "https://your-llm.example.com/invoke", headers: { authorization: "Bearer <token>" } },
prompt, functions: [...]
},
speak: { provider: { type: "deepgram", model: "aura-2-thalia-en" } }
}
When using third-party LLMs (e.g., Google/Groq/Bedrock), set agent.think.endpoint.url and headers. See the docs.
(C) BYO TTS (non-Deepgram)
agent: {
// ...listen & think...
speak: {
provider: { type: "open_ai", model: "tts-1", voice: "alloy" },
endpoint: { url: "https://api.openai.com/v1/audio/speech", headers: { authorization: "Bearer <OPENAI_KEY>" } }
}
}
For non-Deepgram TTS providers, include the speak.endpoint details; otherwise the agent will fail to speak. For Deepgram TTS, an endpoint isn’t required; setting the model is enough.
Use Case 3: Appointment Booking & Rescheduling (Healthcare/Services)
Example Scenario
A caller says, “I need a dermatology appointment next Thursday after 3 PM.” The agent verifies name/date-of-birth, checks provider availability that matches the constraint, books the first acceptable slot, captures insurance details, and sends an SMS confirmation with date, time, clinic address, and prep notes.
An agent receives a user query and is tasked with selecting the right tool to effectively address the query.
Why it’s Useful (Business Outcome)
Booking and rescheduling consume a large share of inbound volume. Automating the happy path reduces abandoned calls, improves booking conversion, and captures structured data correctly the first time.
Proactive confirmations and reminders reduce no-shows and the manual back-and-forth that inflates average handle time.
Why Voice AI Interface is Needed
- Constraint solving in real time: “next Thursday after 3 PM, not Dr. Lee, and telehealth only.”
- Function calls to calendar/EMR: query availability, enforce scheduling rules, and write back results.
- Barge-in and turn-taking for clarifications (“Actually make it Friday”) without losing context.
- Multilingual + accents handling to widen access and reduce errors in names/IDs.
How to Build (3 Steps)
1) System prompt with slotting policy
Define business rules in the prompt: required identifiers (full name, DOB), allowed visit types (new/return, in-person/telehealth), slot granularity (e.g., 15 min), and confirmation protocol (read-back + SMS). Include examples for date constraints (“this Friday”, “the 2nd Tuesday in October”).
You are a scheduling agent. Always:
1) Verify identity (name + DOB).
2) Interpret date/time constraints; prefer nearest qualifying slot.
3) Confirm details verbatim; then trigger sendSMS with the final summary.
2) Tools: checkAvailability, bookSlot, sendSMS
Expose three functions the agent can call:
- checkAvailability(providerId?, specialty, from, to, constraints) → list of slots
- bookSlot(slotId, patientId, visitType, location) → {appointmentId, start, provider}
- sendSMS(patientPhone, message) → {ok: true}
Implement via FunctionCallRequest/Response and return JSON the agent can reason over.
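A handler for these calls follows the same FunctionCallRequest/Response pattern as the QSR example. Here’s a trimmed Python sketch with mocked calendar data; function names come from the list above, while fields and IDs are illustrative:

import json

# Mocked backends; in production these hit your EMR/calendar APIs.
def check_availability(args):
    return {"slots": [{"slotId": "SLOT-123", "start": "2025-10-09T15:30:00-05:00",
                       "provider": "Dr. Alvarez", "visitType": "in-person"}]}

def book_slot(args):
    return {"appointmentId": "APPT-789", "start": "2025-10-09T15:30:00-05:00",
            "provider": "Dr. Alvarez"}

def send_sms(args):
    return {"ok": True}

HANDLERS = {"checkAvailability": check_availability,
            "bookSlot": book_slot,
            "sendSMS": send_sms}

async def handle_function_call(ws, msg):
    # Answer every requested function with a FunctionCallResponse the agent can read.
    for fn in msg.get("functions", []):
        handler = HANDLERS.get(fn["name"])
        args = json.loads(fn.get("arguments", "{}"))
        result = handler(args) if handler else {"ok": False, "error": "unknown function"}
        await ws.send(json.dumps({
            "type": "FunctionCallResponse",
            "id": fn.get("id"),
            "name": fn["name"],
            "content": json.dumps(result),
        }))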
3) Multilingual Nova-3 + Aura-2 for confirmations
Use nova-3 for real-time transcripts (with partials and barge-in) and Aura-2 to read back the appointment details clearly. If needed, enable a second language (e.g., ES) and allow the agent to mirror the caller’s language for confirmations.
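A listen/speak configuration along these lines could support that; treat the language setting and its placement as assumptions to verify against the Voice Agent Settings docs:

agent_config = {
    "listen": {
        # "multi" is assumed here as Nova-3's multilingual mode; confirm the exact
        # parameter name and placement in the docs.
        "provider": {"type": "deepgram", "model": "nova-3", "language": "multi"}
    },
    "speak": {
        # Default English voice; swap in a Spanish Aura-2 voice for mirrored
        # read-backs if one is available on your account.
        "provider": {"type": "deepgram", "model": "aura-2-thalia-en"}
    },
}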
Real-world example:
Cognigy's HIPAA-compliant healthcare solutions make tasks like registering new patients and doing regular check-ins easier (e.g., post-surgery), so doctors and nurses can focus on giving critical care. Cognigy integrates Deepgram’s models.
Technical Insight:
These systems often use hybrid AI architectures that combine ASR for voice input with LLMs for extracting and comprehending medical context.
Use Case 4: Interactive Voice Response (IVR) Replacement for SMB
Example Scenario
A caller says, “I need last month’s invoice,” or “What are your Saturday hours?” The agent understands the request, answers FAQs directly (hours, address, pricing), or routes to the right queue (Billing vs. Tech Support) with context attached to the transfer.
Why it’s Useful (Business Outcome)
Classic DTMF trees, which require users to press numbers for options, often lead to abandoned phone calls and misrouted inquiries.
A conversational IVR removes keypad friction, cuts time-to-answer, raises containment (self-service success), and reduces abandon rate, especially after hours when staffing is thin.
When handoff is required, the agent passes a clean summary so humans start with context instead of “hello.”
Why Voice AI Interface is Needed
- Turn-taking + barge-in for fluid redirection (“Actually, tech support please”), without dead air.
- Function calls to fetch authoritative answers from your FAQ/CRM, then decide whether to resolve or route.
- Single, real-time voice-to-voice stack (STT ↔ LLM ↔ TTS) designed for enterprise responsiveness and control.
How to Build (3 Steps)
1) Prompt with intents + fallback policy
Define core intents (Billing, Tech Support, Sales, Hours/Location, Pricing, Invoices) and a fallback policy: after two low-confidence tries, say “I’ll connect you to a specialist” and escalate with the collected context (name, account/email, summary).
Include a short answer style guide: concise, confirm understanding, offer next action.
You are the front door for <Company>. Detect intent quickly, answer FAQs directly, and only then route.
If confidence is low twice, use `routeToQueue(intent, summary, priority)`.
Configure via the Settings message before any audio. Also see Deepgram’s Intent Recognition feature under Audio Intelligence.
2) Tools: getFAQAnswer, routeToQueue
Expose two functions the agent can call at runtime:
- getFAQAnswer(question|topic) → {answer, source} (pull from CMS/knowledge base/CRM)
- routeToQueue(queue, context) → {transferId} (SIP/PSTN/CCaaS handoff)
Implement with FunctionCallRequest/Response so the agent can fetch answers or initiate a transfer with metadata.
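Here’s a sketch of the two handlers with a stubbed knowledge base and transfer call; topics, IDs, and payload fields are illustrative:

# Stubbed FAQ store; in production, pull from your CMS/knowledge base/CRM.
FAQ = {
    "saturday_hours": {"answer": "We're open 9 AM to 2 PM on Saturdays.", "source": "site/hours"},
    "invoice_copy": {"answer": "I can email last month's invoice to the address on file.",
                     "source": "billing-kb/42"},
}

def get_faq_answer(args):
    entry = FAQ.get(args.get("topic", ""))
    return entry or {"answer": None, "source": None}

def route_to_queue(args):
    # In production: create the transfer via your SIP/PSTN/CCaaS provider
    # and attach the collected summary so the human starts with context.
    return {"transferId": "XFER-001", "queue": args.get("queue"), "context": args.get("context")}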
3) Choose default or BYO LLM; use Aura-2 neutral voice
Keep a neutral, friendly TTS voice for brand consistency. If you bring your own LLM, keep strict tool schemas (see example payloads) and cap response length to avoid monologues.
Use Case 5: Lead Qualification and Routing (RevOps)
Example Scenario
A prospect calls the inbound demo line: “We’re exploring AI voice for our support team.” The agent captures name, company, role, and email; clarifies the use case and timeline; scores the lead (e.g., MEDDIC/BANT); creates the record in a customer relationship management tool; and books an account executive for the earliest qualifying slot.
The agent then emails a calendar invite and a one-paragraph summary.
Why it’s Useful (Business Outcome)
You get a 24/7 SDR that never misses a lead, asks the same crisp discovery questions every time, and pushes complete records into your CRM. This improves qualified rate, shortens speed-to-meeting, and increases pipeline value with cleaner data and fewer handoffs.
Why Voice AI Interface is Needed
- Rapid back-and-forth with turn prediction so the dialog feels human.
- Function calls to write leads, compute a score, and schedule across calendars.
- Accents/noise robustness and multilingual follow-ups for global inbound.
- Tone and trust: natural TTS for a professional greeting, concise confirmations, and read-backs that build trust before committing a meeting.
How to Build (3 Steps)
1) Prompt with MEDDIC/BANT rules
Seed the agent with the discovery playbook: budget/authority/need/timeline, 3–4 mandatory questions, and a stop rule once qualification is achieved.
Add a style guide: keep answers <10s, confirm facts, use an email-summary template, and avoid salesy language.
You are a voice SDR. Goals:
1) Capture M (Metrics), E (Economic buyer), D (Decision criteria),
D (Decision process), I (Identify pain), C (Champion) in <=4 questions.
2) Scoring: +2 if timeline ≤ 60 days; +1 if live/real-time; -1 if research only.
3) If score ≥ 4 → offer 3 slots with AE_US_East; else add to nurture and send recap.
4) Confirm summary, then call bookMeeting(start, duration, attendees).
5) Be concise; never exceed 15-second turns.
Include two example Q&A snippets for each MEDDIC element.
2) Tools: createLead, scoreLead, bookMeeting
- createLead(name, company, email, phone, useCase) → {leadId}
- scoreLead(leadId, meddicJson) → {score, band}
- bookMeeting(leadId, slotISO, duration, region, slotPref) → {eventId, start, calendarUrl}
Return concise JSON so the agent can reason and decide next steps (continue questions vs. schedule).
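As a concrete example, the scoring step might return JSON like this under the rubric in the prompt above; thresholds and field names are made up:

def score_lead(meddic):
    """Toy scoring aligned with the prompt's rubric; replace with your own logic."""
    score = 0
    if meddic.get("timeline_days", 999) <= 60:
        score += 2
    if meddic.get("realtime_use_case"):
        score += 1
    if meddic.get("research_only"):
        score -= 1
    band = "qualified" if score >= 4 else "nurture"
    return {"score": score, "band": band}

# Example: score_lead({"timeline_days": 45, "realtime_use_case": True})
# -> {"score": 3, "band": "nurture"}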
3) Model and voice
Use the default OpenAI model via Deepgram for reasoning, or swap in your own LLM; pick aura-2-odysseus-en for a professional tone. Keep temperature: 0.2 for tight phrasing.
Conclusion: From Demo to Production (Checklist)
You’ve got 10 patterns, code templates, and KPI targets. Here’s a pragmatic, staged checklist to take a Voice Agent from “cool demo” to reliable production.
P0: Must-Haves Before Real Traffic
Latency SLOs
- p95: mic → first partial ≤ 250 ms; partial → TTS start ≤ 600 ms; end-to-end round trip ≤ 1.0 s.
- Alert on breaches; surface per-stage timings in logs.
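For per-stage timings, a simple approach is to mark elapsed milliseconds from the start of each turn and log them with the session ID; a sketch (stage names are arbitrary):

import time, logging

log = logging.getLogger("voice-agent.latency")

class StageTimer:
    """Records elapsed ms per pipeline stage for one turn."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.t0 = time.monotonic()

    def mark(self, stage):
        elapsed_ms = (time.monotonic() - self.t0) * 1000
        log.info("session=%s stage=%s elapsed_ms=%.0f", self.session_id, stage, elapsed_ms)
        return elapsed_ms

# Per turn:
#   timer = StageTimer(session_id)
#   on first STT partial: timer.mark("first_partial")
#   on TTS start:         timer.mark("tts_start")
#   on TTS complete:      timer.mark("turn_complete")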
Error budgets and retries
- Define a monthly error budget (e.g., ≤ 0.5% failed interactions).
- Tool calls: timeouts, exponential backoff, and idempotency keys on create/update.
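A minimal retry wrapper with exponential backoff and a reused idempotency key might look like this, assuming the requests library and a downstream API that honors an Idempotency-Key header:

import time, uuid
import requests

def call_tool_with_retry(url, payload, max_attempts=3, timeout_s=5):
    """POST with exponential backoff; reuse one idempotency key across retries."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=timeout_s)
            if resp.status_code < 500:
                return resp.json()
        except requests.RequestException:
            pass  # network error: fall through to backoff
        time.sleep(2 ** attempt)  # 1s, 2s, 4s
    raise RuntimeError("tool call failed after retries")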
Graceful handoff to human
- One-shot clarify; if confidence stays low or user asks: transfer with summary + transcript + tool context.
- Track handoff rate and post-handoff resolution.
Rate limits and protection
- Apply per-caller and per-IP rate limits; circuit-breaker on upstreams.
- Backpressure: pause listening or respond with partial acks when tools are slow.
P1: Make It Observable, Adaptable, and Testable
Analytics events
- Emit events for: turn times, barge-ins, tool-call outcomes, handoffs, KPI snapshots.
- Tie each session to trace IDs for call→tool→TTS correlation.
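Events can be as simple as structured JSON lines keyed by session and trace IDs; a sketch (stdout stands in for your warehouse or metrics pipeline):

import json, time

def emit_event(session_id, trace_id, name, **fields):
    """Append-only analytics event; replace print with your real sink."""
    event = {"ts": time.time(), "session_id": session_id, "trace_id": trace_id,
             "event": name, **fields}
    print(json.dumps(event))

# Usage: emit_event(sid, tid, "tool_call", tool="bookSlot", ok=True, latency_ms=420)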
Canned fallback prompts
- Short, brand-safe replies for low-confidence cases.
- Include “clarify once → handoff” policy.
A/B Prompt Sets
- Version prompts (A/B/C) with guardrails; rotate weekly.
- Track containment, AHT, CSAT deltas per variant.
Vocabulary and Glossary Updates
- Maintain a domain glossary (SKUs, acronyms, provider names).
- Refresh monthly; validate with domain-term accuracy audits.
P2: Scale and Expand
Multilingual rollout
- Detect language → mirror caller; ensure policy translations are approved.
- Add bilingual read-backs for safety-critical content (addresses, payments).
Channel expansion
- PSTN (SIP/CCaaS), WebRTC, mobile SDK.
- Normalize session analytics across channels; keep turn/latency SLOs identical.
Operational runbooks
- Incident playbooks (latency spike, tool outage), on-call rotations, rollback of prompts/functions, and weekly KPI reviews.
Enterprise or regulated industry? Talk to a voice AI expert for deployment options (network isolation, redaction policies, on-prem/virtual private cloud).
Deepgram Voice Agent API brings real-time STT + LLM/tooling + TTS into one pipeline so your agents can listen, think, and act—with the speed and control production teams require.


