Table of Contents
Deepgram Voice Agent API powered by NVIDIA Nemotron.
This post covers the architecture, the latency numbers we measured, and how to try the stack in under a minute.
The most valuable voice agent use cases live inside environments that often cannot send conversations to a public API. Hospitals running patient intake agents, banks deploying wealth advisors, and federal agencies processing claims all share the same constraint: audio and language-model reasoning have to stay inside their on-prem network, private clouds, or VPCs. For a decade, the most regulated industries have been the ones with the most to gain from voice AI but the least able to adopt it. This post is about the stack that addresses the challenge.
For these teams, the bottleneck has been the deployment pattern rather than the quality of the models. Deepgram solves that. Deepgram provides a voice agent stack where all layers can run on a single infrastructure inside the customer environment. The layers include speech-to-text (STT) or Automatic Speech Recognition (ASR), LLM, text-to-speech (TTS), and the orchestration that connects them into a unified pipeline.
The Deepgram Voice Agent API connects all the layers supported by STT, LLM, and TTS models from multiple providers. For LLM reasoning, Voice Agent API supports both frontier APIs and leading open-source models including NVIDIA Nemotron, as the featured LLM option. Nemotron is NVIDIA's family of open models spanning speech, reasoning, and other modalities. This post focuses on the reasoning models (Nemotron 3 Nano and Nemotron 3 Super), delivered as NVIDIA NIM and optimized for NVIDIA GPUs. NVIDIA Nemotron enables the flexible deployment of Deepgram Voice Agent APIs, on cloud, customer VPCs, on-premise, or on a hybrid configuration. The rest of this post covers what we built, what we measured, and how a developer can try the stack in under a minute.
What it previously took to build a voice agent
Building from raw APIs requires integrating separate components and building conversational logic from scratch.
Voice agent pipelines commonly start with audio coming in from a telephony provider, flowing through a streaming STT model, and passing to an LLM API for reasoning and response generation before the response is converted back to audio via a TTS provider. Between each model, teams must maintain their own application layer (with websocket setup, audio capture, UI state, authentication, connection health, and user management) and middleware layer (with database management, request routing, streaming coordination, prompt assembly, interruption detection, and concurrency handling). Each component is a code path that the team must write, test, and maintain before a single customer conversation can occur.
This pattern has kept production voice agents expensive to build and brittle to operate, and it is the pattern that the Deepgram Voice Agent API is built to replace.
The Voice Agent API lets developers define agent behavior once and manage a single connection. Audio goes in, audio comes out, and the API automatically handles turn-taking, barge-in, LLM streaming, and the BYO-LLM orchestration layer for you.
One example of implementation
With the Voice Agent API absorbing the orchestration layer, the Deepgram stack puts three models in a single pipeline. Voice Agent API supports multiple STT, TTS, and LLM providers. Below is one viable combination of STT, LLM, and TTS models.
- Deepgram Nova 3 (STT) handles streaming speech-to-text. It is tuned for real-world audio at scale, including telephony, short conversational turns, and the low-latency streaming that voice agents need for good interruption handling. In our self-hosted measurements, and inside a customer VPC environment, Nova 3 delivered P50 first-token latency of 198 ms on NVIDIA GPUs inside an AWS VPC.
- NVIDIA Nemotron LLM models handle the reasoning. Nemotron 3 Nano is already live in the Deepgram playground, and Nemotron 3 Super (120B total, 12B active via LatentMoE) is the production reasoning model we benchmarked for this post. Both models are packaged as NVIDIA NIM, GPU-optimized containers that make deploying Nemotron repeatable across environments. For our current cloud deployment, we access Nemotron 3 Super through Amazon Bedrock (fully managed serverless endpoints) via the InvokeModel and Converse APIs.
- Deepgram Aura 2 closes the loop with text-to-speech, delivering natural prosody and streaming audio so the first words reach the user before the LLM has even finished composing its response.
- The Deepgram Voice Agent API sits across all three models. It owns turn-taking, barge-in and interruption, LLM streaming, and a BYO-LLM architecture that currently supports more than 20 providers. It is the layer that replaces the application and middleware layers in the DIY picture above.
Latency: what we measured
A voice agent only feels conversational when the gap between the user finishing their turn and the agent beginning its response is minimal.
The latency metrics below were measured with Deepgram Voice Agent components running in an AWS VPC and NVIDIA Nemotron 3 Super served through Amazon Bedrock (serverless) using the nvidia.nemotron-super-3-120b model ID. The agent delivered a median end-to-end latency under 700 ms and 90th percentile latency less than one second, with audio and application logic all running inside the customer VPC.
| Metric | STT | LLM | TTS | End-to-End Latency | Notes |
|---|---|---|---|---|---|
| P50 (median) | 198 ms | 322 ms | 89 ms | 660 ms | Natural conversational range |
| P90 | 235 ms | 427 ms | 411 ms | 979 ms | Sub-1s latency at 90th percentile |
| P95 | 294 ms | 481 ms | 662 ms | 1,282 ms | Bedrock cold-start variance |
A few details in the table are worth attention.
- Nova 3 delivered less than 200 ms median latency, with P95 staying under 300 ms. The steady tail latency is key for natural interruption handling in live conversations.
- End-to-end median latency was under 700 ms, with P90 under one second. Both numbers sit inside the range for natural conversations.
- Tail variance comes from the current Bedrock-served configuration. We expect customer-managed NIM deployment inside the customer's own cluster will shorten the P95 meaningfully once that path is generally available.
- Effective latency should be less than the end-to-end latency. The Voice Agent API streams LLM output directly into TTS, so the user hears speech starting well before the LLM has finished generating the full response.
Deepgram is continuing to push the boundaries of Voice Agent performance with NVIDIA Nemotron, with a focus on reducing latency while preserving the accuracy and consistency that production deployments demand.
Try it online
Open playground.deepgram.com, select Voice Agent, and pick Nemotron 3 Nano from the Think model dropdown. You will be on the stack in less than 60 seconds. The playground is free to use and requires no onboarding or credit card.
Pay attention to a few things while you talk to the agent:
- How fast the agent responds to your first word.
- What happens when you interrupt the agent mid-sentence (barge-in).
- How natural the Aura 2 TTS voice of the agent sounds over a long reply.
- How the agent handles context across a multi-turn conversation.
For more details, check out Deepgram's Voice Agent Getting Started guide and the open-source Voice Agent template apps. The templates include working reference clients that plug directly into the snippet above.
The deployment paths
Most customers ask for local deployment of Deepgram for latency, data security, and governance reasons. Two paths are or will be available for customers to leverage Nemotron and the Voice Agent stack.
Path 1 (Available today). Deepgram Voice Agent components run inside the customer's AWS VPC and call Nemotron 3 Super through Amazon Bedrock (serverless). This is a deployable pattern for any AWS-native team today. It keeps audio and application logic inside the customer network, while inference runs on a Bedrock-managed endpoint that Amazon operates.
Path 2 (Coming soon). Customer-managed NIM, where Deepgram Voice Agent components run alongside a customer-operated NIM microservice inside the customer's own Kubernetes cluster, bare metal, or datacenter, without a dependency on any cloud marketplace or APIs. This pattern is technically possible but not yet customer-ready, and it is the direction we are building toward together with NVIDIA and with enterprise infrastructure partners.
Deploy it yourself
For teams ready to ship, Deepgram's self-hosted documentation covers the full deployment pattern: Voice Agent components running inside a VPC alongside Bedrock serverless with manifest examples, container image sources, environment variables, GPU sizing notes, and NIM model identifiers for Nemotron 3 Nano and Nemotron 3 Super.
What is next: fine-tuning Nemotron for voice
Voice agent workloads have distinctive characteristics. Turns are short, interruption sensitivity is high, and latency budgets are tight.
Deepgram has started exploratory fine-tuning work on Nemotron reasoning models to produce voice-agent-optimized variants using NVIDIA NeMo. Early experiments have been promising. This complements NVIDIA's existing voice-domain models in the Nemotron family, like Nemotron ASR and Magpie TTS, extending the family's coverage of conversational voice workloads end-to-end. Deepgram's goal is to customize Nemotron LLM for conversational voice: smaller, faster, and cheaper than general-purpose reasoning models, with quality held to what enterprise deployments need.
Three doors, one stack
For enterprise teams with data residency or compliance requirements, the Deepgram solutions team will run a full architecture review covering architecture, compliance, and data residency requirements. Start at deepgram.com/contact-us.
1. Try it in a minute. Visit playground.deepgram.com and look for Nemotron 3 Nano in the Think model dropdown.
2. Read the self-hosted documentation. It covers the full setup for running the joint stack inside your own VPC, datacenter, or Kubernetes cluster.
3. Plan an enterprise rollout. The Deepgram solutions team can walk enterprise customers through the self-hosted architecture and the NIM deployment pattern end to end.
No longer do you need to struggle between cloud convenience and on-prem control for voice agents. With Deepgram and NVIDIA Nemotron models, enterprise teams can have both with best-in-class latency.



