A basic speech-to-text model is built to transcribe. So it transcribes. The TV in the background, the echo of your own agent bouncing off a wall, all of it gets turned into text and handed to the model as if it mattered. In a quiet room with a headset, that's fine. In the real world, it doesn’t pass the Turing Test.
When a voice agent misbehaves, the instinct is to blame the model or the prompt. Usually the problem started earlier, when the audio was still audio.
Background noise breaks things in two ways
Background noise causes two problems: a transcript full of errors, and an agent that keeps stopping when nobody interrupted it.
A transcript full of errors means a higher word error rate (WER: the share of words the model substitutes, drops, or inserts, measured against what was actually said). This is the classic speech-to-text problem.
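For readers who want to see the metric concretely, here is a minimal sketch of the standard WER calculation: word-level edit distance divided by the number of words in the reference. The function name and the simple lowercase-and-split normalization are assumptions for illustration; real evaluations usually normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of three gives a WER of one third.
print(word_error_rate("two large fries", "two large flies"))
```

Because insertions count as errors, a TV in the background that adds words nobody said to the agent raises WER even when every word the caller spoke was transcribed correctly.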
A misbehaving agent kills customer experience. Voice agents watch for the caller to start talking so they know when to stop speaking and listen. But a dog bark or a slammed cabinet can read as "the human is talking now," so the agent stops mid-sentence and waits for a person who never said anything. How awkward!
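To see why this happens, consider a deliberately naive interruption detector that only measures loudness. This is a hypothetical sketch, not how any particular product works: the function names, the 0.1 threshold, and the three-frame rule are all assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def naive_barge_in(frames, threshold=0.1, min_frames=3):
    """Return the index of the frame where the agent would stop talking.

    The agent 'hears an interruption' once `min_frames` consecutive frames
    are louder than `threshold`. Returns None if that never happens.
    """
    loud = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            loud += 1
            if loud >= min_frames:
                return i
        else:
            loud = 0  # the loud run was broken, so start counting again
    return None


quiet = [0.01] * 160  # low-level room noise
bark = [0.5] * 160    # a loud, non-speech sound

# Nobody speaks, but four loud frames in a row are enough to stop the agent.
frames = [quiet] * 5 + [bark] * 4 + [quiet] * 5
print(naive_barge_in(frames))  # 7
```

Energy alone cannot tell a bark from a sentence, which is why the decision needs more information than volume.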
Voice quality is a system, not a hack
There's no single way to fix bad audio inputs, because they pass through several hands before the model ever sees them.
At the hardware layer, beamforming helps. A microphone array can be steered to favor sound from the speaker's direction and suppress the rest, like cupping your hands around your ears in a crowd. The right hardware solution is highly environment-dependent, and Deepgram has industry-specific perspectives (read about our restaurant perspective here).
At the network layer, telephony providers matter more than people think. A call routed badly arrives degraded no matter how good the model is. Deepgram has deep relationships with telephony providers to solve for this layer.
At the model layer, you get noise reduction, echo cancellation (so the agent doesn't transcribe its own voice coming back at it), and custom acoustic models trained on the conditions you operate in. This is where the model you pick matters most. A general-purpose transcription model turns audio into text to read later. A frontier model purpose-built for voice agents is trained on the messy, real-time conditions an agent lives in, so it handles noise and overlap as a human would. That’s Deepgram’s bread and butter.
At the application layer, you can pass along metadata about overlapping speech, so the system downstream knows two people were talking at once. With that flag in hand, the agent can lower its confidence on the words it caught during the overlap, or ask the caller to repeat the part it couldn't separate, instead of treating a guess as a fact. This is the extra credit that Deepgram has signed up for.
Each layer does something the others can't, and skipping one shows up as a problem in the layers above it.
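As a rough illustration of the application-layer idea, the sketch below shows how an agent might use an overlap flag to discount words and decide whether to ask for a repeat. Everything here is an assumption made for the example: the `Word` record, its `overlap` field, the penalty, and the threshold are hypothetical and do not describe an actual Deepgram response format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float
    overlap: bool  # hypothetical flag: word was caught while two people spoke


def plan_response(words, penalty=0.5, min_confidence=0.6):
    """Discount words heard during overlap, then decide what the agent does.

    Returns ("clarify", [unclear words]) if any word falls below
    `min_confidence` after the discount, else ("proceed", []).
    """
    unclear = []
    for word in words:
        confidence = word.confidence * penalty if word.overlap else word.confidence
        if confidence < min_confidence:
            unclear.append(word.text)
    if unclear:
        return "clarify", unclear
    return "proceed", []


order = [Word("two", 0.95, False), Word("burgers", 0.90, True)]
print(plan_response(order))  # ('clarify', ['burgers'])
```

Without the flag, "burgers" at 0.90 looks trustworthy; with it, the agent knows to ask again instead of treating a guess as a fact.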
Teaching an agent who it's talking to
To improve the model layer, teams reach for diarization first, the feature that labels who said what (Speaker A, Speaker B). It's usually not enough.
When there is no Speaker B, just a car honking, metadata should tell the agent whether to ignore the noise, treat it as an interruption, or ask for clarification.
When there is a Speaker B, it becomes a primary speaker identification problem: figuring out which voice the agent is serving. Most systems treat all speech as equal. Diarization tells you Speaker A and Speaker B are different people, but not which one is the ordering customer and which one is the friend in the background. Primary speaker identification flags the person the agent serves as primary and treats every other voice as non-primary.
We go a step further with a rolling enrollment window: a short, updating sample of recent audio the model uses to learn the primary speaker. When the important voice changes partway through, like a manager taking over from an employee, it follows the primary speaker to the new voice instead of locking onto the first person it heard.
This is why we build voice models specifically for agents rather than repurposing a transcription model. Primary speaker identification only works when it runs inside the model, on the raw audio, in real time.
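The windowing idea itself can be shown with a toy example. This is a conceptual sketch only, not Deepgram's implementation, which, as noted above, runs inside the model on raw audio. It assumes each audio segment has already been reduced to a voice embedding (a list of floats), and that something upstream decides which segments to enroll. The class name, method names, and window size are hypothetical.

```python
import math
from collections import deque


class RollingEnrollment:
    """Keep a short window of recent primary-speaker embeddings."""

    def __init__(self, max_segments=3):
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which is what makes the enrollment "rolling".
        self._window = deque(maxlen=max_segments)

    def enroll(self, embedding):
        self._window.append(list(embedding))

    def profile(self):
        """Average of the embeddings currently in the window."""
        if not self._window:
            raise ValueError("no enrolled audio yet")
        n = len(self._window)
        return [sum(column) / n for column in zip(*self._window)]

    def similarity(self, embedding):
        """Cosine similarity between a new segment and the current profile."""
        p = self.profile()
        dot = sum(a * b for a, b in zip(p, embedding))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in embedding))
        return dot / norm if norm else 0.0


employee, manager = [1.0, 0.0], [0.0, 1.0]  # toy two-dimensional "voices"
tracker = RollingEnrollment(max_segments=3)

for _ in range(3):
    tracker.enroll(employee)
before = tracker.similarity(manager)

for _ in range(3):
    tracker.enroll(manager)  # the manager takes over the conversation
after = tracker.similarity(manager)

print(before < after)  # True
```

A fixed enrollment would keep scoring the manager as a stranger. Because old samples age out of the window, the profile follows whoever has been the primary voice most recently.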
As voice agents start handling work a business can't afford to get wrong, what matters is whether the agent understood, not how few words it missed. A model can post a great word error rate and still confirm the wrong order.
Test us in production, in your loudest room, and you'll hear the difference between a model that transcribed the words and one that understood them. And, of course, feel free to contact our team here if you’re interested in more information!








