From Gut Feel to Hard Numbers: Meet the Voice Agent Quality Index (VAQI)

Why Voice Agent Quality Still Feels Like the Wild West
If you’re building a Voice Agent application, you already know this truth: users don’t hang up or ask for a human because a voice bot gets a transcription slightly wrong—they quit because the conversation just feels…annoying.
Annoyance hides in three places:
Interruptions – The bot barges in while the caller is mid‑thought.
Long, awkward gaps – The user wonders whether the bot hung up.
Missed response windows – The customer stops, gives the hint “your turn,” and… nothing.
Individually, each of these signals tells only part of the story, which is why teams end up falling back on qualitative reviews: “That demo sounded pretty good.” But in a world of non-deterministic outputs, relying on qualitative analysis for applications deployed at scale is risky at best. And at Deepgram, we decided “pretty good” is not good enough for enterprise SLAs, so we built a scoring system that turns “feels right” into a hard, quantifiable metric.
Introducing the Voice‑Agent Quality Index
The Voice‑Agent Quality Index (VAQI) condenses the three timing pillars—interruptions (I), missed response windows (M), and latency (L)—into a single 0‑to‑100 score:
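As a rough sketch of that weighted combination (the exact published expression may differ, but the weights match the normalization methodology below):

VAQI = 100 × (1 − (0.40 · I_norm + 0.40 · M_norm + 0.20 · L_norm))

where I_norm, M_norm, and L_norm each run from 0 (best) to 1 (worst) within a conversation.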


🎥 Watch: How to Measure Voice Agent Quality
See VAQI in action with side-by-side demos of ElevenLabs, OpenAI, and Deepgram in a real-world ordering scenario.
Normalization Methodology:
Interruptions (40% weight): Within each conversation, we divided each provider's interruption count by the highest interruption count among all providers for that conversation, creating a 0-1 scale where 0 = fewest interruptions (best) and 1 = most interruptions (worst). This per-conversation approach accounts for the fact that some conversations are inherently more challenging (even for a human!) and naturally provoke more interruptions across all providers.
Missed Responses (40% weight): Similar to interruptions, we normalized missed regions within each conversation by dividing each provider's missed count by the maximum missed count for that conversation. This ensures that a provider missing 3 out of 5 regions on an easy conversation is penalized more heavily than one missing 3 out of 14 regions on a difficult conversation with many end-of-thought (EoT) opportunities.
Latency (20% weight): We first applied a log transformation (log(1 + latency)) to reduce the impact of extreme outliers, then normalized within each conversation by dividing by the maximum log-transformed latency for that conversation. The log transformation prevents a few catastrophically slow responses from completely dominating the score while still penalizing high latency appropriately.
This approach ensures that VAQI scores reflect relative performance rather than absolute numbers, making the metric robust across different audio difficulties and preventing any single "hard" conversation or extreme outliers from skewing the overall rankings.
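To make the weighting concrete, here is a minimal Python sketch of the per-conversation normalization and the 40/40/20 combination; the provider names, raw counts, and the final 0-100 mapping are illustrative stand-ins rather than our production pipeline.

```python
import math

# Illustrative raw results for one conversation: interruption count,
# missed-response count, and average latency (seconds) per provider.
conversation = {
    "provider_a": {"interruptions": 2, "missed": 1, "latency_s": 0.9},
    "provider_b": {"interruptions": 5, "missed": 3, "latency_s": 2.4},
    "provider_c": {"interruptions": 0, "missed": 2, "latency_s": 4.1},
}

def normalize(raw):
    """Scale values to 0-1 within this conversation (0 = best, 1 = worst)."""
    worst = max(raw.values())
    return {p: (v / worst if worst else 0.0) for p, v in raw.items()}

# Interruptions and missed responses: divide by the worst count in the conversation.
i_norm = normalize({p: r["interruptions"] for p, r in conversation.items()})
m_norm = normalize({p: r["missed"] for p, r in conversation.items()})

# Latency: log(1 + latency) first to soften extreme outliers, then normalize.
l_norm = normalize({p: math.log1p(r["latency_s"]) for p, r in conversation.items()})

# 40/40/20 weighting, mapped to a 0-100 score (this exact mapping is an assumption).
for provider in conversation:
    penalty = 0.4 * i_norm[provider] + 0.4 * m_norm[provider] + 0.2 * l_norm[provider]
    print(f"{provider}: VAQI ≈ {100 * (1 - penalty):.1f}")
```

A full benchmark would then average these per-conversation scores across the repeated runs described in the methodology below.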
Why We Developed VAQI in the First Place
Deepgram’s enterprise customers kept telling us one thing:
Speed alone doesn’t guarantee grace. A sub‑300ms STT model feeding a large‑language model that stalls for four seconds is still a four‑second wait. Conversely, an agent with impeccable language understanding is useless if it talks over people. For Deepgram to deliver a Voice Agent API that delights our customers, we needed a yardstick that rewarded a balanced approach. That yardstick became VAQI.
The Test Bed: Food Ordering, on Purpose
We picked a food‑ordering scenario as our reference dialog—a deceptively simple task that packs most of the landmines voice agents struggle with in a condensed form:
Natural pauses & fillers – “Hi, um… can I get a…”
Contradictory statements – “Make that large… no, sorry, medium.”
Background noise – A restaurant kitchen track layered onto the audio.
Sparse and sporadic response windows – Only a handful of ideal “agent can/should speak now” spots.
In other words, we used the hard audio that enterprises have to deal with every day, not the perfectly polished, studio-grade recordings you’d get from a recording booth.
Methodology at a Glance
Enterprise focus – We streamed pre‑recorded calls as 16 kHz PCM over secure websockets to four live providers: Deepgram, OpenAI, ElevenLabs, and Azure.
50 ms chunks – All agents received identical slices, synchronized to the microsecond.
Multiple passes, same audio – Each provider processed the same call at least ten times. LLM reasoning, TTS load, and network jitter introduce natural variance; repeated runs let us neutralize flukes and report reliable averages.
Full‑stack timestamps – We captured provider events, aligned them back to the start of the WAV file, and calculated I, M, and L for every run (a simplified sketch of this step follows the list).
Outlier control – Results beyond the 95th percentile latency or with obvious disconnect faults were flagged, but not discarded—VAQI is designed to punish brittle behavior.
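For a sense of the harness mechanics, here is a hedged sketch of slicing a 16 kHz, 16-bit mono recording into the 50 ms chunks each provider receives, and of deriving response latencies from events aligned to the start of the WAV file; the event field names are hypothetical.

```python
import wave

CHUNK_MS = 50  # every provider receives identical 50 ms slices

def pcm_chunks(path: str):
    """Yield 50 ms chunks of raw PCM from a 16 kHz, 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        assert wav.getframerate() == 16_000 and wav.getsampwidth() == 2
        frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000  # 800 samples
        while chunk := wav.readframes(frames_per_chunk):
            yield chunk  # 1,600 bytes per full chunk (800 samples x 2 bytes)

def response_latencies(events):
    """Pair each end-of-user-turn with the provider's next agent-audio event.

    `events` is a list of {"type": ..., "t": seconds} dicts already aligned
    to the start of the WAV file; the field names here are hypothetical.
    """
    latencies = []
    replies = sorted(e["t"] for e in events if e["type"] == "agent_audio")
    for turn_end in (e["t"] for e in events if e["type"] == "user_turn_end"):
        later = [t for t in replies if t > turn_end]
        if later:
            latencies.append(later[0] - turn_end)
    return latencies
```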
What We Learned
Single‑metric bragging rights are easy to game. One vendor showed a near‑perfect interruption rate but suffered intermittent, inexplicable multi‑second delays. Their VAQI still landed below 70/100 because the conversation dragged.
Latency dominates perception once it crosses ~3 s. Runs with I and M near zero but L ≥ 3 s scored in the mid‑50s—users perceive silence as incompetence.
Balance wins. Deepgram combined sub‑second response latency with almost no barging‑in and zero missed responses on half the runs, topping the charts at 70+.
Why VAQI Beats the Back‑of‑the‑Napkin Test
Actionable Targets – Engineering teams can chase concrete sub‑scores: cut interruptions by 50% or shave 300 ms off latency.
Procurement Clarity – Vendor A at 72 and Vendor B at 68? You now know which POC is worth the next sprint—and why.
End‑to‑End Accountability – VAQI doesn’t care whether the lag comes from ASR, LLM inference, or TTS synthesis. If the user waits, the score drops. Perfect incentive alignment.
From Qualitative to Quantitative—For the First Time
Until now, “pleasantness” was the last unmeasured frontier in voice automation. With VAQI we’ve proven that the feel of a conversation can be captured, scored, and tracked over time. That turns casual opinions into hard data—and data is something every executive, product manager, and engineer can act on.
But let’s be clear: a 71.5 is not a perfect 100. Deepgram currently tops our internal voice agent leaderboard because we’ve invested heavily in fast, accurate STT, natural TTS voices, and precise End‑of‑Thought detection, yet we’ve still targeted many areas for continued, rapid improvement. VAQI isn’t a victory lap; it’s a scoreboard that keeps us honest – and hungry.
In the coming months we’ll publish updated VAQI tables, side‑by‑side audio samples, and deep dives into what moved each number. If you’re building or buying a voice agent, keep an eye on this space and hold us, and every vendor you evaluate, to a higher standard. Because the agents that sound effortless are the ones customers call again tomorrow.