VAQI, Revisited: How OpenAI’s gpt‑realtime Stacks Up — With Sensitivity Analysis for Real‑World Priorities

A quick VAQI refresher
Methodology
The latest raw metrics (average across runs)
Sensitivity analysis: “What matters most to you?”
VAQI scores under different weightings
How Priorities Shift the Leaderboard
Equal Weighting (33/33/33)
Interruption/Miss-Heavy (40/40/20)
Latency-Heavy (30/30/40)
Where we go from here

Share this article

By Dan Mishler

Last Updated

Sep 2, 2025

In our last post, we introduced the Voice‑Agent Quality Index (VAQI)—a single score that captures how a voice agent feels to talk to by combining three timing behaviors that matter to humans:

Interruptions (I): the agent talking over you.
Missed End‑of‑Thought windows (M): moments when the user clearly stops and the agent should speak, but doesn’t.
Latency (L): how long the agent takes to start responding after you finish.

Since then, OpenAI released gpt‑realtime, a production‑ready, speech‑to‑speech model that promises lower latency, more natural prosody, and improved tooling for real‑time voice agents. That is exactly the kind of improvement VAQI is designed to measure. Let’s see how it does.

A quick VAQI refresher

Voice Agent Quality Index or VAQI condenses I, M, and L into a 0–100 score (higher is better) after normalizing each factor against the worst performer in the comparison set:

We start with equal weights (⅓ each) to keep things simple and transparent, then we run sensitivity tests so teams can emphasize what matters most to their product: politeness (interruptions), reliability of “your turn” detection (miss rate), or snappiness (latency).

Methodology

We use a food‑ordering scenario as our reference dialogue—a deceptively simple flow that includes natural pauses and fillers (“uh, can I get a…”), occasional contradictions (“make it large—sorry—medium”), realistic background noise, and sparse EOT windows where the agent is expected to speak. Each provider processes the same audio multiple times to smooth over LLM non‑determinism and transient TTS/network variance. We align all events back to the start of the WAV to measure EOT windows precisely.

For this follow‑up we limited comparisons to three providers: Deepgram, OpenAI (gpt‑realtime), and ElevenLabs.

The latest raw metrics (average across runs)

Provider	Runs	Avg Interruptions ↓	Avg Miss Rate ↓	Avg Latency (s) ↓
Deepgram	231	1.20	0.427	0.85
OpenAI (gpt‑realtime)	231	1.26	0.402	2.55
ElevenLabs	231	3.11	0.484	0.53

How to read this: lower is better on all three raw metrics. OpenAI shows the best Miss Rate, Deepgram has the fewest Interruptions with sub‑second‑ish Latency, and ElevenLabs remains the fastest—but with significantly more Interruptions and Misses.

Note on comparability with our first post: VAQI normalizes each factor against the worst performer within the set being compared, scores and relative gaps shift slightly due to the non-deterministic nature of AI models.

Sensitivity analysis: “What matters most to you?”

We computed VAQI under several weightings to reflect different priorities:

Equal: 33/33/33 (I/M/L)
Conversational Accuracy: 40/40/20 (I/M/L)
Latency: 30/30/40 (I/M/L)
(Plus two extended mixes we looked at: 60/20/20 I‑heavy, and 20/60/20 M‑heavy.)

VAQI scores under different weightings

Provider	VAQI 33/33/33	VAQI 40/40/20	VAQI 30/30/40	VAQI 60/20/20	VAQI 20/60/20
Deepgram	46.54	42.56	48.53	52.49	32.63
OpenAI (gpt‑realtime)	25.41	30.49	22.87	39.04	21.95
ElevenLabs	26.40	15.84	31.68	15.84	15.84

Three quick takeaways:

Deepgram remains #1 across all mixes we tested, reflecting a balanced profile: low Interruptions, competitive Miss Rate, and sub‑second‑ish Latency.
OpenAI (gpt‑realtime) benefits when Interruptions/Misses matter more (40/40/20, 60/20/20). Its best‑in‑set Miss Rate pays dividends, and its mean Latency, while higher than Deepgram and ElevenLabs, is notably improved over our earlier OpenAI runs.
ElevenLabs climbs in Latency‑heavy contexts thanks to its very low Latency with a tiny long tail. To improve further, the main levers are reducing Interruptions and Misses without sacrificing speed.

Why sensitivity? Because “best” is contextual. A voice agent for regulated financial services will weigh politeness and turn‑taking discipline more heavily; a hands‑busy experience (e.g., in‑vehicle or on‑device assistants) may prize snappiness above all. The sensitivity table lets you read the same data through your own priority lens.

How Priorities Shift the Leaderboard

The Voice-Agent Quality Index (VAQI) is designed to show how different priorities change outcomes. By adjusting the weighting of Interruptions (I), Misses (M), and Latency (L), we can see how provider performance shifts depending on what matters most.

To make this easier to digest, we visualized three core mixes in a carousel:

Equal Weighting (33/33/33)

Deepgram leads with the strongest overall balance. OpenAI and ElevenLabs trail, with ElevenLabs boosted by fast latency and OpenAI benefiting from its strong Miss Rate.

Interruption/Miss-Heavy (40/40/20)

OpenAI climbs to #2 thanks to its best-in-set Miss Rate, while Deepgram stays #1 by combining low interruptions with competitive latency. ElevenLabs lags due to higher interruptions and misses.

Latency-Heavy (30/30/40)

When speed is the highest priority, the VAQI metric returns these metrics for each provider.

ElevenLabs improves under a latency-first lens, while OpenAI falls back. Deepgram maintains leadership thanks to sub-second latency and low interruptions, showing resilience even in speed-focused contexts.

The takeaway: performance is contextual. VAQI turns subjective impressions into an objective, repeatable score, and the right weighting depends on your use case.

How gpt‑realtime changed the picture

Compared to earlier OpenAI runs we published, gpt‑realtime shows improved average Latency and a stronger Miss Rate (hitting the “right window” to speak more consistently). Under equal weighting with this three‑provider set, ElevenLabs edges OpenAI because Latency becomes the dominant differentiator; but as soon as we emphasize Interruptions/Misses, OpenAI retakes #2. That’s exactly what we’d hope to see when a provider optimizes its speech‑to‑speech path: fewer missed turns without letting Interruption rates creep up. Deepgram’s blend of conversational accuracy and excellent average latency kept it #1 in all scenarios.

Where we go from here

Deepgram leads across mixes—but it isn’t 100/100. We still interrupt occasionally and can shave off more latency. VAQI keeps us honest and focused on what improves real conversations, not just benchmarks.
OpenAI (gpt‑realtime) is trending in the right direction. Tightening the long‑tail Latency will likely move its VAQI materially in “politeness‑first” mixes where it’s already close.
ElevenLabs remains the snappiness reference. Reducing Interruptions/Misses while preserving its Latency edge would shift its standing quickly in equal or I/M‑heavy weightings.

Most importantly, VAQI turns a qualitative reaction—“this bot sounds good”—into a quantitative, repeatable measurement that product, QA, procurement, and engineering can rally around. And by publishing sensitivity analyses, we aim to be even more fair: you can judge performance through the weights that match your brand’s priorities.

We’ll keep running this benchmark as models evolve and will share updates over time. If there’s a weighting or scenario you care about (e.g., medical intake, travel rebooking, or IVR deflection), tell us—we’ll test it and report the numbers.