We've written about why bad audio, not bad AI, is what breaks most voice agents, and why the fix spans the hardware, network, model, and application layers together (start here). Restaurants are where all of that stops being theoretical. They stack every one of those problems on top of each other, in the same few seconds, with a paying customer waiting.
This post is the restaurant deep-dive: the acoustics, the failure modes, and what the model has to do about them.
The acoustic environment in a restaurant
Count the noise sources inside a restaurant. Fryer hiss, an exhaust hood, a drink machine, POS beeps, coworkers yapping, music overhead, and hard surfaces reflecting all of it back as echo. Drive up to a drive-thru and add a car engine and wind. The signal-to-noise ratio (the level of the speaker's voice compared to everything else) can drop low enough that the words you care about are quieter than the environment around them.
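To put a number on that, here's a minimal sketch of how signal-to-noise ratio is usually expressed in decibels. The sample values are made up for illustration (a voice at half the amplitude of the room noise), and the function names are ours, not part of any library.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 20 * log10(rms_speech / rms_noise)."""
    return 20 * math.log10(rms(speech) / rms(noise))

# Hypothetical one-second clips at 16 kHz: the voice has half the amplitude of the noise.
RATE = 16000
speech = [0.1 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE)]
noise = [0.2 * math.sin(2 * math.pi * 1000 * n / RATE) for n in range(RATE)]

# A negative result means the voice is quieter than the room around it.
print(round(snr_db(speech, noise), 1))
```

Halving the voice's amplitude relative to the noise costs about 6 dB, so it doesn't take much fryer and exhaust hood to push the ratio below zero.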
Then add the agent's own voice. Its speech plays through a speaker inches from the same microphone, so without acoustic echo cancellation the model transcribes the agent's own words and treats them as the customer interrupting.
More than one mouth
Another hard problem is that a restaurant order is rarely one person talking. A few cases that break a naive setup:
- Two kiosks side by side. Customer A says "large fries" while customer B, two feet away, says "no onions," and a first-speaker system splices them into one order.
- A friend chiming in. The customer orders "a number three combo," and the person next to them says "make that two." If the agent can't tell the customer's voice from the friend's, it doesn't know what "two" means here: the number two combo, or two of the threes?
- A kid begging for a cookie. The customer is ordering while their kid shouts "I want a cookie" in the background. The agent has to recognize that the cookie came from a non-primary voice the parent hasn't agreed to, and leave it off the order.
What models can do to help
Diarization (separating Speaker A from Speaker B) and "just take the first voice" both fall short here. Neither one answers the question that matters in a crowded room: which voice is the customer the agent is serving?
That's the primary speaker identification problem. The model holds a rolling enrollment window on the primary speaker (the short, updating audio sample it uses to recognize them) and has to tune its length to the context. Too short and it jumps to whoever is loudest; too long and it can't follow a real handoff, like a pair of friends each placing their order.
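The window-length tradeoff is easier to see with a toy example. The sketch below is an illustration under heavy assumptions, not our production model: voices are reduced to hypothetical two-dimensional embeddings, the window is a fixed number of frames, and every frame is added to the window unconditionally. `RollingEnrollment` and its methods are names invented for this example.

```python
from collections import deque
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class RollingEnrollment:
    """Keeps the most recent `window` voice embeddings as the primary-speaker profile."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)  # old frames fall off automatically

    def update(self, embedding):
        self.frames.append(embedding)

    def score(self, embedding):
        """Similarity between a new frame and the average of the window."""
        n = len(self.frames)
        dims = len(self.frames[0])
        centroid = [sum(f[i] for f in self.frames) / n for i in range(dims)]
        return cosine(centroid, embedding)

# Hypothetical embeddings: the customer and the friend sound nothing alike.
customer, friend = [1.0, 0.0], [0.0, 1.0]

short, long_ = RollingEnrollment(window=3), RollingEnrollment(window=10)
for frame in [customer] * 10 + [friend] * 3:  # the friend takes over for 3 frames
    short.update(frame)
    long_.update(frame)

print("short window, friend score:", round(short.score(friend), 2))
print("long window, friend score:", round(long_.score(friend), 2))
```

After three frames from the friend, the short window has forgotten the customer entirely, while the long window still scores the friend as a poor match. Neither is right in every situation, which is why the length has to follow the context.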
From there it transcribes everything and tags each span as primary or non-primary, with a confidence score attached. The agent gets a policy instead of a guess: act on high-confidence primary speech, confirm when a primary item lands with low confidence ("you said no onions, correct?"), and ignore or ask about non-primary speech rather than folding it into the order. So "make that two" from the friend gets flagged for agent review, not silently added to the check.
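As a rough sketch, that policy could look like the function below. The `Span` type, the action names, and the 0.8 threshold are all assumptions made up for this example; a real deployment would tune the threshold and may track separate confidences for the words and for the speaker tag.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    is_primary: bool   # did this come from the customer being served?
    confidence: float  # 0.0 to 1.0, hypothetical single score

CONFIRM_THRESHOLD = 0.8  # assumed value, for illustration only

def decide(span, threshold=CONFIRM_THRESHOLD):
    """Map a tagged span to an agent action."""
    if not span.is_primary:
        return "flag"     # ignore or ask about it; never add it silently
    if span.confidence >= threshold:
        return "act"      # high-confidence primary speech goes on the order
    return "confirm"      # low-confidence primary speech gets read back

spans = [
    Span("a number three combo", True, 0.95),
    Span("no onions", True, 0.55),
    Span("make that two", False, 0.90),
    Span("I want a cookie", False, 0.97),
]
for span in spans:
    print(f"{span.text!r}: {decide(span)}")
```

Note that the friend's "make that two" and the kid's cookie are both flagged no matter how clearly they were heard: confidence in the words doesn't make a non-primary voice part of the order.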
Hardware is vitally important
Before audio ever reaches a model, the hardware setup goes a long way toward deciding whether a voice agent succeeds in a restaurant.
Placement comes first. Keep the mic as close to the customer and as far from the speaker as possible, or else the agent hears itself. Echo cancellation helps: feed the speaker's output back in as a reference signal so the agent's own voice can be subtracted from what the mic picks up. Beamforming (aiming an array of mics at the customer) buys more in loud, open spaces. Outdoors, fit mics with windscreens and weather shielding.
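To make the reference-signal idea concrete, here's a minimal sketch using a normalized least-mean-squares (NLMS) adaptive filter, one common textbook approach to echo cancellation. It isn't a description of what any particular device runs. The echo path (two delayed, attenuated copies of the speaker output), the tap count, and the step size are all invented for the example, and the mic hears only echo so the effect is easy to measure.

```python
import random

def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `reference` from `mic`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    cleaned = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # what is left once the estimated echo is removed
        power = sum(x * x for x in history) + eps
        # Nudge the filter toward whatever would have cancelled this sample.
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned

def energy(samples):
    return sum(s * s for s in samples)

# Hypothetical setup: the agent's voice is stood in for by seeded random noise,
# and the room returns it to the mic as two delayed, quieter copies.
rng = random.Random(0)
reference = [rng.uniform(-1, 1) for _ in range(4000)]
mic = [
    0.6 * (reference[n - 2] if n >= 2 else 0.0)
    + 0.3 * (reference[n - 5] if n >= 5 else 0.0)
    for n in range(4000)
]

cleaned = nlms_echo_cancel(mic, reference)
residual = energy(cleaned[-500:]) / energy(mic[-500:])
print("fraction of echo energy left after adapting:", residual)
```

With a customer talking at the same time, their voice would stay in the cleaned signal, since it doesn't correlate with the reference. Real rooms have much longer echo paths and need far more taps than this toy does.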
The catch is that the hardware market barely serves this. Most mics are built for podcast booths or quiet desktops, not for a noisy counter or outdoor use, and quality is hard to measure. So we lean on hardware we've proven in the field, like HME Nexeo, which supports our agents in both drive-thrus and employee-assist headsets. HME’s hardware also provides on-device audio processing, which lets our models layer additional quality tuning on top.
Restaurant audio as a research focus
Our voice researchers spend their time on exactly these conditions: overlapping speakers, real echo, horrible signal-to-noise ratios. A model can post a great word error rate and still ring up the wrong combo, so what we optimize for is whether the agent understood the order (and plugged it into the POS correctly, of course).
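Here's a small illustration of that gap, with a made-up order and a made-up transcript. Word error rate is computed the standard way (word-level edit distance divided by reference length); the `combo_number` helper is a toy stand-in for real order understanding and only exists for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def combo_number(text):
    """Toy order parser: returns the word right after 'number', if any."""
    words = text.split()
    if "number" in words[:-1]:
        return words[words.index("number") + 1]
    return None

said = "i would like a number three combo with a large fries and no onions"
heard = "i would like a number two combo with a large fries and no onions"

wer = word_error_rate(said, heard)
order_correct = combo_number(said) == combo_number(heard)
print(f"WER: {wer:.1%}, order correct: {order_correct}")
```

One wrong word out of fourteen is a respectable error rate on paper, and the customer still gets the wrong meal.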
Most voice agents are built for quiet rooms. Yours isn't one, and neither are the conditions we train for. Drop us into your loudest location and listen for yourself.








