Article·Jan 5, 2026

Naively Training a Wake Word Model from Scratch

How we used TTS-generated training data to build a custom wake word detector that doesn't trigger on every third word.

8 min read

By Dan Mishler and Bill Getman


The Hook

The DG Labs team ran a "hands-free week"—an experiment to see how much of our work we could do without touching our computers. Dictation handled most of the typing. Keyboard shortcuts and voice commands covered navigation. But one piece was missing: how do you start a dictation or agent flow when your hands are off the keyboard?

The answer is a wake word. Say a trigger phrase, and your system wakes up and listens. I decided to build one from scratch.

There was just one catch: I'd never trained an audio model before. I barely knew the basics of neural networks, let alone the specifics of keyword spotting—mel spectrograms, depthwise separable convolutions, GRUs. So I did what any reasonable person would do in 2025: I paired up with Claude Code and started building.

So I trained a wake word detector for "Zaphod." (Why Zaphod? Hitchhiker's Guide fan. In hindsight, not the best choice—more on that later.) It worked great: it detected the wake word 100% of the time. But it ALSO triggered on "app," "salad," "testing," "working," and basically anything with a vowel.

My 100% recall came with an 85% false accept rate. I'd built a very expensive random number generator.

This is the story of how we fixed that problem—not by collecting thousands of human voice recordings, but by getting strategic with synthetic voices and understanding what the model really needed to learn.

The Traditional Approach (And Why We Skipped It)

The "right" way to build a wake word detector: collect thousands of voice samples from diverse speakers, across different recording conditions, with careful quality control and legal clearance. Cost: thousands of dollars and months of work. And there's the cold start problem—you need data to train a model, but you need a working product to collect data at scale.

The Synthetic Data Bet

Here's the insight that changed everything: modern text-to-speech has crossed the uncanny valley. The voices coming out of services like Deepgram and ElevenLabs aren't just "good enough"—they're genuinely diverse in the ways that matter for training audio models.

We built our entire training pipeline on synthetic voices: 12 from Deepgram Aura 2, 12 from ElevenLabs, plus 500 of my own recordings to make sure it worked for the primary user. Total cost: ~$0.10 for TTS, plus an afternoon of talking to myself.

The key realization: what matters for wake word detection isn't whether the voice is "real"—it's whether you have enough acoustic diversity in your training data. Synthetic voices from multiple providers, with different voice models and settings, give you that diversity without the logistics nightmare of human recording sessions.
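
To make this concrete, here's a minimal sketch of the generation loop. The voice names and output paths are placeholders, and synthesize() stands in for whichever TTS provider call you use (both Deepgram Aura and ElevenLabs return audio from a single authenticated request). This is an illustration of the approach, not the exact script in the repo.

```python
import itertools
from pathlib import Path

# Placeholder voice lists: the pipeline described above used 12 Deepgram
# Aura 2 voices and 12 ElevenLabs voices; these names are illustrative.
VOICES = {
    "deepgram": ["voice_a", "voice_b"],
    "elevenlabs": ["voice_c", "voice_d"],
}

# Punctuation or phrasing variants can be added here for prosodic variety.
POSITIVE_PHRASES = ["Zaphod"]

def synthesize(provider: str, voice: str, text: str) -> bytes:
    """Placeholder for the provider's TTS request.

    In practice this is one authenticated HTTP call that returns audio
    bytes; see each provider's docs for the exact endpoint and parameters.
    """
    raise NotImplementedError

def generate_positives(out_dir: str = "data/positive") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for provider, voices in VOICES.items():
        for voice, phrase in itertools.product(voices, POSITIVE_PHRASES):
            audio = synthesize(provider, voice, phrase)
            slug = phrase.lower().replace(",", "").replace(" ", "_")
            (out / f"{provider}_{voice}_{slug}.wav").write_bytes(audio)

if __name__ == "__main__":
    generate_positives()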

Problem: Teaching a Model What You're NOT

This is where most wake word projects fail. It's easy to teach a model to recognize "Zaphod"—just play it "Zaphod" a thousand times in different voices. The model will learn the pattern. It will detect "Zaphod" with near-perfect accuracy.

The hard part is teaching it that "saffron," "zap it," "staff on," and "say pod" are not the wake word.

When we first tested our model, the results were brutal:

  Phrase        Model Prediction
  "Zaphood"     ✅ Wake Word (correct)

The model had learned to detect a loose constellation of sounds—/z/, /æ/, /f/, /d/—rather than the specific word "Zaphod." Every word containing similar phonemes triggered a detection.

Why "Zaphod" Was a Terrible Choice (And Why That Made This Interesting)

Here's what I didn't realize when I picked my wake word: "Zaphod" is phonetically cursed.

Break it down:

  • /z/ — a sibilant that sounds similar to /s/, shared with "safari," "zero," "system"
  • /æ/ — the "a" in "cat," one of the most common vowel sounds in English, appearing in "app," "add," "salad," "had," "back"
  • /f/ — shared with "staff," "saffron," "half"
  • /ɒd/ — the ending sound, shared with "salad," "valid," "method," "period"

Compare that to commercial wake words: "Alexa" has an unusual /ks/ cluster. "Hey Siri" has a distinctive vowel pattern. "Okay Google" is four syllables with uncommon structure. These were chosen by teams with linguists on staff.

I picked "Zaphod" because I liked the character. It turned out to be a two-syllable word composed almost entirely of the most common sounds in English (see my previous comments on being a noob).

But here's the upside: if the approach could work for a phonetically difficult wake word, it would work for almost anything. The 85% false accept rate wasn't a bug in the method—it was a stress test.

Phonetic Engineering

The fix required thinking like a linguist. We categorized confusable words: anything with the /æ/ vowel (app, salad, back), words ending in /-d/ (valid, method, period), sibilant-initial words (safari, staff, system), and compound near-misses that approximate "Zaphod" (zap hot, say pod, staff odd).
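
For illustration, the confuser list can live in a simple mapping from phonetic category to phrases; the entries below are the ones mentioned above, and the same synthesize loop from the earlier sketch renders each one in every voice as a negative example.

```python
# Confusable phrases for "Zaphod", grouped by the phonetic categories above.
# Each phrase is synthesized in every TTS voice and labeled as a negative.
CONFUSERS = {
    "ae_vowel":         ["app", "salad", "back", "add", "had"],
    "d_ending":         ["valid", "method", "period"],
    "sibilant_initial": ["safari", "staff", "system"],
    "near_miss":        ["zap hot", "say pod", "staff odd", "zap it", "saffron"],
}

NEGATIVE_PHRASES = sorted({p for group in CONFUSERS.values() for p in group})
```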

We generated hundreds of these negative examples across all our TTS voices. The goal: show the model exactly what "almost Zaphod but not quite" sounds like.

The Feedback Loop

The initial phrase list was educated guesswork. The real power came from running the model against live audio and seeing what actually triggered false positives.

We built a collection script that ran wake word detection alongside live transcription. When the model fired but the transcript showed something other than "Zaphod," we'd save that audio and log what was said (a stripped-down sketch of this loop follows the list below). After a few hours of normal conversation and meetings, patterns emerged:

  • "app store" triggered reliably
  • "safari" was a consistent false positive
  • "appreciate" and "testing" fired constantly
  • Even "I bought an iPad" would sometimes trigger
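
Here's a stripped-down sketch of what that collection loop can look like. It assumes a trained detector object with a predict(window) method returning a wake-word probability, and uses the sounddevice and soundfile packages for microphone capture and saving; the transcription cross-check described above is omitted for brevity.

```python
import time
from collections import deque
from pathlib import Path

import numpy as np
import sounddevice as sd   # assumed available for microphone capture
import soundfile as sf     # assumed available for writing clips

SAMPLE_RATE = 16_000
WINDOW_SECONDS = 1.5       # length of audio the detector scores
HOP_SECONDS = 0.25         # how often we re-score
COOLDOWN_SECONDS = 2.0     # avoid logging the same event repeatedly

def collect_false_accepts(detector, out_dir="false_accepts", threshold=0.5):
    """Save every audio window the detector fires on, until interrupted.

    `detector` is assumed to expose predict(window: np.ndarray) -> float.
    The real script also checked a live transcript and kept only clips
    where the speaker had said something other than "Zaphod".
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    window = deque(maxlen=int(WINDOW_SECONDS * SAMPLE_RATE))
    hop_frames = int(HOP_SECONDS * SAMPLE_RATE)
    last_fire = 0.0

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        while True:
            chunk, _ = stream.read(hop_frames)
            window.extend(chunk[:, 0])
            if len(window) < window.maxlen:
                continue
            score = detector.predict(np.asarray(window))
            if score >= threshold and time.time() - last_fire > COOLDOWN_SECONDS:
                last_fire = time.time()
                clip = out / f"fire_{int(last_fire)}.wav"
                sf.write(clip, np.asarray(window), SAMPLE_RATE)
                print(f"detector fired (p={score:.2f}) -> saved {clip}")
```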

These false positives became training data. We'd TTS-generate the problem phrases across all 24 voices, augment them, and retrain. Each iteration shrank the false positive list.

The Ratio That Matters

Here's the counterintuitive part: your model needs to hear more negative examples than positive ones. A lot more.

We settled on approximately a 1:10 positive-to-negative ratio. For every "Zaphod" sample, the model sees 10 samples of things that aren't "Zaphod"—including both near-miss phonetic confusers and general speech.

This ratio matters because of how the model will be used in production. In real-world deployment, the wake word appears maybe once every few minutes, while the model is constantly processing audio that isn't the wake word. Training needs to reflect that imbalance.
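
A minimal sketch of enforcing that ratio when assembling a training manifest; the directory layout and function names are assumptions for illustration, not the repo's actual structure.

```python
import random
from pathlib import Path

POS_TO_NEG = 10   # one positive for every ten negatives

def build_manifest(pos_dir="data/positive", neg_dir="data/negative", seed=0):
    """Return (path, label) pairs at roughly a 1:10 positive-to-negative ratio.

    Negatives include both phonetic confusers and general speech; if more
    negatives are available than the ratio calls for, sample them down,
    otherwise keep everything we have.
    """
    rng = random.Random(seed)
    positives = sorted(Path(pos_dir).glob("*.wav"))
    negatives = sorted(Path(neg_dir).glob("*.wav"))

    target_neg = len(positives) * POS_TO_NEG
    if len(negatives) > target_neg:
        negatives = rng.sample(negatives, target_neg)

    manifest = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    rng.shuffle(manifest)
    return manifest
```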

The Augmentation Multiplier

Synthetic voices give us a foundation, but real-world audio is messy—background noise, room reverb, varying distances from the microphone. Data augmentation bridges the gap.

We apply room impulse responses (simulating different acoustic spaces), background noise injection (office ambiance, street sounds), and pitch/time variation (capturing natural speech differences). The distribution: 10% clean samples, 60% realistic conditions, 30% challenging scenarios.
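
Here's a rough sketch of one way to realize those augmentations and the 10/60/30 split with numpy, scipy, and librosa. The SNR ranges, pitch/time bounds, and the choice to apply pitch/time variation in every non-clean bucket are illustrative assumptions, not the pipeline's exact settings.

```python
import numpy as np
import librosa                        # assumed available for pitch/time shifts
from scipy.signal import fftconvolve

SR = 16_000
rng = np.random.default_rng(0)

def add_noise(clean, noise, snr_db):
    """Mix background noise into a clip at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def apply_rir(clean, rir):
    """Convolve with a room impulse response to simulate reverberation."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-10)

def augment(clean, rirs, noises):
    """Produce one augmented copy, drawn from the 10/60/30 bucket split."""
    bucket = rng.choice(["clean", "realistic", "hard"], p=[0.1, 0.6, 0.3])
    if bucket == "clean":
        return clean
    # mild pitch/time variation in the non-clean buckets (ranges illustrative)
    y = librosa.effects.time_stretch(clean, rate=float(rng.uniform(0.9, 1.1)))
    y = librosa.effects.pitch_shift(y, sr=SR, n_steps=float(rng.uniform(-2, 2)))
    y = apply_rir(y, rirs[rng.integers(len(rirs))])
    snr = rng.uniform(10, 20) if bucket == "realistic" else rng.uniform(0, 10)
    return add_noise(y, noises[rng.integers(len(noises))], snr)
```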

The result: 1,000 base TTS samples become 400,000+ training examples, each representing a different acoustic scenario the model might encounter in production.

Architecture and Runtime

We evaluated three architectures (DS-CNN, GRU, BC-ResNet-8) and landed on GRU for our Raspberry Pi deployment—small enough for <10% CPU, accurate enough for low false accept rates, fast enough for <200ms latency. Runtime safeguards (energy thresholds, VAD, detection cooldown) handle the practical stuff. Details in the repo README.
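
For a sense of scale, a GRU keyword spotter over log-mel frames can be as small as this PyTorch sketch; the layer sizes and frame counts are illustrative, not the repo's exact configuration.

```python
import torch
import torch.nn as nn

class GRUWakeWord(nn.Module):
    """Small GRU keyword spotter over log-mel frames (sizes illustrative)."""

    def __init__(self, n_mels: int = 40, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, time, n_mels) log-mel spectrogram frames
        _, h = self.gru(mels)                 # h: (layers, batch, hidden)
        return self.head(h[-1]).squeeze(-1)   # one wake-word logit per clip

model = GRUWakeWord()
logits = model(torch.randn(8, 150, 40))   # e.g. 1.5 s of 10 ms frames
probs = torch.sigmoid(logits)
```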

Results

After implementing the full pipeline—synthetic data generation, phonetic negative mining, aggressive augmentation, and architecture tuning—the numbers told the story:

Before (naive approach):

  • Recall: 100%
  • False Accept Rate: 85%
  • Verdict: Unusable

After (engineered approach):

  • Recall: 95%+
  • False Accept Rate: <10%
  • Verdict: Production-ready

Total investment:

  • TTS API costs: ~$0.10
  • Compute time: ~16 hours (mostly automated)
  • Human time: ~3 hours of setup and monitoring

Compare that to the traditional approach of human data collection, and the economics aren't even close.

What We Learned

  1. Negative data matters more than positive data. Anyone can train a model to recognize a word. The skill is training it to not recognize everything else.
  2. Think like a linguist. What sounds could be confused with your wake word? Generate negative examples for all of them.
  3. Synthetic voices are good enough. Modern TTS produces the acoustic diversity you need. Multiple providers and voice models give you variation without logistics overhead.
  4. The 1:10 ratio isn't optional. Class imbalance in wake word detection is extreme. Your training data needs to reflect that reality.

Try It Yourself

The complete pipeline is open source: github.com/deepgram/dglabs-wakeword

Swap "Zaphod" for your own wake word, generate your data, train your model. The whole process takes a day or two, not months.-----Have questions or built something cool with this approach? We'd love to hear about it.
