Large vocabulary speech recognition gets tricky in production when your users say terms a general model hasn't learned well. In speech-to-text systems, that usually means drug names, policy numbers, product SKUs, and legal terms. You often don't see the mismatch until after launch.
This article gives you a practical framework for large vocabulary speech recognition. You'll learn when runtime vocabulary injection solves the problem, when it stops helping, and when custom model training becomes the better path.
Key Takeaways
Here's what you need to know before choosing a vocabulary customization path:
- Accuracy depends on your domain's out-of-vocabulary (OOV) term density, not a fixed word count.
- Keyterm Prompting on Nova-3 handles up to 100 terms at inference time, but Deepgram recommends staying in the 20–50 term range for reliable results.
- Force-fitting risk increases as keyterm count grows, with no user-accessible tuning parameter to control it.
- Custom model training typically uses 10–30 hours of unannotated audio and often completes in under three weeks.
- Production audio performs worse than benchmark audio, so test customization on real calls, not clean samples.
What Large Vocabulary Speech Recognition Means in Production
In production, large vocabulary speech recognition is about OOV density, not a fixed dictionary size. What matters is how often people say terms the model hasn't learned well.
The Academic Definition vs. the Production Reality
Academic literature often defines LVCSR (large vocabulary continuous speech recognition) as systems handling 20,000+ words. That threshold mattered in the era of HMM-based systems with finite dictionaries. Modern transformer-based models train on billions of tokens. They don't have a hard vocabulary ceiling. The practical constraint has shifted to your domain's OOV term density.
How OOV Terms Spread Errors
OOV terms don't just create isolated transcription errors. They can inflate overall WER beyond what in-vocabulary mistakes alone would cause. Google Research found that reducing vocabulary from 25,000 to 7,000 words increased OOV rate by 8.2 percentage points. Overall WER jumped 7.3 points.
During customization, catastrophic forgetting can degrade representations for words the model previously handled well.
Where General Models Break First
You'll usually see general-purpose ASR models fail first on the terminology categories you care about most. Drug names, alphanumeric identifiers, and specialized proper nouns are common breaking points.
Drug names become phonetically plausible nonsense. Alphanumeric identifiers such as policy numbers, tracking IDs, and member codes get mangled into common English words. Five9 found that alphanumeric speech inputs were the core transcription failure class in contact center self-service. Their healthcare customers saw user authentication rates double after that vocabulary gap was addressed.
Why Standard Models Struggle with Domain Vocabulary
Standard models miss domain vocabulary because they favor terms seen often in broad training data. When audio gets messy, that bias gets stronger.
The Phonetic Substitution Problem
When a model encounters a word outside its training data, it doesn't produce silence. It outputs the closest phonetic match from its learned vocabulary. "Tretinoin" becomes "try to win." "Bacon cheeseburger" becomes "bake in." "Billing department" becomes "building." Each substitution is acoustically reasonable to the model, but wrong for your application. These failures are systematic. The model's probability distribution favors common words over domain terms. If you've debugged transcription errors like these before, you know how maddening the pattern gets.
Long-Tail Terms and Training Data Gaps
Your most critical vocabulary is often your rarest. Medication names, proprietary product codes, and specialized legal phrases appear infrequently in general training corpora. The model has seen "representative" millions of times, but "tretinoin" almost never. That frequency imbalance means your highest-value terms get the lowest recognition accuracy. A Fortune 50 retail pharmacy achieved 92% accuracy on critical pharmacy phrases after domain-specific model customization. Complex medical vocabulary still sat at 85.3%.
Why Noisy Audio Amplifies Vocabulary Failures
Noisy production audio makes vocabulary failures worse. As acoustic evidence weakens, the model relies more on common-language priors instead of your domain terms.
Production audio doesn't sound like benchmark recordings. Studies document a 2.8–5.7× WER degradation from benchmark to production environments. Controlled medical dictation achieves roughly 8.7% WER. Multi-speaker clinical conversations exceed 50% WER. Background noise, overlapping speakers, and low-quality microphones reduce the acoustic signal the model relies on. When clarity drops, the model leans harder on its language prior.
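Because production numbers drift so far from benchmark numbers, you need a baseline WER measurement on your own audio before and after any customization. Here is a minimal sketch of word-level WER via Levenshtein distance; `wer` is a hypothetical helper name, not part of any SDK.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Run it over a reviewed sample of production calls, not clean recordings, so the baseline reflects the 2.8–5.7× degradation described above.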
Runtime Customization for Large Vocabulary Speech Recognition
If you need fast help with a small term set, Keyterm Prompting is the first tool to try. It works best when your list is focused and stable.
How Keyterm Prompting Works at Inference
Keyterm Prompting is built directly inside the Nova-3 model architecture. It's contextual and model-driven. It isn't a post-processing filter or an external logit manipulation layer. When you pass keyterms via the API, the model adjusts decoding behavior to favor those terms when the acoustic evidence supports them. This differs from the legacy Keywords feature, which applied an external weighted boost with a user-adjustable numeric intensifier.
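In practice, "passing keyterms via the API" means repeating a query parameter on the transcription request. The sketch below builds such a URL without calling the network; it assumes the `keyterm` parameter name from Deepgram's Keyterm Prompting documentation, so verify against the current API reference, and `build_listen_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def build_listen_url(keyterms: list[str], model: str = "nova-3") -> str:
    """Build a Deepgram /v1/listen request URL with repeated keyterm parameters.

    Assumes the `keyterm` query parameter documented for Keyterm Prompting
    on Nova-3; check Deepgram's API reference for the current name.
    """
    params = [("model", model)] + [("keyterm", t) for t in keyterms]
    return "https://api.deepgram.com/v1/listen?" + urlencode(params)
```

The resulting URL would then be used with your API key and audio payload in a normal HTTP request.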
What to Include in Your Keyterm List
Keep your keyterm list tight. Include terms the model actually struggles with, not words it already handles well.
Product names, drug names, branded terminology, and domain-specific jargon are ideal candidates. Common English words don't need boosting. Deepgram's documentation recommends keeping your list to the most important 20–50 terms rather than filling the full budget. The API enforces a 500-token limit per request. That's roughly equivalent to 100 terms. If you exceed it, the request returns an error. There's no silent truncation.
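A lightweight pre-flight check can catch an over-budget list before the API rejects it. The token estimate below is a rough heuristic, not Deepgram's tokenizer, and `check_keyterm_budget` is a hypothetical helper; the API itself remains the source of truth and errors on overflow.

```python
def check_keyterm_budget(keyterms, token_limit=500, avg_tokens_per_word=1.3):
    """Rough pre-flight check against the 500-token keyterm budget.

    avg_tokens_per_word is an assumed heuristic, not the model's real
    tokenizer; the API returns an error if the true budget is exceeded.
    """
    words = sum(len(t.split()) for t in keyterms)
    est = int(words * avg_tokens_per_word)
    return {
        "terms": len(keyterms),
        "estimated_tokens": est,
        "within_budget": est <= token_limit,
        "within_recommended_range": len(keyterms) <= 50,  # Deepgram's 20-50 guidance
    }
```

Treat `within_recommended_range` as the signal that matters: staying near the 20–50 term guidance avoids the reliability problems described below, even when the hard budget still has room.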
Real-World Use Cases Where It Delivers
Keyterm Prompting delivers when your vocabulary is bounded and high-value. That's why it fits IVRs, ordering flows, and pharmacy voice agents.
The confidence score improvements are significant for the right use cases. "Nacho" jumps from 0.887, misheard as "macho," to 0.990 with Keyterm Prompting. "Prescription refill" goes from a fragmented "per scription" at 0.79 to "prescription" at 0.98. QSR ordering systems, pharmacy voice agents, and contact center IVRs with bounded product names are strong fits. It gives you instant vocabulary customization without waiting weeks for a retrained model.
When Keyterm Prompting Reaches Its Limits
Keyterm Prompting stops being reliable when your list gets too large or too ambiguous. At that point, term injection becomes harder to manage than the recognition problem itself.
The Ceiling and What Happens Beyond It
The API enforces a hard token budget. If you exceed it, you get an error instead of degraded results. But the practical ceiling is lower than the documented limit. Deepgram's own guidance recommends a much smaller list for reliable operation. Users have reported that larger keyterm counts lead to higher error rates. The model can overfit and force matches on terms that weren't spoken. Splitting keyterms across parallel streams has been suggested as a workaround, but users reported it worsened the problem rather than solving it.
Force-Fitting and Aggressive Context Matching
Force-fitting is a confirmed failure mode. The model transcribes key terms that weren't actually spoken in the audio. Deepgram has acknowledged this as a pathology under active investigation. Because Keyterm Prompting operates inside the model architecture, there's no user-accessible tuning parameter to diagnose or mitigate this behavior. That's a critical difference from the legacy Keywords feature. There, false positives scaled transparently with the intensifier value you set. With Keyterm Prompting, phonetically similar terms are particularly risky.
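Since there's no tuning knob to suppress force-fitting, monitoring is your main defense. One heuristic, sketched below with a hypothetical `flag_possible_force_fits` helper, is to flag keyterms whose per-transcript frequency jumps sharply once prompting is enabled; the ratio threshold is an assumption to tune on your own traffic.

```python
def flag_possible_force_fits(baseline_transcripts, keyterm_transcripts,
                             keyterms, ratio_threshold=3.0):
    """Heuristic monitor: flag keyterms whose per-transcript frequency jumps
    sharply after Keyterm Prompting is enabled. A large jump can indicate
    force-fitting. ratio_threshold is an assumed starting point, not a
    documented value."""
    def rate(transcripts, term):
        hits = sum(t.lower().count(term.lower()) for t in transcripts)
        return hits / max(len(transcripts), 1)

    flags = []
    for term in keyterms:
        before = rate(baseline_transcripts, term)
        after = rate(keyterm_transcripts, term)
        # The 0.05 floor keeps terms absent from the baseline from passing trivially.
        if after > max(before, 0.05) * ratio_threshold:
            flags.append(term)
    return flags
```

Flagged terms deserve a spot-check against audio: if the term keeps appearing where it was never spoken, shrink the list or escalate to custom training.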
Vocabulary Scenarios That Require a Different Path
Three scenarios signal that you've outgrown runtime injection. First, your domain vocabulary exceeds that practical ceiling. Second, you're seeing force-fitting errors and can't reduce your list further without losing critical coverage. Third, your terminology changes often enough that managing per-request keyterm lists becomes an operational burden. In each case, the official guidance is clear: contact Deepgram to discuss custom model training on the Enterprise plan.
Custom Model Training: The Path Beyond Runtime Injection
When runtime injection no longer covers your vocabulary, custom model training becomes the stronger path. It takes more time, but it handles broader terminology with deeper customization.
What Custom Training Adds That Keyterm Prompting Cannot
Custom training embeds your domain vocabulary into the model's learned representations. The model doesn't just bias toward your terms at inference. It treats them as first-class vocabulary. This removes the large-list force-fitting risk of runtime injection and handles broader terminology. Deepgram's custom training can improve accuracy by 5% to 15%. In many cases, it brings WER below 5%.
Data Requirements and Realistic Timelines
The data requirement is lower than most teams expect. Deepgram's Model Improvement Partnership Program (MIPP) often uses roughly 10–30 hours of unannotated customer audio. Deepgram handles transcription and annotation, so the labeling burden doesn't fall on your team. Training typically completes in under three weeks, though complexity and review cycles can extend that window. You also provide a written list of critical keywords with correct spellings.
WER Improvement Ranges by Domain
Custom training can produce large gains, but results vary by domain and baseline accuracy. It also carries trade-offs if specialization affects general vocabulary performance.
Healthcare domains have shown up to 99% relative WER improvement in research using synthetic data augmentation. Keep this caveat in mind: domain-specific training carries a catastrophic forgetting risk. A Whisper model with 2.7% baseline WER degraded to over 20% WER after medical model customization. Mitigation techniques like experience replay keep out-of-domain degradation under 10% while preserving 40–65% of specialization gains. Deepgram's managed training pipeline handles these trade-offs so you don't have to manage them alone.
Matching Your Vocabulary Strategy to Your Deployment
Choose your path based on term count, domain specificity, and delivery timeline. In most cases, that means starting with Keyterm Prompting and escalating only when your vocabulary or failure modes demand it.
The Decision Framework at a Glance
If you have a small domain term set and need results today, Keyterm Prompting is your starting point. You'll see immediate gains with zero retraining. If your vocabulary approaches the documented limit, proceed with caution. Monitor for force-fitting and be prepared to escalate. If your vocabulary exceeds runtime injection capacity, or you see force-fitting at lower counts, custom model training through the MIPP is the documented path forward.
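The framework above reduces to a simple rule. This sketch encodes it with thresholds taken from the guidance in this article (20–50 terms recommended, roughly 100-term ceiling); `choose_vocabulary_path` is a hypothetical helper, not a Deepgram API.

```python
def choose_vocabulary_path(term_count, force_fitting_observed=False,
                           list_changes_often=False,
                           recommended_max=50, hard_max=100):
    """Map the decision framework to a recommendation. Thresholds follow
    the article's guidance: 20-50 terms recommended, ~100-term ceiling."""
    if force_fitting_observed or term_count > hard_max or list_changes_often:
        return "custom model training (contact Deepgram / MIPP)"
    if term_count > recommended_max:
        return "keyterm prompting with caution: monitor for force-fitting"
    return "keyterm prompting"
```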
Getting Started with Vocabulary Customization
Start with a transcript audit before you choose a tool. Your first goal is to measure OOV density and identify the terms that truly matter.
Pull a sample of production transcripts and count the misrecognized terms. If the list is small, Keyterm Prompting gets you to production quickly. If the list keeps growing, that's your signal to begin a custom training conversation. Either way, you'll need a baseline WER measurement to quantify improvement. Test with real production audio, not clean recordings.
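The audit itself can start very simply: given a reviewed sample with human-corrected references, count how often each domain term was dropped or mangled. The sketch below uses naive substring matching and a hypothetical `audit_oov_misses` helper; a real audit would normalize punctuation and inflections.

```python
from collections import Counter

def audit_oov_misses(samples, domain_terms):
    """samples: list of (reference, hypothesis) pairs from a reviewed set.
    Counts, per term, how often the term appears in the human reference
    but not in the model's hypothesis. Naive substring matching."""
    misses = Counter()
    for ref, hyp in samples:
        ref_l, hyp_l = ref.lower(), hyp.lower()
        for term in domain_terms:
            t = term.lower()
            if t in ref_l and t not in hyp_l:
                misses[t] += 1
    return misses
```

Divide each miss count by the number of references containing the term and you have per-term OOV miss rates, which tells you whether the problem fits in a short keyterm list.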
Why This Matters for Large Vocabulary Speech Recognition
Large vocabulary speech recognition isn't a contest to stuff the biggest possible term list into one request. It's a deployment problem. You need the right path for your error pattern, your audio conditions, and your operating constraints.
If your misses cluster around a short list of high-value terms, runtime prompting can help fast. If the misses keep spreading across product names, codes, and specialized phrases, custom training becomes the cleaner long-term fix.
Start Transcribing Today
If you want to evaluate large vocabulary speech recognition on your own audio, start with real production samples. That's where you'll see whether a short keyterm list is enough or whether you need deeper model customization.
What to Test First
Pull calls or recordings that contain the terms your application cares about most. Then compare the baseline transcript against a run with Keyterm Prompting. Watch for two things: corrected domain terms and any new force-fitting errors.
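Those two checks can be automated for each test call. The sketch below compares a baseline run and a Keyterm Prompting run against a human reference; `compare_runs` is a hypothetical helper using naive substring matching.

```python
def compare_runs(reference, baseline, keyterm_run, domain_terms):
    """Compare a baseline transcript and a Keyterm Prompting transcript
    against a human reference. Reports domain terms the keyterm run fixed,
    and terms it introduced that the reference never contains (possible
    force-fits). Naive substring matching for illustration only."""
    ref, base, key = reference.lower(), baseline.lower(), keyterm_run.lower()
    corrected = [t for t in domain_terms
                 if t.lower() in ref and t.lower() not in base and t.lower() in key]
    force_fits = [t for t in domain_terms
                  if t.lower() in key and t.lower() not in ref]
    return {"corrected": corrected, "possible_force_fits": force_fits}
```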
Next Steps
Deepgram offers $200 in free credits to new accounts. You can test Keyterm Prompting against your domain audio before committing to any architecture decision. Try it free and see how your terminology performs with runtime vocabulary customization on real-world audio.
FAQ
What Is the Difference Between Keyterm Prompting and Keyword Boosting in Deepgram?
Keyterm Prompting is a Nova-3 native feature that works contextually inside the model. The legacy Keywords feature applies an external numeric boost to logit scores on Nova-2 and earlier models. The Keywords feature affects only OOV terms and processes multi-word phrases word by word. Keyterm Prompting supports both in-vocabulary and OOV terms and handles phrases as cohesive units.
How Many Labeled Audio Hours Does Deepgram's Custom Model Training Require?
Deepgram's MIPP accepts unannotated customer audio and handles transcription and annotation internally. The MVP data tier starts as low as 1–4 hours for initial domain customization, with diminishing returns observed beyond approximately 860 hours.
Does Large Vocabulary Size Increase Transcription Latency?
Keyterm Prompting is implemented inside the Nova-3 model itself, not as a post-processing filter. Because it operates within the model architecture, it doesn't add a separate processing step that would increase latency. Custom-trained models bake domain vocabulary into the model weights, so inference runs at the same speed as the base model regardless of vocabulary size.
When Should You Move from Keyterm Prompting to Custom Training?
You should move when your vocabulary exceeds the practical keyterm ceiling, when force-fitting appears, or when list management becomes an operational burden. Those are the clearest signs that runtime injection is no longer the right tool.
What Happens When the Same Phonetic Sequence Maps to Multiple Domain Terms?
Deepgram has documented this as a known limitation. Phonetically similar terms—like competing drug names or similar proper nouns—may require careful list curation. Using multi-word phrase context can help the model disambiguate.

