What Developers Should Know About Model Selection, Adaptation, and Tuning for Enterprise Speech Data (Part 1)


Many enterprises generate hours of important audio data daily: customer interactions that determine retention, medical consultations that impact patient outcomes, and field reports that drive operational decisions, just to name a few. While such audio is a gold mine of valuable business intelligence and compliance evidence, there's a problem. In specialized domains, overall error rates can jump to 20–30 percent, and critical utterances in domain-dense segments can fare much worse. A psychotherapy clinical study, for example, found a 34 percent Word Error Rate (WER) in transcribed harm-related sentences (a topic that needs high accuracy) despite a 25 percent overall WER.
A speech-to-text model that's generally accurate but trips over niche, domain-specific vocabulary is, at a minimum, not very useful, since it effectively obfuscates a domain’s most fruitful insights. At worst, if a developer deploys a model under the assumption that it's more accurate at transcribing domain-specific terminology than it really is, that model could produce dangerous outcomes.
The problem isn't that modern speech AI is inadequate. Nova-3, Whisper, and their peers handle broad language so well that their capabilities would have seemed futuristic a decade ago. But because they trained on widely available audio like podcasts, YouTube videos, and audiobooks, they encounter domain-specialized vocabulary far less often than everyday vocabulary. This means that niche terms (e.g., "myocardial infarction," "equity swap," "prima facie") are more likely to be missed by STT base models.
So, what's a developer to do?
Thankfully, there's hope. Adapting STT models with domain-specific data, for example, can improve performance by 5–7 percent. Fine-tuning a pretrained model is another option for closing the gap between general speech recognition and long-tail, domain-specific vocabulary. Both routes raise the same challenge for developers, though: how exactly do you select and then adapt or fine-tune models to accurately transcribe your enterprise's unique language?
Enterprise Speech AI Metrics
Selecting the STT model best suited for your company's audio requires understanding how different metrics reveal performance gaps that headline WER numbers miss. Here are some key STT metrics that separate marketing claims from reality (if you're already familiar with STT performance metrics, skip ahead).
Word Error Rate (WER)
WER is one of the most widely reported metrics across academic and industry benchmarks. It's calculated as
WER = (Substitutions + Deletions + Insertions) ÷ Total Words
WER ranges are a decent proxy for model performance, but "acceptable" WERs vary by domain risk, so WER is often paired with domain-term metrics (e.g., keyword recall rate). For some context on WER, humans transcribe well-studied telephone conversations (Switchboard and CallHome English datasets) with around 5 to 7 percent WER. So you might, depending on the application, not accept STT models with higher WER than that.
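If you want to compute WER yourself rather than pull in a toolkit, here's a minimal sketch using standard word-level Levenshtein dynamic programming (the example sentences are made up):

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Minimum insertions, deletions, and substitutions to turn ref into hyp."""
    dp = [[0] * (len(hyp_tokens) + 1) for _ in range(len(ref_tokens) + 1)]
    for i in range(len(ref_tokens) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(hyp_tokens) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(ref_tokens) + 1):
        for j in range(1, len(hyp_tokens) + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[-1][-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

# ~0.67: "myocardial infarction" fragments into four wrong words
print(wer("patient shows signs of myocardial infarction",
          "patient shows signs of my cardial in fraction"))
```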
These are just guesses, but you might target STT WERs that look something like this:
< 5% WER: Potentially acceptable for high-stakes contexts like medical, legal, and financial applications where human review is included after automated transcriptions
5–10% WER: Perhaps acceptable for non-critical applications (e.g., transcribing podcasts)
> 10% WER: Probably requires human review
Keyword Recall Rate (KRR)
Another common STT accuracy metric is the Keyword Recall Rate (KRR), which measures how well a model handles specific terms. It's calculated as
KRR = Correctly Transcribed Domain Terms ÷ Total Domain Terms
Here's what KRR looks like. Suppose you're running a logistics company that tracks the phrase "delivery confirmed" (in practice you'd track many phrases). If your warehouse audio contains 100 occurrences of "delivery confirmed" and your STT model transcribes only 90 of them correctly, your KRR is 90 percent.
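Here's a rough, alignment-free sketch of estimating KRR by counting keyword occurrences in the reference versus the model's output (a real evaluation would align the transcripts word by word; the keywords and transcripts below are made up):

```python
def keyword_recall_rate(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """Rough KRR estimate: credit each keyword up to its count in the reference."""
    ref, hyp = reference.lower(), hypothesis.lower()
    total = credited = 0
    for kw in keywords:
        kw = kw.lower()
        ref_count = ref.count(kw)
        hyp_count = hyp.count(kw)
        total += ref_count
        credited += min(ref_count, hyp_count)   # can't credit more than actually occurred
    return credited / max(total, 1)

reference = "delivery confirmed at dock four. delivery confirmed at dock nine."
hypothesis = "delivery confirmed at dock four. delivery concerned at dock nine."
print(keyword_recall_rate(reference, hypothesis, ["delivery confirmed"]))  # 0.5
```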
For some applications, high KRR (greater than 90 percent) might be more important than low overall WER because overall WER can mask frequent errors on business-critical terms. This is why some recent research directly evaluates term or named entity recall (similar to KRR). In accented clinical speech, for example, Afonja et al. computed medical named-entity recall (drug names, diagnoses, labs, etc.) and found that entity recall lagged even when overall WER was low.
Character Error Rate (CER)
Character Error Rate (CER) is calculated just like WER, except at the character level instead of the word level:
CER = (Insertions + Deletions + Substitutions) ÷ Total Characters
CER is especially useful in technical domains where a single character can change meaning, with potentially disastrous consequences (e.g., “mg” vs. “mcg” in prescriptions).
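Because the arithmetic is identical to WER, a character-level sketch can reuse the edit_distance helper from the WER example above; only the tokenization changes:

```python
def cer(reference: str, hypothesis: str) -> float:
    # Assumes the edit_distance helper sketched in the WER example above.
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

# One deleted character ("mcg" -> "mg"), but a tenfold dosage difference
print(cer("administer 50 mcg", "administer 50 mg"))  # ~0.06
```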
Real-Time Factor (RTF)
Accuracy is important, but many applications need to optimize for speed too, which is what Real-Time Factor (RTF) measures: processing speed relative to audio length. It's calculated as
RTF = Time to Process Audio ÷ Total Audio Duration
For example, a model that transcribes two minutes of audio in a minute would have an RTF of 0.5.
You want an RTF well below 1.0 for interactive or streaming use cases. A lower RTF (e.g., 0.3–0.5) is preferable because transcription must leave time for audio capture, codec encoding and packetization, network transit, post-processing, and any downstream actions; otherwise, the experience won't feel responsive.
For batched transcription, RTF represents throughput: at RTF = 0.5, the real elapsed time to finish transcribing is half the audio length (e.g., 60 minutes of audio completes in 30 minutes).
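Measuring RTF is just a matter of timing your transcription against the audio's duration. Here's a minimal sketch for a WAV file; transcribe is a placeholder for whatever STT call you're benchmarking, not a real API:

```python
import time
import wave

def real_time_factor(wav_path: str, transcribe) -> float:
    # Audio duration from the WAV header: frames / sample rate.
    with wave.open(wav_path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    start = time.perf_counter()
    transcribe(wav_path)                 # your STT call (local model or API) goes here
    elapsed = time.perf_counter() - start

    return elapsed / duration            # RTF = processing time / audio duration

# e.g., real_time_factor("support_call.wav", my_model.transcribe)
```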
Confidence Calibration
Many STT models assign confidence scores to transcription segments, which can be informative signals if they're accurate. Confidence calibration is a post-decoding adjustment that aligns predicted confidence with observed accuracy, so that a word the model transcribed at 80 percent confidence, for example, is actually correct about 80 percent of the time.
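A simple way to check calibration is a reliability table: bucket words by predicted confidence and compare each bucket's average confidence to its observed accuracy against your gold-standard transcripts. Here's a sketch with made-up (confidence, correct) pairs:

```python
from collections import defaultdict

# Hypothetical per-word results: (model confidence, was the word correct?)
results = [(0.95, True), (0.92, True), (0.88, False), (0.81, True),
           (0.79, True), (0.62, False), (0.55, True), (0.43, False)]

buckets = defaultdict(list)
for confidence, correct in results:
    buckets[round(confidence, 1)].append(correct)   # bucket by first decimal

for bucket in sorted(buckets):
    hits = buckets[bucket]
    accuracy = sum(hits) / len(hits)
    # A well-calibrated model keeps observed accuracy close to the bucket's confidence.
    print(f"confidence ~{bucket:.1f}: observed accuracy {accuracy:.2f} over {len(hits)} words")
```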
Analyze Enterprise Audio
Before you can select a model and measure anything, though, you need to analyze your audio’s actual characteristics without assuming anything about it.
To do this, collect representative samples from disparate audio sources in your organization. Selecting only clean, executive presentation audio, for example, wouldn't help you determine the model best suited for your enterprise if most of your audio tends to be noisier, multi-speaker hallway conversations. So include all types of audio that your downstream systems currently ingest and all the types you think they might eventually ingest. For example, this might include:
Conference calls of varying connection quality
Field recordings filled with background machinery humming
Customer service calls with compressed phone audio
International meetings rife with diverse accents
Legacy recordings with atrocious audio quality
After you scrounge together a speech corpus, hire humans to carefully transcribe it all, developing a gold standard test set. With that, you can test candidate models on metrics like WER, CER, and RTF. But enterprise-critical metrics like KRR (Keyword Recall Rate) require knowing which terms matter most to your business.
How do you know what terms these are? One approach is asking domain experts to handpick critical terms (e.g., pharmaceutical names in medicine, chemical compounds in manufacturing, product codes in retail, etc.). A more scalable method is Term Frequency-Inverse Document Frequency (TF-IDF), which automatically discovers terms that appear frequently in your domain but rarely elsewhere. You can apply TF-IDF to your speech transcripts plus any domain-relevant text from documentation, emails, and other sources.
Python libraries like scikit-learn's TfidfVectorizer simplify this. TfidfVectorizer's main inputs are your corpus (transcripts plus additional documents) and optional preprocessing settings like stopword removal or n-grams. Its output is a sparse matrix of scores you can rank to see which terms or phrases are disproportionately frequent in your domain. These are the terms and phrases most likely to matter to your business and to be underrepresented in general STT training data. Because STT models train on large, general datasets, they learn common "stopwords" (e.g., "the," "and," "is") very well, so filter these out; otherwise they crowd out what you're after: the domain-specific phrases that candidate STT models might not have encountered often during training.
As a hypothetical example, take financial audio transcripts with stopwords removed (otherwise you'd likely see "the," "and," "is," etc. near the top). Terms with higher TF-IDF scores are relatively distinctive to your corpus: they appear frequently within some documents and are rarer across the overall corpus (TF pushes them up, and IDF keeps ubiquitous phrases down).
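Here's a minimal sketch using scikit-learn's TfidfVectorizer (a recent version with get_feature_names_out) on a few made-up financial snippets standing in for real transcripts; the documents and output are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up snippets standing in for your real transcripts and documents.
docs = [
    "the counterparty requested a two basis point adjustment on the equity swap",
    "we reviewed collateral requirements for the equity swap before settlement",
    "the earnings call covered net interest margin and further basis point moves",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)        # sparse matrix: documents x terms

# Rank each term by its highest TF-IDF score in any document.
scores = tfidf.toarray().max(axis=0)
terms = vectorizer.get_feature_names_out()
for term, score in sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{term:25s} {score:.3f}")
```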
The higher-ranking terms are domain-specific vocabulary worth testing in your STT evaluations.
Using your domain’s key vocabulary that you identified via TF-IDF analysis, expert input, or both, you can now test candidate models on KRR and related metrics before any adaptation or tuning. You’ll want these baselines so you can measure later gains.
For a candidate open-weights model, you can also cross-check your key domain vocabulary against that model's tokenizer. STT models use subword tokenization (e.g., Whisper uses Byte-Pair Encoding), so you can tokenize the text of your enterprise's most important phrases and see whether they map to a handful of tokens or fragment heavily. Heavy fragmentation often correlates with poorer recognition and signals that you may need more adaptation. Your domain's key vocabulary also becomes valuable later on, since you can use it for adaptation techniques like keyword boosting.
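Here's a sketch assuming Hugging Face's transformers library and its WhisperTokenizer; the domain terms are placeholders for whatever your TF-IDF analysis or experts surfaced:

```python
from transformers import WhisperTokenizer

# Placeholder domain terms; substitute the vocabulary you identified earlier.
domain_terms = ["myocardial infarction", "equity swap", "prima facie"]

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

for term in domain_terms:
    pieces = tokenizer.tokenize(term)
    # Heavy fragmentation (many subword pieces per word) is a rough signal
    # that the model saw this term rarely during training.
    print(f"{term!r} -> {len(pieces)} subword tokens: {pieces}")
```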
The next step is picking some models to test. Learn how in Part 2!