
By Brad Nikkel
AI Content Fellow
Fine-Tuning STT Models (When Adaptation Fails)
Fine-tuning adapts a pre-trained STT model to your audio. It demands some compute and expertise (much less than building models from scratch) but can be worth the investment because fine-tuning can substantially boost accuracy when your use case diverges from a base model's pretraining data.
Data Requirements for Fine-Tuning
To fine-tune STT models, you need audio with corresponding transcripts. High-quality transcripts matter more than raw quantity, as you'll see in a bit, but creating such transcripts takes time. Careful manual transcription can consume 2-3 times the actual audio duration. Semi-automated transcription, where a machine transcribes a rough draft and a human reviews it, might accelerate the labeling process, but humans are still a bottleneck.
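If you go the semi-automated route, the drafting step is easy to script. Below is a minimal sketch that uses Hugging Face's transformers pipeline to produce draft transcripts for human review; the model choice, folder names, and the short 16 kHz WAV assumption are illustrative, not requirements.

```python
# A minimal sketch of semi-automated labeling: an off-the-shelf model drafts
# transcripts, and a human reviewer corrects them afterward.
# Model choice, folder names, and the 16 kHz WAV assumption are illustrative.
from pathlib import Path
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

audio_dir = Path("raw_audio")          # hypothetical folder of short 16 kHz WAV clips
draft_dir = Path("draft_transcripts")
draft_dir.mkdir(exist_ok=True)

for clip in sorted(audio_dir.glob("*.wav")):
    draft = asr(str(clip))["text"]                      # machine-generated rough draft
    (draft_dir / f"{clip.stem}.txt").write_text(draft)  # hand these off for review
```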
You might wonder, then, how much labeled audio you actually need to fine-tune an STT model. Surprisingly, not much.
Baevski et al. demonstrated this twice in quick succession. In one study, they pretrained on 960 hours of unlabeled LibriSpeech audio. Then, by fine-tuning on a measly 10 minutes of labeled audio, their model achieved 16.3 WER on LibriSpeech test-clean (clear speech) and 25.2 WER on LibriSpeech test-other (noisier, accented speech). When they fine-tuned on a modest 10 hours of labeled audio, their model matched the best 100-hour systems (at that time) on LibriSpeech test-clean and reduced LibriSpeech test-other WER by around 25 percent.
A few months later, the same team took the idea further with wav2vec 2.0. After pretraining on 53,000 hours of unlabeled audio (roughly 55 times more than their first study), they fine-tuned with just 10 minutes of labels and achieved 4.8 WER on LibriSpeech test-clean and 8.2 WER on LibriSpeech test-other—far better than their earlier results. With only one hour of labeled audio, their model surpassed prior models trained on 100 hours of audio.
Data Requirements Are Minimal
The takeaway from these studies? Just 10 minutes of high-quality transcribed audio can yield meaningful improvements. That means, data-wise, any developer can transcribe 10 minutes of their own audio (roughly 20 to 30 minutes of effort), fine-tune a model, test the results, and, if those results look promising, use them to make the case for investing in more labeled audio.
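Testing the results doesn't require much tooling, either. A quick before-and-after WER comparison on a small held-out set is usually enough to decide whether the fine-tune is worth expanding. The sketch below uses the open-source jiwer package; the transcripts are placeholders for your own data.

```python
# A minimal sketch of checking whether a fine-tune helped: compare WER on a
# held-out set before and after. The transcript lists are placeholders.
import jiwer

references = [
    "patient reports shortness of breath",
    "administer two hundred milligrams of ibuprofen",
]
baseline_hypotheses = [
    "patient reports shortness of breath",
    "administer two hundred milligrams of i profen",
]
finetuned_hypotheses = [
    "patient reports shortness of breath",
    "administer two hundred milligrams of ibuprofen",
]

print("baseline WER:  ", jiwer.wer(references, baseline_hypotheses))
print("fine-tuned WER:", jiwer.wer(references, finetuned_hypotheses))
```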
But traditionally, despite minimal data requirements, fine-tuning large STT models has demanded significant computational resources, including multiple high-end GPUs and extended training times stretching from days to weeks. This put fine-tuning large models out of reach for many development teams.
LoRA: Accessible Fine-Tuning
Fortunately, parameter-efficient methods like Low-Rank Adaptation (LoRA) have widened the playing field. LoRA freezes the base network and trains only small low-rank matrices, cutting trainable parameters by roughly 90 percent while retaining most of full fine-tuning's accuracy gains.
Here are LoRA's benefits:
- Memory: LoRA reduces GPU requirements by about 3 times, making Whisper fine-tuning possible on consumer GPUs with 8 GB VRAM
- Speed: AWS's managed LoRA fine-tuning completes in 6-8 hours for 12-hour datasets rather than days or weeks
- Storage: LoRA checkpoints are tiny compared to the original model size, making deployment and versioning much simpler
- Accuracy: Research on Whisper fine-tuning shows LoRA achieves WERs within a few percentage points of full fine-tuning
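In practice, applying LoRA to Whisper takes only a few lines with Hugging Face's peft library. The sketch below shows the adapter setup only; the model size, rank, and target modules are illustrative choices, and the data pipeline, training arguments, and training loop are omitted.

```python
# A minimal sketch of wrapping Whisper with LoRA adapters via Hugging Face's
# peft library. Rank, alpha, and target modules are illustrative assumptions;
# data loading, collation, and the training loop itself are omitted.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update matrices
    lora_alpha=64,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections inside Whisper
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)  # freezes the base model, adds adapters
model.print_trainable_parameters()          # shows how few weights are actually trained
```

Because only the adapter weights are saved, the resulting checkpoint is typically tens of megabytes rather than gigabytes, which is what makes the storage and versioning story above so simple.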
Improvements You Can Expect from Fine-Tuning
If you opt for fine-tuning, manage your expectations and weigh the extra training and operational costs against the potential gains. Results will vary with your data, label quality, and evaluation protocol, but here are some reported outcomes:
- Domain adaptation: Liao et al. achieved around 17 to 33 percent relative WER reduction on domain-specific tasks by fine-tuning models to use domain context supplied as text descriptions. They trained Whisper to interpret simple hints like "medical conversation" or "air traffic control" so it could better handle specialized vocabulary and topics (a small inference-time sketch of this kind of hint follows this list).
- Accent adaptation: Afonja et al. fine-tuned on accented speech from their target population, yielding a 25 to 34 percent relative WER reduction on domain-specific terminology. They trained Whisper on African-accented clinical speech, where accurately transcribing drug names and diagnoses matters because errors could be dangerous.
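To get a feel for the domain-hint idea, note that Hugging Face's Whisper implementation already accepts a text prompt at inference. Liao et al.'s improvements come from fine-tuning the model to exploit such hints, so prompting alone won't reproduce their numbers, but it's a cheap way to experiment. The audio placeholder and the hint text below are assumptions.

```python
# A minimal sketch of passing a domain hint to Whisper at inference time with
# Hugging Face transformers. Liao et al. fine-tune the model to use hints like
# this; here we only show the prompting mechanism. Audio and hint are placeholders.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio = np.zeros(16000 * 5, dtype=np.float32)  # placeholder: 5 s of silence; use a real clip
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
prompt_ids = processor.get_prompt_ids("medical conversation", return_tensors="pt")

with torch.no_grad():
    generated = model.generate(inputs.input_features, prompt_ids=prompt_ids)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```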
As powerful as it is, fine-tuning isn't a one-off task; you'll need to revisit it as your domain evolves.
Maintaining Performance Over Time
Language evolves; new slang, product names, and acronyms appear; audio quality, microphones, accents, and user behavior shift. All of these factors can cause "model drift," a decline in performance over time. In fact, Vela et al. found that 91 percent of AI models experience significant performance degradation within a year of deployment. So if you fine-tune a model and then do nothing else, expect its accuracy to erode year after year.
You're better off thinking of fine-tuning as a cycle of retraining. If your application allows for user corrections, for example, create feedback loops. Every user correction is potential training data, so capture corrections systematically and look for patterns. Repeated, localized errors (e.g., on specific terms) might trigger vocabulary updates, while more systematic errors might suggest that further fine-tuning or a model change is in order. Automated feedback loops like this can transform your STT system from a static tool into one that improves over time.
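Here's one way that capture step might look in code. It naively aligns the ASR output with the user's edit word by word and counts substitutions; a production system would use an edit-distance alignment and persistent storage, and every name and threshold here is hypothetical.

```python
# A minimal sketch of a correction feedback loop: log each user edit, then
# surface terms that get corrected repeatedly as candidates for keyword
# boosting or the next fine-tuning round. Names and thresholds are illustrative.
from collections import Counter

correction_counts = Counter()

def record_correction(asr_output: str, user_corrected: str) -> None:
    """Log word-level substitutions between the ASR output and the user's edit.

    Naive positional alignment; a real system would align with edit distance.
    """
    for asr_word, fixed_word in zip(asr_output.lower().split(),
                                    user_corrected.lower().split()):
        if asr_word != fixed_word:
            correction_counts[(asr_word, fixed_word)] += 1

def terms_needing_attention(min_count: int = 5):
    """Return substitutions seen often enough to justify a vocabulary update."""
    return [pair for pair, count in correction_counts.items() if count >= min_count]

# Example: the model keeps mishearing a drug name
record_correction("administer i profen", "administer ibuprofen")
```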
Your Next Move
Now you can see that successful enterprise STT isn't about finding some mythical “best” model. It’s more about selecting and adapting a model for your specific application and domain, which involves many considerations. Start by calculating baseline metrics on your domain's audio, try keyword boosting (or similar adaptations) first, then move to fine-tuning if needed. Even modest accuracy gains can compound: a two percent WER reduction, for example, might save hundreds of review hours, unlock new use cases, or catch critical information that would have otherwise been missed.
Looking for a simple way to start? Tune for success with Deepgram's adaptive STT model options, which handle most common enterprise needs without the complexity of custom training. Or, if you have more specialized requirements, like fine-tuning a custom model, ask about our Enterprise Plan!



