
By Julia Strout
Deep Learning Engineer
Keyterm Boosting Background: Why We Learn
I’m going to let you in on a secret: at Deepgram, keyterms are part of our models’ DNA.
In STT systems, keyterm boosting refers to the ability to provide the system with a set of words that are known to the user a priori to be likely to occur in the audio, increasing the likelihood that these terms are transcribed correctly. Keyterm boosting is frequently used to improve accuracy on proper nouns, which are unlikely to have appeared in the model’s training data, and may exhibit non-standard spelling or capitalization.
A common approach to keyterm boosting is to modify the outputs of a standard STT model. For instance, during inference, a decoding algorithm like beam search generates an N-best list of hypotheses, and hypotheses containing the given keyterms are then boosted. Classical approaches use manually designed rules for what to boost, but boosting can also be done by a separate “re-ranker” model trained to re-order the top N STT hypotheses based on the audio and keyterms (or a multimodal LLM instructed to do so). An advantage of these approaches is that they explicitly make use of acoustic information, but they can also be costly, since you need to generate N completions for each input and may be running a secondary model.
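To make the classical, rule-based flavor of this concrete, here is a minimal sketch of N-best re-ranking with a fixed keyterm bonus. The scoring scheme, boost value, and hypotheses are illustrative assumptions, not how any particular production system (Deepgram’s included) works.

```python
# Hypothetical sketch of rule-based N-best re-ranking with a keyterm bonus.
# The boost value and hypothesis scores are illustrative, not from any real system.

def rerank_with_keyterms(nbest, keyterms, boost=2.0):
    """Re-rank (text, log_score) hypotheses, rewarding keyterm occurrences."""
    def boosted_score(hypothesis):
        text, log_score = hypothesis
        words = text.lower().split()
        # Add a fixed bonus for every keyterm that appears in the hypothesis.
        bonus = sum(boost for kt in keyterms if kt.lower() in words)
        return log_score + bonus
    return sorted(nbest, key=boosted_score, reverse=True)

nbest = [
    ("thanks for calling deep gram support", -4.1),
    ("thanks for calling deepgram support", -4.6),
]
print(rerank_with_keyterms(nbest, ["deepgram"])[0][0])
# -> "thanks for calling deepgram support"
```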
If you want an approach based purely on the top STT hypothesis, you can employ text-based post-processing. Some systems allow users to provide custom dictionaries that use rules to rewrite expected system outputs to desired keyterms. E.g., if we observe the model tends to output “deep gram” when we say our company name, we might provide the mapping {"deep gram": "Deepgram"}. An alternative approach, which offers higher recall at the cost of precision, is to replace any words in the transcript that are a “fuzzy match” for those in the list of keyterms. For even more sophistication, a text-based model, such as an LLM, can be trained or instructed to identify possible keyterm replacements based on linguistic context.
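As a rough illustration of the dictionary-rewrite and fuzzy-match styles of post-processing, the sketch below combines an exact phrase rewrite with word-level fuzzy replacement via Python’s difflib. The rewrite table, keyterm list, and cutoff are made-up values for illustration; real systems expose these as user configuration.

```python
import difflib

# Hypothetical post-processing sketch: an exact dictionary rewrite followed by word-level
# fuzzy matching against a keyterm list. The rewrite table, keyterms, and cutoff are
# illustrative assumptions, not defaults of any real product.

REWRITES = {"deep gram": "Deepgram"}        # expected output -> desired keyterm
KEYTERMS = ["Deepgram", "Nova-3", "Flux"]

def post_process(transcript, fuzzy_cutoff=0.8):
    # 1. Exact, rule-based rewrites (high precision).
    for src, dst in REWRITES.items():
        transcript = transcript.replace(src, dst)
    # 2. Fuzzy replacement of individual words (higher recall, lower precision).
    lowered = {kt.lower(): kt for kt in KEYTERMS}
    out = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), list(lowered), n=1, cutoff=fuzzy_cutoff)
        out.append(lowered[match[0]] if match else word)
    return " ".join(out)

print(post_process("ask deep gram about flux and deepgran"))
# -> "ask Deepgram about Flux and Deepgram"
```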
These post-hoc approaches are good for providers because of their ease and modularity: you don’t have to change your STT training pipeline, and the boosting procedure can be tweaked or trained independently, notably on different data from that used to train the STT model. They also in principle generalize to an arbitrary number of keyterms, though in practice compute constraints will enforce a limit. But they generally push additional complexity onto customers by forcing them to manually configure boost factors, mappings, or fuzzy-match distance thresholds. And they are generally performance limited, in terms of both accuracy and latency.
By comparison, the latest Deepgram STT (Nova-3 and Flux) models learn how to boost keyterms as part of training; they take in not only audio but also keyterms, and use both to generate a high-quality transcript. Mathematically, we model

P(w_{t+1} | a_{0..T}, k_{0..n}, w_{0..t}),
i.e., the probability of the next word in the transcript conditioned on the audio a_{0..T}, keyterms k_{0..n} and prior words w_{0..t}. As always, allowing the model to learn the right behavior from data results in better performance relative to manually engineering behavior. Also, it makes life much easier for customers; they only have to provide the list of keyterms, and the model does the rest.
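As a loose illustration of what it means to condition next-token prediction on audio, keyterms, and prior words, here is a toy decoder step in PyTorch. This is only a generic sketch of the equation above; the class name, layer choices, and sizes are assumptions, and Deepgram’s actual architectures are not described here.

```python
import torch
import torch.nn as nn

# Toy sketch of a decoder step conditioned on audio, keyterms, and prior tokens,
# i.e., one way to realize P(w_{t+1} | a_{0..T}, k_{0..n}, w_{0..t}).
# Layer choices and sizes are illustrative assumptions only.

class KeytermConditionedDecoder(nn.Module):
    def __init__(self, vocab_size=8192, d_model=256, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.keyterm_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prior_tokens, audio_feats, keyterm_feats):
        # prior_tokens: (B, t); audio_feats: (B, T, d); keyterm_feats: (B, n, d).
        # Self-attention over prior tokens and causal masking are omitted for brevity.
        h = self.token_emb(prior_tokens)
        h, _ = self.audio_attn(h, audio_feats, audio_feats)          # attend to audio
        h, _ = self.keyterm_attn(h, keyterm_feats, keyterm_feats)    # attend to keyterms
        return self.out(h[:, -1])   # logits over the vocabulary for the next word w_{t+1}
```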
We’ve learned an interesting thing during our research into keyterm-boosting models: there’s more than one way to key a term! Specifically, we have found that there are many ways to successfully add keyterm boosting to an STT model, but that they offer different advantages and disadvantages, making some more suitable to certain settings than others. In the rest of this post, we will provide some examples, and then describe how we evolved our keyterm boosting approach from Nova-3 to Flux in order to optimize for the constraints imposed by realtime streaming.
Keyterm Tradeoffs
As mentioned above, we have discovered there is no “universally correct” way to incorporate keyterms into an STT model. Said another way, there is a large space of models that represent the equation above. These different choices can be thought of as different inductive biases regarding how keyterm inputs should be represented to the model, and they offer different strengths and weaknesses.
For example, one approach, conceptually similar to how prompts are fed to LLMs, is to combine the keyterm input stream with the transcript output stream, such that keyterms are treated as special cases of “prior tokens.”
This approach makes a lot of sense in that it combines all text into one stream and audio into another. As such, whatever mechanism the STT model has for comparing transcript to audio trivially extends to keyterms, simplifying the architecture.
However, it also has limitations. The first relates to symmetries, specifically permutation invariance. Generally, we want keyterm boosting to be permutation invariant, i.e., outputs should not depend on the ordering of the keyterms. By contrast, the transcription is manifestly not permutation invariant; a jumbled transcript is not as useful as one that is correctly ordered. So, for networks that see keyterms as having specific positions, whether through positional embeddings or their order in a tensor that is processed recursively, you may need to train the network to respect permutation invariance. Training symmetries into networks is imprecise, and generally increases training cost.
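One common (and admittedly imprecise) way to train that invariance in is data augmentation: shuffle the keyterm order every time a training example is built, so the network cannot rely on position. A minimal sketch, with a hypothetical example-building function:

```python
import random

# Hypothetical data-augmentation sketch: if the architecture assigns keyterms positions,
# shuffle their order each time a training example is constructed so the network cannot
# rely on it. The function and field names are made up for illustration.

def build_training_example(audio, transcript, keyterms):
    keyterms = list(keyterms)
    random.shuffle(keyterms)   # a different keyterm ordering every time this sample is seen
    return {"audio": audio, "keyterms": keyterms, "target": transcript}
```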
The second limitation relates to computational cost. By tying keyterms to transcripts, you are effectively dedicating the same amount of compute to each, when this is not clearly necessary. For instance, in models that use the attention mechanism, this would increase the cost of attention from O(T^2) to O((T+K)^2), which might limit the number of keyterms the model can cost-effectively support.
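To put rough numbers on that, a quick back-of-the-envelope calculation with illustrative (not measured) sequence lengths:

```python
# Back-of-the-envelope attention cost for folding keyterm tokens into the text stream.
# T and K are illustrative values, not measurements from any Deepgram model.
T = 1_500                      # transcript/prior-token positions
K = 500                        # keyterm tokens appended to the same stream
print(f"relative cost: {((T + K) ** 2) / (T ** 2):.2f}x")   # ~1.78x more attention compute
```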
An alternative is to represent keyterms as an entirely separate input stream. This enables more flexibility, potentially making it easier to control the compute cost associated with adding keyterms and to enforce properties like permutation invariance. But the additional flexibility also substantially expands the space of possible architectures to explore. For instance, should the model have mechanisms for explicitly combining each pair of streams individually? Or should they always communicate through one stream? Should both audio and keyterm inputs enter at the “bottom” of the model, or can keyterms be injected later? We do not claim to have explored the whole space of possibilities, but we have definitely found that some choices work better than others. For instance, we have found that giving the transcript stream access to a single merged audio/keyterm stream can be made very computationally efficient, but degrades accuracy.
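As one hypothetical instance of that last point, the sketch below shows a layer whose transcript stream attends to a single memory built by concatenating audio and keyterm encodings: a single attention pass, hence cheap, but as noted above this kind of merging can cost accuracy. The layer itself is an assumption for illustration, not Deepgram’s design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the "single merged stream" variant: the transcript stream attends
# to one memory formed by concatenating audio and keyterm encodings. Illustrative only.

class MergedMemoryLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, transcript_h, audio_feats, keyterm_feats):
        # transcript_h: (B, t, d); audio_feats: (B, T, d); keyterm_feats: (B, n, d).
        memory = torch.cat([audio_feats, keyterm_feats], dim=1)   # one merged (B, T + n, d) stream
        out, _ = self.attn(transcript_h, memory, memory)          # a single attention pass
        return out
```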
Another interesting example to consider is how keyterms should be represented to the bulk of the network: as single entities or as sequences of subwords (i.e., tokens)? And, if using tokens, is it better to use a custom vocabulary specifically for keyterms, or to share the vocabulary used for the transcript? Interestingly, we’ve found that subwords are not necessary for achieving high performance, but that models which use them can be easier to train; the performance of models that use a single representation per keyterm is somewhat sensitive to how that representation is generated. For instance, we have found that representing a keyterm by simple pooling of subword embeddings generally leads to lower keyterm accuracy compared to more sophisticated encoders.
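For intuition, here are two toy ways to collapse a keyterm’s subword tokens into a single vector: simple mean pooling versus a small learned encoder. Both are illustrative assumptions (a GRU stands in for a “more sophisticated encoder”); neither is the encoder used in Nova-3 or Flux.

```python
import torch
import torch.nn as nn

# Two toy ways to collapse a keyterm's subword tokens into a single vector.
# Illustrative assumptions only; not the encoders used in Nova-3 or Flux.

class PooledKeytermEncoder(nn.Module):
    def __init__(self, vocab_size=8192, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)

    def forward(self, subword_ids):                 # (B, n_subwords)
        return self.emb(subword_ids).mean(dim=1)    # simple mean pooling -> (B, d_model)

class LearnedKeytermEncoder(nn.Module):
    def __init__(self, vocab_size=8192, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, subword_ids):                 # (B, n_subwords)
        _, h = self.rnn(self.emb(subword_ids))
        return h[-1]                                # context-aware summary -> (B, d_model)
```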
From Nova-3 to Flux
During Nova-3 development, we settled on an architecture that optimized for keyterm boosting performance, as measured by keyterm recall rate (KRR), i.e., the ability of the model to correctly identify keyterms when they are present. However, when it came to Flux, we found we had to re-think our approach to better suit the low-latency, realtime streaming setting for which Flux was intended.
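For reference, here is a naive sketch of how KRR, and the related keyterm insertion rate (KIR) shown in the figure at the end of this post, could be computed. Real evaluation aligns the reference and hypothesis rather than just counting occurrences, so treat this as an approximation for intuition only.

```python
# Naive sketch of keyterm recall rate (KRR) and keyterm insertion rate (KIR).
# Real evaluation pipelines align reference and hypothesis; this just counts occurrences.

def keyterm_metrics(reference, hypothesis, keyterms):
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    kts = [k.lower() for k in keyterms]
    ref_count = sum(ref_words.count(k) for k in kts)     # keyterm occurrences in the reference
    hyp_count = sum(hyp_words.count(k) for k in kts)     # keyterm occurrences in the hypothesis
    krr = min(hyp_count, ref_count) / max(ref_count, 1)  # correct keyterms / total occurrences
    kir = max(hyp_count - ref_count, 0) / max(len(hyp_words), 1)  # inserted keyterms / total words
    return krr, kir
```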
Since Nova-3 operates in either a batch setting or a streaming setting with (comparatively) “infrequent” updates, we cared most about the total compute cost and memory requirement of our approach. In these settings we are less sensitive to how much memory the keyterm state takes up, since we have plenty of time to transfer state around or recompute it. Likewise, when streaming, the increased time between updates makes us less sensitive to the cost of each individual token.
In contrast, realtime streaming is all about managing marginal compute cost; you need to be able to process new data quickly enough to keep up with the rate of data ingress. And any state you need to persist or move around should ideally be as lightweight as possible. As a result, for Flux we were willing to pay more upfront fixed cost per stream if it reduced our marginal cost when decoding. Less vital, but admittedly also relevant: Flux uses a somewhat more complicated training curriculum, so we were also interested in modifications that might make training more efficient, such as simplifying how keyterm permutation invariance was built into the model.
Drawing on some of the observations above, and a not-small number of whiteboards and scraps of paper, we were able to completely redesign how Flux performs keyterm boosting to significantly reduce our marginal cost. In particular, we increased the sophistication of how we initially encode keyterms, and aggressively pruned the computation associated with the encoded keyterms within the core STT network. Ultimately, we were able to reduce per-token decoding cost and the memory required to store keyterm state by up to 90%, with only a modest 25% increase in the cost associated with the initial processing and encoding of keyterms. Most importantly, we were able to do so without sacrificing accuracy relative to Nova-3. Right on!
Violin plots comparing Flux keyterm recall rate (KRR = correct keyterms / total keyterm occurrences) and keyterm insertion rate (KIR = inserted keyterms / total words) to that of Nova-3. Flux performs almost as well, in spite of being computationally much cheaper and offering real time streaming.
