
By Jack Kearney
Staff Research Scientist
We recently deployed an improved version of Flux, trained using a novel reinforcement learning-based fine-tuning/post-training paradigm that specifically rewards end-of-turn correctness as opposed to correctness at any given point in time. The resulting model exhibits improved transcription, by up to 10% relative on certain audio domains, and significantly reduced false-positive start-of-turn detection, by up to 70%.
The Perils of Being Greedy
As astute observers of Flux outputs might have noticed, Flux transcripts are constantly being revised throughout the turn, analogous to how human understanding of speech evolves as we ingest more context. For those more familiar with LLMs, the way this works under the hood is loosely analogous to the concept of “test time compute;” Flux has some transcription “budget” that it spends over the course of the turn. The reason for this approach is to unlock low latency end-of-turn transcription; by constantly computing a hypothesis of the most likely transcript were the turn to end at that moment, Flux is ready to provide a high-quality transcript as soon as the turn actually ends.
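Loosely, that behavior can be pictured as a loop that re-hypothesizes the full transcript after every incoming audio chunk. The sketch below is purely illustrative (the `transcribe_so_far` function is a hypothetical stand-in for the model, not how Flux is actually implemented):

```python
# Sketch of a streaming STT loop that maintains a "working transcript":
# after every chunk, re-hypothesize the most likely transcript as if the
# turn ended right now. Hypothetical stand-in logic, not Flux itself.

def transcribe_so_far(audio_chunks):
    """Hypothetical model call: returns the best transcript hypothesis
    for the audio received so far."""
    # Stand-in: pretend each chunk decodes cleanly to one word.
    return " ".join(chunk["word"] for chunk in audio_chunks if "word" in chunk)

def stream_turn(chunks):
    """Yield a revised working transcript after each incoming chunk, so a
    final transcript is ready the instant the turn actually ends."""
    received = []
    for chunk in chunks:
        received.append(chunk)
        # Earlier hypotheses may be revised by later ones.
        yield transcribe_so_far(received)

# Example: hypotheses evolve as audio streams in.
turn = [{"word": "I"}, {"word": "am"}, {"word": "walking"}]
hypotheses = list(stream_turn(turn))
# hypotheses == ["I", "I am", "I am walking"]
```

The point of spending compute this way is that the final hypothesis is already computed when the turn ends, rather than starting transcription from scratch at that moment.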
Like most traditional STT systems, Flux was initially trained to transcribe all words in a given complete audio segment (i.e., given this audio segment, produce all of these words that occurred in the segment). While this works relatively well in the inference context described above, it still results in a modest train-test mismatch, since during continuous streaming inference Flux will generally be called upon to update its transcript while a turn is still incomplete. By greedily transcribing everything in such moments as though the turn were complete then, Flux would occasionally (read: infrequently but still more than our target goal of “never”) either (i) hallucinate common words (e.g., “Hey” or “I”), resulting in false positive speech detection, or (ii) prematurely output a word or punctuation mark that is unlikely to be correct, effectively wasting some of its finite compute budget that could be better spent elsewhere.
Said another way: the initial version of Flux treated each turn like it might end at any moment, but this is not the case. E.g., if the user is mid-word, then their turn is obviously not ending then, so greedily guessing what the turn would be if it did end then is a waste of test time compute. Similarly, if a user has not even said a complete word, they certainly have not said a complete turn. So, if your goal is to optimize for end-of-turn accuracy, there is no need to be overly eager to predict the turn start. However, our initial training approach did not take this into account, and so Flux was not able to learn that, in certain situations, it might be advantageous to wait for more information.
Since the initial release of Flux, we have developed a new training approach, based on reinforcement learning, that allows Flux to learn how to optimally spend its inference budget in the real, streaming inference setting. The resulting model is slightly more conservative when it comes to transcription, which has three notable implications for developers. Relative to the initial version of Flux, this new version exhibits
- Improved transcription quality, by up to 10% relative for certain types of audio data,
- Reduced false positive rate for start-of-turn detection, by up to 70%, and
- Faster end-of-turn detection at eot_threshold >= 0.8, by 50-150ms.
To see these benefits, developers working with Flux will need to do…absolutely nothing! The new version was already launched in early December; to start, we applied our new training recipe to achieve a modest fine-tuning of the existing Flux model (“Flux V0.1”) such that the behavior is almost identical to that of the original model with the notable exceptions listed above.
Below, we describe the promise and challenges for training streaming conversational STT models with reinforcement learning, and the impact that its use had on Flux.
Why (Not) Reinforcement Learning
If we think about an STT system as a transcription agent then, when we perform supervised training, we are effectively showing the model which sequence of actions to take, i.e., which tokens to output and in what order, given the input audio. Such training examples are relatively easy to create for “complete” turns, since we know all words in the turn need to be transcribed, but for incomplete turns this is much more complicated. For instance, suppose at a given time T someone has said “I am walking to…” The turn might end with them saying “I am walking to the store” or “I am walking towards the room” or even “I am walking too much.” So, at T, is it better to output “to,” or wait for more information?
The answer is that, a priori, we do not know, since the preferred path is the one that yields the most accurate transcript as soon after the end of turn as possible. So, rather than being prescriptive about what the model should do at any given time, an approach more aligned with our ultimate goal of fast and accurate end-of-turn transcripts is to let Flux explore different approaches to transcription and then reward the ones that yield the best results. And the way to do this is to use reinforcement learning (RL)!
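One way to picture such a reward, in heavily simplified form: score only the transcript available at end of turn, discounted by how late it arrives. This is an illustrative sketch of the shape of the objective (our own notation and toy accuracy metric, not the actual Flux training recipe):

```python
def eot_reward(final_hypothesis, reference, finalize_delay_ms,
               delay_penalty_per_ms=0.001):
    """Hypothetical reward: word-level accuracy of the end-of-turn
    transcript, minus a penalty for finalizing late. Intra-turn outputs
    are deliberately not scored, leaving the model free to explore
    when (and whether) to emit words mid-turn."""
    hyp, ref = final_hypothesis.split(), reference.split()
    # Toy accuracy: positional word matches over reference length.
    correct = sum(h == r for h, r in zip(hyp, ref))
    accuracy = correct / max(len(ref), 1)
    return accuracy - delay_penalty_per_ms * finalize_delay_ms

# A fast, accurate finalization scores higher than a slow or wrong one.
good = eot_reward("i am walking to the store", "i am walking to the store", 50)
slow = eot_reward("i am walking to the store", "i am walking to the store", 400)
wrong = eot_reward("i am walking too much", "i am walking to the store", 50)
# good > slow and good > wrong
```

Because only the end-of-turn result is scored, the model that guesses “to” at time T and the model that waits are judged identically at T; whichever habit yields better final transcripts, faster, wins out during training.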
Careful readers to whom this seems obvious will note that the introduction implied Flux V0 was not trained this way, and so might ask “why not?” Well, and stop me if you’ve heard this one: while RL is very powerful, it’s also very tricky to get right. During RL, models generally learn to solve exactly the problem they are given, which can have unintended consequences. For instance, think about the numerous complaints you’ve seen about coding agents generating thousands of lines of code that their developer cannot understand; these models have frequently been trained to generate a working system based on a > 100k token context window, not to write readable code for a carbon-based lifeform with a < 200 token context window!
Similarly, for streaming STT systems, we found RL could lead to the following negative behaviors:
- “Completely giving up” on transcribing challenging words in favor of “focusing” on easier words (since in a naive reward scheme a reduction in “substitution” errors can offset an increase in “deletion” errors), or
- “Delaying” transcription, allowing the model to collect as much information as possible before outputting anything.
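The first failure mode falls directly out of how word error rate is accounted. A quick sketch with illustrative counts (not actual training data) shows why a WER-only reward is indifferent to giving up:

```python
def wer(substitutions, deletions, insertions, n_ref_words):
    """Word error rate from standard error counts: (S + D + I) / N."""
    return (substitutions + deletions + insertions) / n_ref_words

# Suppose a reference turn has 10 words, one of them acoustically hard.
# Attempting the hard word and getting it wrong costs 1 substitution:
attempt = wer(substitutions=1, deletions=0, insertions=0, n_ref_words=10)
# "Giving up" on the hard word entirely costs 1 deletion instead -- and
# the naive WER is identical, so a reward built only on WER gives the
# model no incentive to ever attempt hard words:
give_up = wer(substitutions=0, deletions=1, insertions=0, n_ref_words=10)
# attempt == give_up == 0.1
```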
So, while we came early to the realization that RL would be the right tool to train the model to make good “intra-turn” decisions, it took a while to develop a satisfactory recipe.
Flux V0.1: A More Conservative Transcriber
In the same way that reinforcement learning promotes an LLM from a “stochastic parrot” that tries to regurgitate good answers to an “agent” that tries to satisfy the user, RL arguably promotes Flux from a naive STT system that tries to transcribe everything it has heard to a transcription agent that decides when/what to transcribe as audio streams in. So, how does the “agent” version of Flux, V0.1, compare to the original V0? The answer is that the model is more conservative (i.e., better at spending its budget) and, as such, able to achieve higher accuracy and less likely to be “tricked” into thinking some non-word sound is the start of turn. For instance, if you were to painstakingly evaluate the “working transcripts” output by Flux V0.1 throughout the course of the turn (note: you should not actually do this since it’s annoying and, anyway, we have done it for you as shall imminently become apparent), you would see a reduction in cases where the model
- outputs a word and subsequently removes it, by 20%, and
- changes the last word it output, by 30%.
Notably, this is not the result of a hard-coded lookahead/delay, nor does the model learn to consistently delay transcription. Indeed, in our RL training paradigm, the model could not learn a consistent delay since it is still penalized for not having a maximally accurate transcript at end-of-turn. Instead, the model learns when to delay transcription or not, allowing improved usage of budget without sacrificing end-of-turn latency.
Improved Turn Detection
The advantage of a more conservative transcriber is that it is less likely to be tricked into thinking it heard speech when it did not, or to have to delay end-of-turn while it fixes an incorrect transcript. Correspondingly, Flux V0.1 exhibits fewer false positive start-of-turn detections, and reduced end-of-turn detection latency, particularly at higher eot_threshold and in the tails.
To evaluate start-of-turn detection, we compare Flux’s StartOfTurn detection time with the start time of the first word in the turn (for more details on how we evaluate these turn-oriented STT models, see Evaluating End-of-Turn (Turn Detection) Models). In this paradigm, a detection latency of “zero” is not really to be expected, since that would correspond to detecting and outputting the word before it had been fully spoken. We define latency this way so that latency < 0 has a distinct interpretation: such cases correspond to likely false positives, since we detected speech before any word was spoken.
The plot below shows the full cumulative density function of StartOfTurn detection latency for Flux V0 and V0.1. Since it’s hard to see (like we said, Flux V0 only occasionally falsely detected turn start in our benchmarking), we have included in the caption the density (i.e., frequency) of detections with latency < 0, i.e., false positives. Flux V0.1 achieves a 0.4% false positive rate, an over 70% reduction compared to the 1.5% rate observed for Flux V0.
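Concretely, the false positive rate here is just the mass of the latency distribution below zero. A minimal sketch with made-up numbers (not our benchmark data):

```python
def start_of_turn_latencies(detections_ms, first_word_starts_ms):
    """Latency = StartOfTurn detection time minus the start time of the
    first word in the turn. Negative latency means speech was 'detected'
    before any word was spoken, i.e., a likely false positive."""
    return [d - w for d, w in zip(detections_ms, first_word_starts_ms)]

def false_positive_rate(latencies_ms):
    """Fraction of turns whose detection latency is below zero."""
    return sum(lat < 0 for lat in latencies_ms) / len(latencies_ms)

# Illustrative: one early detection out of five turns -> 20% FP rate.
lat = start_of_turn_latencies([120, 90, -40, 150, 110],
                              [100, 60, 0, 130, 80])
fpr = false_positive_rate(lat)
# lat == [20, 30, -40, 20, 30]; fpr == 0.2
```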
As with most things, this improvement is not entirely free (I was disappointed to find, upon moving into ML in industry, that there is “no free lunch,” especially given free lunch is one of the primary motivators for physics PhD students). Flux V0.1 typically detects start-of-turn ~40ms slower than Flux V0. However, as the CDF of first word durations indicates, both versions still detect turn start within 200ms of when the user actually starts speaking, well within the regime of “normal.”
Just like for start-of-turn, improvements for end-of-turn show up in the tails, i.e., those difficult cases where being a little bit more conservative might lead to a higher quality prediction. The plot below is analogous to the one above, but focused on end-of-turn detection latency, for an eot_threshold = 0.8. At the median, we see a modest latency reduction of 40ms, but speedups of closer to 100-150ms at higher percentiles. Similar to the above, for reference we also show a measurement of the typical gap between turns in conversations between two humans; notably, Flux reproduces a very realistic “shape” of turn gaps, but is still able to detect turn end quite a bit faster than a typical human response.
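Tail-focused comparisons like this one boil down to reading off a few percentiles of the latency distribution. A small sketch using Python’s standard library and synthetic numbers (not our benchmark data):

```python
import statistics

def latency_percentiles(latencies_ms, percentiles=(50, 90, 99)):
    """Summarize a latency distribution at the median and in the tails,
    where differences between models tend to show up."""
    # quantiles() with n=100 returns the 99 cut points for p = 1..99.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {p: qs[p - 1] for p in percentiles}

# Synthetic latencies: 101 turns evenly spread from 100ms to 200ms.
summary = latency_percentiles(list(range(100, 201)))
# summary == {50: 150.0, 90: 190.0, 99: 199.0}
```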
Below, we show the WER and end-of-turn detection F1 score as a function of median detection latency as controlled by eot_threshold (lower threshold = crossed earlier = faster detection, but more false positives). As we can see, the new model behaves almost identically, apart from being slightly faster (and more accurate!) at higher thresholds.
We do observe a minor slowdown (< 40ms) at the left side of the plot, but note this corresponds to eot_threshold = 0.6, lower than the values preferred by most developers.
Improved Transcription Quality
Since the changes in behavior predominantly impact working transcripts and the implications for turn detection, it is not immediately obvious these changes would necessarily result in significant improvements to finalized accuracy. However, again leveraging the analogy to “test time compute,” it is not unreasonable that this is the case. If you have a finite time (or token budget) to answer a question and you devote some of that precious budget to a fruitless line of reasoning, that reduces the time you have available to find the right answer. And, indeed, we see that these changes in behavior do correspond to meaningful changes in transcription accuracy.
Some of this is apparent above; Flux V0.1 exhibits a lower WER across most of the parameter space than its predecessor. However, since “two person conversation-oriented” data can be somewhat narrow in terms of acoustic conditions or topics covered, we also compared the “pure transcription” capabilities of the model on a broader data sample.
When evaluating STT models at Deepgram, we typically use internal test sets that are reflective of the wide range of real world use cases and audio conditions encountered by our customers. We do not prefer open source evaluation sets such as Common Voice due to their more artificial nature, and in fact have found that achieving high performance on Common Voice can typically come at the detriment of performance on customer data. For instance, whereas our internal test sets consist of natural conversation (including natural pauses and intonation), Common Voice consists predominantly of people “reading” off given sentences. Also, since the test split (in fact, the whole dataset) is public, there may be temptation to over-fit on Common Voice in order to look more impressive on benchmarking platforms.
In this case, however, our internal evaluation on a truly held-out Common Voice test set revealed significant improvements from this methodology, potentially due to the model’s ability to “wait” when confronted with challenging or more stilted speech. Specifically, whereas we observed a modest 3-5% improvement on our internal test sets, the improvement on our Common Voice test set was closer to 10%! So, especially since we are focused on relative Flux improvements and not comparing to competitors (who might have their own view of what an appropriate held-out set is), here we also share the results of our internal Common Voice benchmarking.
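For clarity on the arithmetic, “relative” improvement here means the fraction of the old error rate that was eliminated, not a raw difference in WER points (the numbers below are illustrative, not our actual measurements):

```python
def relative_improvement(wer_old, wer_new):
    """Relative WER improvement, as quoted throughout this post:
    the fraction of the old error rate that was removed."""
    return (wer_old - wer_new) / wer_old

# Illustrative: a drop from 10% WER to 9% WER is a 10% relative
# improvement, even though it is only 1 absolute WER point.
rel = relative_improvement(0.10, 0.09)
# rel ~= 0.10
```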
The plots above show finalized transcription accuracy on two test sets, comparing the original version of Flux with this latest update. As you can see, the new version of Flux is modestly more accurate! This was a pleasant surprise in the land of English STT, where we find our models are very close to the accuracy ceiling imposed by ground truth noise (either due to inherent ambiguity in transcription or annotator mistakes).
Conclusions
Our new training approach results in a more economical version of Flux, resulting in better overall accuracy and, notably, fewer false positives when it comes to start-of-turn detection. Now, if only I could work out how to train my offspring to be more economical with their “TV budget,” we’d really be onto something…



