Fluxing Conversational State and Speech-to-Text

Introducing The Flux Chronicles
Welcome to The Flux Chronicles, a deep dive series exploring the research, engineering, and ideas behind Flux, Deepgram’s conversational speech recognition model. Each entry unpacks a different facet of how Flux fuses transcription and conversational state modeling into a single, real-time system, transforming speech recognition from passive listening into active dialogue understanding.
This first chapter, “Fluxing Conversational State and Speech-to-Text,” sets the stage by examining why modeling conversation itself, not just words, is the key to natural, interruption-free voice agents.
Conversational State Management
In addition to the ability to listen and speak, voice agents need to determine when to do each. The solution employed by many systems is a conversational state machine, i.e., a model that represents both the user’s current behavior, or state, and the transitions between states that signal what action the agent should take and when. The simplest possible example is shown below; it models the user as being in one of two states, either LISTENING or SPEAKING. The EndOfTurn event corresponds to the transition from SPEAKING to LISTENING, and so indicates an appropriate time for an agent to respond. Likewise, a StartOfTurn event that occurs while the agent is speaking may indicate the user is attempting to interrupt the agent, or “barge in,” and thus that the agent should stop speaking.


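To make the two-state picture concrete, here is a minimal sketch in Python of the LISTENING/SPEAKING machine described above. The state and event names mirror the diagram; the agent object and its is_speaking(), stop_speaking(), and respond() methods are hypothetical placeholders, not any particular SDK.

```python
from enum import Enum, auto

class UserState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class TwoStateMachine:
    """Minimal two-state model of the user: LISTENING <-> SPEAKING."""

    def __init__(self, agent):
        self.state = UserState.LISTENING
        self.agent = agent  # hypothetical agent with is_speaking()/stop_speaking()/respond()

    def on_event(self, event: str):
        if event == "StartOfTurn" and self.state == UserState.LISTENING:
            self.state = UserState.SPEAKING
            # If the agent is mid-utterance, the user is barging in.
            if self.agent.is_speaking():
                self.agent.stop_speaking()
        elif event == "EndOfTurn" and self.state == UserState.SPEAKING:
            self.state = UserState.LISTENING
            # An appropriate moment for the agent to respond.
            self.agent.respond()
```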
Before disgruntled AI researchers and pundits start throwing bricks attached to copies of “The Bitter Lesson” through our office windows, let us note that such state machines are not the only, and probably not even the ideal, solution to the problem of controlling voice agent actions. Speech-to-speech (S2S) models, an active topic of research at Deepgram and elsewhere, do not use explicit state machines. Instead, they learn directly from data how to behave in different circumstances. Any notion of conversational state is implicitly encoded in the models’ parameters and activations.
While we remain bullish about the future of S2S, we see several reasons that the state machine approach remains appealing, particularly for today’s agents. First, it allows the voice agent “brains” to be supplied by a text-based or multimodal LLM, which generally exhibits stronger problem-solving capabilities than state-of-the-art S2S models, capabilities that many business use cases require. I.e., we trade the smoother conversation flow of an S2S model for the more valuable ability of the LLM to solve the customer’s problem. Second, state machines are easier to configure and customize; adapting an S2S model to a particular domain or behavior requires training on dedicated conversational data representing that domain or behavior. Third, state machines are highly interpretable, which makes them easy to debug. By comparison, many S2S models do not expose, e.g., text-based internal representations that a human can read to understand what the model “heard” or its “thought process.”
Note that this also makes state machines useful in non-voice agent scenarios, such as human agent assist. Since (a) not all humans operate in their native language or culture and (b) conversational cues vary across languages/cultures, a state machine can provide helpful “prompts” to guide humans towards more natural conversational flow in their second language.
The problem we see with current implementations of state machines is that they are…well…kinda dumb. Or, said more precisely, they lack robustness and consistency, leading to suboptimal conversational flow as a result of a lack of shared context across the conversational model. To be clear, this is not the fault of voice agent developers; they are making great use of the tools they have, they just have not always been given the right tools. For example, developers might use a voice activity detector (VAD) like Silero or WebRTC to determine start of speech, a custom turn detector such as Krisp Turn-Taking, Pipecat Smart Turn, or LiveKit EOU to determine end of speech, and a speech-to-text (STT) model such as Nova-3 to do the listening for the agent while the user is in the SPEAKING state. With all the different states and transitions modeled independently, it is unsurprising that inconsistencies, delays, or just generally unnatural behavior can arise! For instance, what should you do when the turn detector outputs an end-of-turn, but the STT model outputs a partial or empty transcript?
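To illustrate the kind of glue code this fragmented setup forces on developers, here is a rough sketch of that last edge case. Every name in it is hypothetical and purely illustrative, not any real SDK; the point is that the reconciliation policy has to be invented by the developer, and every option trades latency against talking over the user.

```python
# Hypothetical glue code for independently modeled components (turn detector
# + STT). The names below are illustrative placeholders, not any real SDK.

def reconcile_end_of_turn(transcript_text: str, transcript_is_final: bool) -> str:
    """Decide what to do when the turn detector fires end-of-turn."""
    if transcript_is_final and transcript_text:
        return "respond"            # happy path: detector and STT agree
    if not transcript_text:
        # Detector says the turn ended, STT says it heard nothing -- now what?
        # Waiting adds latency; responding risks talking over the user.
        return "wait_and_repoll"
    # Partial transcript: respond to an incomplete utterance, or wait for a
    # final result that may arrive after an unpredictable delay.
    return "wait_and_repoll"

print(reconcile_end_of_turn("", False))         # -> "wait_and_repoll"
print(reconcile_end_of_turn("hi there", True))  # -> "respond"
```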
With Flux, we decided it was time to move towards a fully end-to-end solution by combining conversational flow modeling with transcription, a paradigm we refer to as Conversational Speech Recognition (CSR). Think of it like the first “S” in S2S; it’s a system that doesn’t just learn how to transcribe, but also how to understand the results of what it hears in the context of a conversation.
While there are other models, such as Universal-Streaming from AssemblyAI, that also claim to be STT models capable of detecting conversational events, we believe Flux offers an unmatched combination of:
Flexibility: Flux’s simple yet expressive representation of conversational states makes it easier for developers to implement voice agents and permits a substantial degree of configurability,
Accuracy: Flux models conversational flow and speech recognition together, producing a smarter and more consistent experience, and
Latency: As it is constantly modeling both speech and conversational flow, Flux updates both in tandem such that neither is waiting on the other. The result is speed when it matters most, with high-quality transcripts available as soon as end-of-turn is detected.
The rest of this blog post is devoted to an in-depth analysis of the most popular cat videos on the internet (cough) I mean, a deeper dive into these aspects!
The Flux State Machine
One advantage of letting developers handle their own state machines is maximal flexibility. But it is hard work, has a high associated cost (as mentioned above), and is not necessary for most use cases. For Flux, we developed a state machine that we believe covers a wide range of use cases, based on the experience we’ve gained building our Voice Agent API. The specific user states and transitions we expose to developers are shown below:


One key decision we made above is to draw a distinction between when the user stops speaking (SPEECH_STOPPED) and when they have actually yielded their turn (the EndOfTurn transition). This is motivated very much by how humans think during conversations; we are generally thinking of what we want to say next while listening, but that does not mean we blurt it out the instant the other person finishes speaking. While Flux, as an STT model, is not simultaneously thinking of a response, the existence of the SPEECH_STOPPED state allows developers to mimic a human’s ability to be prepared to speak immediately upon EndOfTurn; upon observing a transition to SPEECH_STOPPED, which we call EagerEndOfTurn, developers can invoke an LLM call to generate a possible response. If EndOfTurn does occur, the response is ready to go, resulting in lower latency and a more natural conversational experience, particularly for use cases where additional work (such as RAG) needs to be done before the response is ready. If the user resumes speaking, as indicated by a TurnResumed event, that response can be discarded in favor of an updated response the next time speech stops.
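As a concrete illustration of this eager end-of-turn workflow, the sketch below handles the three events in Python. It assumes the developer supplies their own asynchronous generate_response() and send_response() helpers (hypothetical stand-ins for LLM and TTS plumbing) and that a speculative generation is kicked off on EagerEndOfTurn.

```python
import asyncio

class EagerTurnHandler:
    """Hedged sketch of the eager end-of-turn workflow described above.
    generate_response() and send_response() are hypothetical placeholders
    for the developer's own LLM / TTS calls."""

    def __init__(self, generate_response, send_response):
        self.generate_response = generate_response
        self.send_response = send_response
        self.pending = None  # speculative LLM task, if any

    async def on_event(self, event: str, transcript: str):
        if event == "EagerEndOfTurn":
            # Speech stopped: speculatively start generating a reply.
            self.pending = asyncio.create_task(self.generate_response(transcript))
        elif event == "TurnResumed":
            # The user kept talking: throw the speculative reply away.
            if self.pending:
                self.pending.cancel()
                self.pending = None
        elif event == "EndOfTurn":
            # The turn is really over: use the staged reply if we have one,
            # otherwise generate now.
            task = self.pending or asyncio.create_task(self.generate_response(transcript))
            self.pending = None
            await self.send_response(await task)
```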
The additional states and events associated with SPEECH_STOPPED help with our internal conversational modeling, and the aforementioned “eager end-of-turn workflow” may provide benefits to developers who are particularly latency-sensitive. But, the other nice thing about them is that they can be ignored with impunity! In particular, staging and discarding responses can add complexity and cost due to the additional LLM calls required, and so are not suitable for all developers. Fortunately, our benchmarking indicates voice agents listening with Flux will still be able to respond sufficiently quickly to feel natural, even if only preparing responses on EndOfTurn.
Another thing we want to point out is that, while the states we expose are hard-coded, the transition conditions are not. For instance, we let developers configure which conditions should trigger EndOfTurn, allowing them to choose the trade-off between precision, recall, and latency that works best for them. Likewise, by default EagerEndOfTurn events are not exposed, but developers can opt in to this functionality simply by specifying an eager_eot_threshold as part of their request.
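For example, opting in might look something like the following sketch. The connection URL and model name here are placeholders (consult the Flux docs for the actual endpoint and SDK usage), but eot_threshold and eager_eot_threshold are the two knobs discussed above.

```python
from urllib.parse import urlencode

# Hypothetical request construction: the host and model name are placeholders,
# but eot_threshold / eager_eot_threshold are the parameters discussed above.
params = {
    "model": "flux",             # assumption: illustrative model name
    "eot_threshold": 0.7,        # when to emit EndOfTurn
    "eager_eot_threshold": 0.5,  # opt in to EagerEndOfTurn events
}
url = "wss://example.invalid/listen?" + urlencode(params)  # placeholder host
print(url)
```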
CSR Permits Symbiosis
As alluded to above, our motivation for combining conversational modeling with traditional STT was not just to make things easier for voice agent developers; we also observed that by doing so we could achieve better and more consistent outcomes for both!
In particular, one shortcoming of the “model all transitions independently” approach is that it tends to result in a unidirectional flow of information. Again, consider the example of end-of-turn modeling. One approach is to use an acoustic detection model such as Pipecat Smart Turn, and trigger complete, finalized transcription whenever end-of-turn is detected. Alternatively, you can use a linguistic (i.e., text-based) detector such as LiveKit EOU, which is fed by ongoing STT outputs. Each case establishes a hierarchy between models: either turn detection feeds into STT or vice versa. And doing things “one at a time” limits functionality; e.g., waiting for end of turn to finalize transcription deprives the user of important visual cues about what the model hears in real time.
By contrast, Flux uses bidirectional information flow between turn detection and transcription. Working transcripts stream into the part of the model responsible for conversational modeling, and the resulting conversational states provide context for the part responsible for transcription, e.g., biasing it towards certain outputs based on whether the incoming audio represents the continuation of the current turn or the start of a new turn.


One of the ways that we discovered the importance of this bidirectional information flow was through evaluating the consistency of the model outputs. Our initial prototypes used the more standard approach of relying on a conventional turn detector to identify when to finalize transcripts. However, when pushing towards faster detection, we discovered that the detector had a tendency to be a little too aggressive; at low eot_threshold, as much as 20% of true positive end-of-turn detections would occur before the person had fully finished speaking, which resulted in turn transcripts getting cut off and degraded WER. Transcription accuracy was also adversely impacted by turns with longer pauses getting cut into two, depriving the model of the “full turn” context. Analogously, when eager_eot_threshold << eot_threshold, transcripts would change non-trivially between EagerEndOfTurn and EndOfTurn for 5% of turns. Such an approach was clearly optimized for the “primary” model in the hierarchy, i.e., the end-of-turn detector, at the expense of the “secondary” transcription model.
As you can probably guess from the above, Flux is not an ensemble of different models for different tasks, but rather a single model that jointly models transcription and conversational flow (and, before you ask, no it’s not a VAD). The result is better accuracy and consistency for both turn detection and transcription. For instance, by moving to an approach that combines conversational state and STT information, we were able to significantly reduce the rates of “unhelpfully fast” end-of-turn detection and of turns cut into two. This improved the overall transcription quality, reducing WER at low eot_threshold by almost 10%. And, we were able to improve the ergonomics of the eager end-of-turn workflow, ensuring consistency between EagerEndOfTurn and final EndOfTurn transcripts.
The plot below shows WER as a function of end-of-turn detection latency, as controlled through eot_threshold , evaluated on an internal conversational benchmark dataset. Notably, Flux outputs a consistent, high-quality transcript even as turn detection parameters are tuned, with only a slight degradation at low thresholds. More generally, we found that Flux achieves comparable Word Error Rate (WER) to Nova-3 Streaming, and in particular significantly outperformed competitors on both this conversational dataset and our traditional STT benchmarking sets. Flux also supports keyterm boosting, again achieving comparable performance to Nova-3 on rare but important words.


Do not let the WER-focused examples above fool you; joint modeling also benefits conversational flow understanding. Compared to models specifically designed for end-of-turn detection, Flux was able to achieve higher precision, recall, and F1 on the end-of-turn prediction task. Moreover, it was able to do so while listening to only one side of the conversation; in the future, we aim to unlock even better performance by incorporating additional conversational context.


Another aspect of Flux visible in the plot above is its configurability, as evidenced by the relative spread of operating points. This indicates developers can achieve meaningful trade-offs in latency and accuracy through their choice of threshold. Again, our experience suggests this capability is very much unlocked through joint modeling, which encourages the model to pay attention to both acoustics and linguistics. For instance, an early training curriculum biased Flux to over-index on acoustics (i.e., silence) as indicative of end-of-turn. The result was that the predicted eot_probability would jump dramatically after a certain amount of silence, such that detection latency remained fairly constant over a range of thresholds, i.e., operating points were tightly clustered in latency. By modifying the curriculum and data, we were able to achieve better-calibrated probabilities, giving rise to the aforementioned tunability.
One thing that is not evident in the plot above, which focuses on the median end-of-turn detection latency, is the latency spread that can be expected when using these models. A little bit of variance in detection latency is not a bad thing; some turn ends are more ambiguous than others, and humans naturally pause more or less before responding depending on, e.g., how much thought their response requires. However, for humans we expect this distribution to be somewhat regular, whereas we found that it was highly bimodal for several of the models tested, as a result of over-reliance on a VAD-based “timeout” when the model is not confident. As such, those models exhibit reasonable latency just over 50% of the time, but with a long tail of very slow responses. By comparison, we found that Flux’s eot_probability grows steadily after the end of speech, resulting in a more regular latency spread that we believe will help Flux contribute to natural voice agent conversations.
The plot below shows the full latency percentiles for two choices of threshold, eot_threshold = 0.7 and 0.8. To highlight the accuracy impact of increasing the threshold, we treat any false positive detection as having infinite latency. You can see that increasing the threshold adds around 100ms of latency across the distribution, not just at the median, but in exchange for a ~10% increase in overall precision.
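For readers who want to reproduce this style of analysis on their own data, here is a minimal sketch of the percentile computation described above, treating false-positive detections as having infinite latency. The latency values are made up for illustration only.

```python
import numpy as np

# Made-up per-turn end-of-turn detection latencies in ms; false-positive
# detections are treated as infinite latency, as described above.
latencies_ms = np.array([180.0, 240.0, 310.0, 95.0, 400.0, np.inf, 220.0, 150.0])

for p in (50, 90, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f} ms")
```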


Combined Solutions Minimize Compute Latency
For all models, we define latency as the audio time elapsed after speech stopped before detection occurs, i.e., how much additional audio the model needed to see to correctly detect end-of-turn. This obscures the additional latency associated with any compute that needs to be done upon detection, but it is the fairest comparison given that said compute latency depends on how solutions are orchestrated. For instance, consider a solution where an acoustic end-of-turn detection triggers transcription. Such a solution would incur additional latency of t_eot + t_stt, corresponding to the time taken to run the turn detection and transcription models respectively, on top of the values reported in the plots above. In cases where you had to transcribe an entire long turn, t_stt could be substantial. Alternatively, t_stt can be reduced by transcribing the turn in short chunks, so that upon EndOfTurn you only need to transcribe the last chunk, but cutting up the turn may reduce accuracy.
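As a back-of-envelope illustration of that orchestration cost, the short calculation below compares a cascaded pipeline against a joint model. Every number in it is an assumed placeholder for illustration, not a measurement.

```python
# Back-of-envelope comparison; every number below is an assumed placeholder.
t_detection_audio_ms = 300   # audio needed after speech stops to detect end-of-turn
t_eot_compute_ms = 40        # assumed turn-detector inference time
t_stt_full_turn_ms = 600     # assumed time to transcribe a long turn from scratch

# Cascaded pipeline: detect end of turn, then transcribe the whole turn.
cascaded_ms = t_detection_audio_ms + t_eot_compute_ms + t_stt_full_turn_ms

# Joint model: the transcript is maintained continuously, so only a small
# amount of compute remains at EndOfTurn.
joint_extra_compute_ms = 75  # assumed, in the 50-100 ms range quoted below
joint_ms = t_detection_audio_ms + joint_extra_compute_ms

print(f"cascaded: ~{cascaded_ms} ms, joint: ~{joint_ms} ms")
```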
By contrast, with Flux, transcription is not waiting on end-of-turn detection, nor vice versa. Since both the transcript and the end-of-turn probability are continuously updated together, the transcript is ready to go right when it is needed, i.e., upon EndOfTurn. In practice, we have found that even under heavy load (i.e., large batches), the additional Flux compute latency is generally on the order of 50-100ms or less. I.e., for Flux, there are “no hidden fees”; the numbers shown above should be highly representative of what users experience. For more details on how we’ve optimized Flux compute, stay tuned for our upcoming post, 🐎 Accelerating Streaming STT Inference Through Custom Kernels.
The Flux Catacitor
Finally, for all of you who stuck with us to the end, the cute cat video!
If you want to see Flux in action (no cats required):