Article·Oct 27, 2025

Evaluating End-of-Turn (Turn Detection) Models

We were unsatisfied with the approaches to turn detection evaluation we’d seen out there, so we built our own. This post describes the data we used, why we think that’s the right data to use, and the algorithm we developed to calculate maximally precise accuracy metrics.

10 min read

By Jack Kearney

Staff Research Scientist


Overview

A key contributor to how natural a voice agent interaction feels is the agent’s turn-taking behavior. An agent that is too eager to respond may frequently interrupt the user, causing friction, whereas a more conservative system may feel slow and unresponsive.

Quickly and accurately predicting when someone has yielded their conversational turn and is awaiting a response (End-of-Turn, or EoT) is a crucial part of building natural interactions with voice agents. As such, there is an increasing number of models and solutions for detecting when someone has finished speaking. These range from all-in-one ASR + EoT solutions like Deepgram Flux or AssemblyAI Universal-Streaming, to audio-based solutions like Pipecat Smart Turn or Krisp Turn-Taking, to transcript-based solutions like the LiveKit EoU model.

When developing Flux, we wanted to make sure we would be providing customers with the best and fastest EoT detection. This meant we needed a way to evaluate various solutions, both to guide internal development and to compare Flux against other approaches. Moreover, we wanted to ensure we were evaluating performance on the actual problem we are trying to solve with the STT system: how often do you correctly detect EoT, and how long did it take to do so after the user finished speaking?

To do so, we did not want to rely on commonly used proxy objectives, such as word finalization time or the delay between the predicted time of the last word and the EoT detection time. In particular, these can “hide” latency in the time taken for finalization to occur. Similarly, we did not want to analyze each turn independently; while this makes it easy to, e.g., scan over thresholds, we found that mistakes made in a true streaming setting could have surprising “knock-on” effects. For instance, a premature EoT detection during a turn could impact the latency at the actual turn end by distorting the context (the model interprets the input as two separate turns instead of the single groundtruth turn). Finally, we did not want to rely on “vibes,” since we quickly discovered that models that “felt good” when tested with clean, high-quality inputs from our laptops and noise-cancelling mics did not necessarily perform as well in challenging real-world scenarios, even ones as mundane as phone conversations!

So, unsatisfied by other analyses we had seen, we decided to develop a new approach we felt would be maximally reliable for evaluating STT models, and particularly turn detection, in a voice agent setting.

Full Conversational Evaluation

When evaluating Flux, we do not carry out inference over individual turns but rather over complete conversations between humans. This does not fully reflect the voice agent scenario: there, if EoT detection occurred prematurely, the user would generally not just finish their thought but adapt in response to the agent starting to speak (most humans being more adaptable than most voice agents 😉). However, we believe using “true” conversations gives an invaluable added level of realism to our evaluation. In real conversations:

  • we have access to a highly natural, accurate, and low-latency turn detector in the form of the counter-party, which helps provide high-quality labels.
  • we can observe a “reasonable upper limit” on EoT detection time based on when the other person starts speaking again. This realistically models the fact that not all turns have the same detection budget; for simple queries that beg quick answers requiring minimal thinking, users will be less tolerant of long pauses arising from slow EoT detection. Our approach allows us to appropriately penalize models that fail to detect EoT within this reasonable period.
  • we observe “natural” pauses that occur before turns, which allows us to investigate other aspects of conversational modeling such as “start of speech” detection or, in the future, back-channeling.

To evaluate our models, we labeled over 100 hours of real conversations with both groundtruth transcripts and timestamped EoT labels. In doing so, we learned a little lesson about annotation task specification. In our initial spec, we used the term “End-of-Thought” instead of “End-of-Turn,” which caused confusion among annotators who (rightfully) realized that a “thought” was much more ambiguous than a “turn.” For instance, depending on what you define as a thought, “It’s nice to meet you. Thank you for coming today.” might be two thoughts, or one. And, we do not really care about this distinction; what we really want to know is whether it would be natural for a voice agent to start speaking where the label is applied. Clarification, and more precise terminology, improved self-reported annotator confidence from 5/10 to 8.5/10.

The Trouble with Timestamps

Generally, running Flux (or any model that supports turn detection) over these data gives us a series of times at which EoT detection occurred. To determine accuracy and latency, we need to combine this series with the groundtruth series of true EoT occurrences.

Our intention was to use the timestamps associated with the EoT labels described above, but here we ran into a surprising issue: human labels were not quite precise enough to accurately reflect Flux performance! Humans tended to be somewhat conservative in their label placement, leaving a small gap between the end of the last word and the label…and in many simple cases Flux was able to detect EoT within this gap! To address this issue, we used forced alignment to “update” the human timestamps, snapping each label to the end of the last word of the turn. This approach allowed us to refine our timestamps while retaining the value of our human labels.
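A minimal sketch of this correction, assuming we already have per-word end times from a forced aligner (the function and variable names are illustrative, not our production code):

from bisect import bisect_right

def snap_eot_labels(human_label_times, word_end_times):
    """Snap each human EoT label to the end time of the last aligned word
    that finishes at or before the label."""
    word_end_times = sorted(word_end_times)
    snapped = []
    for label_t in human_label_times:
        # Index of the last word whose end time is <= the human label.
        idx = bisect_right(word_end_times, label_t) - 1
        if idx < 0:
            # No word ends before this label; keep the human timestamp.
            snapped.append(label_t)
        else:
            snapped.append(word_end_times[idx])
    return snapped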

One argument against employing such a “correction” could be that the human label more accurately represents when it would be appropriate to start speaking. Specifically, humans do not generally start speaking the instant the counter-party stops but rather after a brief pause, of which the gap left by human annotators may be indicative. However, in natural conversations, speakers do frequently begin their turn just as the last turn is ending, and turns can naturally bleed into one another. Moreover, for voice agents specifically, their turn starts are frequently delayed relative to EoT detection due to extra latency associated with LLM and TTS generation. So, for the STT component of such an agent, it is advantageous to detect EoT as soon as possible, such that when the agent actually responds it will feel prompt and natural. For these reasons, we decided the “end of speech” timestamps extracted from forced alignment were more representative of our use case.

Once we had our new and improved series of groundtruth EoT timestamps, the second challenge was how to join it to the predicted series. The naive approach would be to join each groundtruth turn end time t_i,end to the earliest detection time t̂ that occurs after t_i,end but before the start of the next turn, i.e., the earliest t̂ satisfying t_i,end ≤ t̂ < t_(i+1),start.

This effectively requires that “correct” detections occur after speech has finished. But this is not really a true representation of the problem; in principle, a sophisticated model that fires aggressively might be able to detect EoT before a word is completed. This might have a negative impact on transcription accuracy (cutting off a transcript early) but, given the extrapolative capabilities of modern STT systems, it also might not. In addition, ordering-based approaches require highly accurate timestamps, which we might not have for all turns. So, we wanted to avoid imposing this requirement.

A simple relaxation is to allow EoT detection to occur slightly early, i.e., require t_i,end − ε ≤ t̂ < t_(i+1),start for some small tolerance ε.

But now we have introduced a new hyperparameter (ε) to which our results are sensitive, and in any case manual inspection revealed this did not significantly improve reliability.
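For concreteness, here is a minimal sketch of the time-based matching described above, with the early tolerance as an optional parameter (names and structures are illustrative):

def match_detections_by_time(turn_ends, next_turn_starts, detections, early_tolerance=0.0):
    """Greedily match each groundtruth turn end to the earliest unused
    detection in its window [turn_end - early_tolerance, next_turn_start).
    Returns per-turn latencies in seconds (None = missed turn) and the
    number of leftover detections, which count as false positives."""
    detections = sorted(detections)
    used = [False] * len(detections)
    latencies = []
    for t_end, t_next in zip(turn_ends, next_turn_starts):
        match = None
        for k, t_det in enumerate(detections):
            if used[k]:
                continue
            if t_det >= t_next:
                break
            if t_det >= t_end - early_tolerance:
                match = k
                break
        if match is None:
            latencies.append(None)  # missed turn (false negative)
        else:
            used[match] = True
            latencies.append(detections[match] - t_end)
    false_positives = used.count(False)
    return latencies, false_positives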

Using Sequence Alignment to Improve Turn Boundary Detection Evaluation

Relying on time-based alignment left us dissatisfied; we felt the metrics produced by the above approaches were noisy, imprecise and, in particular, underestimated the true performance of the various turn detection algorithms. Then we had an idea: what if, instead of relying on temporal alignment, we used sequence alignment? Sequence alignment is common in STT; alignment between the groundtruth transcript and the predicted transcript is used to compute the word error rate (WER) of STT systems. By treating turn boundaries as just another custom word/token ([EoT]) in the transcript, we could effectively use the additional context provided by the transcript to better determine which groundtruth turn boundary a particular EoT prediction should align with.
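As a rough sketch of the idea, here is one way to count turn-boundary hits and misses from such an alignment, using Python’s standard-library difflib as a stand-in for a proper Levenshtein aligner (the tokenization and counting are illustrative, not our exact implementation):

from difflib import SequenceMatcher

EOT = "[EoT]"

def eot_counts(truth_tokens, pred_tokens):
    """Align groundtruth and predicted token sequences (words plus [EoT])
    and count matched, spurious, and missed turn boundaries."""
    matcher = SequenceMatcher(a=truth_tokens, b=pred_tokens, autojunk=False)
    tp = fp = fn = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        truth_eots = truth_tokens[i1:i2].count(EOT)
        pred_eots = pred_tokens[j1:j2].count(EOT)
        if tag == "equal":
            # [EoT] sits in the same place in both transcripts: a hit,
            # regardless of how early or late the raw timestamp was.
            tp += truth_eots
        else:
            # Unmatched groundtruth boundaries are misses; unmatched
            # predicted boundaries are spurious detections. Pairing a
            # displaced [EoT] with a nearby groundtruth [EoT] requires
            # the modified alignment discussed below.
            fn += truth_eots
            fp += pred_eots
    return tp, fp, fn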

As an example of how this can help, consider the following illustrative case (transcript and timestamps hypothetical):
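TRUTH:      Yeah, that works for me. [EoT @ 12.48s]
PREDICTION: Yeah, that works for me. [EoT @ 12.30s]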

Based purely on temporal alignment, since the prediction fired 180ms early, we would ding the model with both a false positive and a false negative. But from the alignment it’s pretty clear that the model was closer to correct than wrong, just a bit early. So, by generating alignments that combine both turn boundaries and transcripts, we get a more accurate assessment of system performance and a more nuanced understanding of how the full system is performing.

Changing from a purely time-based evaluation algorithm had a huge impact; we saw 3-5% absolute increases in precision and recall across models, both Flux and others we evaluated (e.g., Pipecat Smart Turn, LiveKit EOU, etc.). And, manual investigation of results suggested the new values were more representative of the true performance. Note that these improvements held for all-in-one solutions, text-based turn detectors, and audio-only models, though for the latter we had to add transcripts for the periods between detected turns. To do so, we used Nova-3, because (a) many platforms (such as Vapi or Cartesia) offer Nova-3 + audio turn detector as an all-in-one solution, (b) doing so reduced the likelihood of biasing the comparison relative to using Flux, and (c) it’s the best 😉

Sllllllliiiiiddddddde to the left…

Our initial algorithm relied on a standard Levenshtein alignment implementation, but we quickly discovered one important modification was needed to handle a very particular edge case: dropped turns. Consider the following scenario:

TRUTH:      Hi  Chau! [EoT-1] I'm fine thanks. [EoT-2] Sure that sounds great. [EoT-3]
PREDICTION: Hi!                                 [EoT]  Sure that sounds great.  [EoT] 

In this case, our STT system has dropped the middle turn (“I’m fine thanks.”). As a result, from an alignment perspective, it is ambiguous where to align the first predicted [EoT]; aligning with either [EoT-1] or [EoT-2] would be valid. In reality, though, even just based on the text, the alignment with [EoT-1] seems substantially more plausible, and timestamps can provide additional evidence as to which is more likely.

To better handle “dropped turns,” we implemented a modified version of the Levenshtein algorithm that determines the best alignment not only based on overall edit distance but also based on the most likely alignment between EoT tokens. In spite of the name of this subsection, this algorithm is slightly more complicated than simply sliding [EoT] tokens to the left — for instance, unthinkingly doing so above would end up aligning [EoT] with Chau! — but the added complexity ultimately improved performance.
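To make the idea concrete, here is a simplified sketch of one way to bias the alignment: track cost as a pair (edit distance, summed [EoT] timestamp discrepancy) and minimize lexicographically, so that among equally good word alignments the one pairing [EoT] tokens with the closest timestamps wins. This is a sketch of the general technique, not our exact implementation; the token structure and penalty are illustrative, and a backtrace over the same table would recover the actual [EoT] pairings.

EOT = "[EoT]"

def align_with_eot_tiebreak(truth, pred):
    """Levenshtein alignment over (text, time) tokens; time is None for
    ordinary words and the timestamp (seconds) for [EoT] tokens. Costs are
    (edit_distance, eot_time_penalty) tuples compared lexicographically, so
    equal-edit-distance alignments prefer pairing [EoT]s with the closest
    timestamps."""
    n, m = len(truth), len(pred)
    INF = (float("inf"), float("inf"))
    # dp[i][j] = best cost of aligning truth[:i] with pred[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0.0)
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            edits, penalty = dp[i][j]
            if i < n:  # drop a groundtruth token (deletion)
                dp[i + 1][j] = min(dp[i + 1][j], (edits + 1, penalty))
            if j < m:  # extra predicted token (insertion)
                dp[i][j + 1] = min(dp[i][j + 1], (edits + 1, penalty))
            if i < n and j < m:  # match or substitute
                (t_text, t_time), (p_text, p_time) = truth[i], pred[j]
                if t_text == p_text == EOT:
                    cand = (edits, penalty + abs(t_time - p_time))
                elif t_text == p_text:
                    cand = (edits, penalty)
                else:
                    cand = (edits + 1, penalty)
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cand)
    return dp[n][m]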

Turn Start Evaluation

In addition to detecting when a turn ends, it can be important to detect when someone starts speaking, i.e., start-of-turn (SoT). Fast SoT detection is an important component of voice agent pipelines for handling cases where the user wants to interrupt the agent, or barge in. Specifically, the longer it takes to detect an interruption, and thus stop the agent speaking, the longer the period of overlapping speech and the less natural the interaction will feel.

To evaluate SoT, we look at the first non-trivial “speech detection” from the STT system following a successful EoT detection, and compare that to the groundtruth timestamp for the first word in the next turn. We use the word start time, since this is when the turn started, but note that this generally represents an unachievable lower bound; it is much more reasonable to expect detection by the end of the first word. One advantage of defining SoT as the start of the first word is that negative SoT latency can be directly interpreted as a “false positive,” i.e., a case where the model detected the next turn before it actually started. In a voice agent, this would likely correspond to the agent being interrupted before the user had actually said anything.
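A minimal sketch of this measurement, assuming we already have the model’s speech-detection events, its successful EoT detections, and groundtruth start times for the first word of each following turn (names are illustrative):

def sot_latencies(eot_detections, speech_detections, next_word_starts):
    """For each successful EoT detection, find the first speech-detection
    event after it and measure latency against the groundtruth start of
    the next turn's first word. Negative latency means the model detected
    the next turn before the user actually started speaking (a false
    positive)."""
    speech_detections = sorted(speech_detections)
    latencies = []
    for t_eot, t_word_start in zip(eot_detections, next_word_starts):
        first_speech = next((t for t in speech_detections if t > t_eot), None)
        if first_speech is not None:
            latencies.append(first_speech - t_word_start)
    false_positive_rate = (
        sum(1 for x in latencies if x < 0) / len(latencies) if latencies else 0.0
    )
    return latencies, false_positive_rate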

Below, we show Flux SoT detection latency, as well as the duration distribution of the first words in the turns. We see that Flux generally detects SoT within ~100-200ms of a typical first word duration, and exhibits a low false positive rate (≲1-2%). Note that, while we compared to another STT-based solution in Universal-Streaming, we did not compare to pure VAD-based solutions for two reasons. First, their performance depends on the EoT solution with which they are paired. Second, they do not guarantee an actionable transcript at SoT detection. E.g., such solutions would not be viable if you wanted to use the initial prediction to determine whether the user speech might represent backchanneling as opposed to an attempted interruption.

The above analysis considers all possible turns in our conversational dataset. This is not ideal since, unlike EoT, SoT latency is only relevant for the subset of turns that correspond to an interruption. If you are not currently speaking, and are instead waiting for someone else to say something, there is nothing immediately actionable about when they start, only when they end. However, identifying “true” interruptions is challenging and ambiguous, and anyway it is not clear that the pattern of human-human interruptions matches that of human-agent interruptions. Consequently, we decided this approach was more representative of Flux’s SoT detection capabilities than an approach based on filtering to “possible” interruptions.

What’s Next

Once you start considering the whole conversational aspect of speech, and move beyond Word Error Rate (WER), there’s a whole host of metrics one can explore, and different ways they can be calculated. For instance, one could combine the various aspects of turn detection, such as false positive rate and latency, into single “quality” metrics, such as the Voice Agent Quality Index (VAQI) developed by our Voice Agent Engineering Team. We believe a particularly interesting frontier is to incorporate a notion of semantics/conversation flow into these metrics, i.e., stratified or weighted metrics that reflect that some turns merit faster responses than others. We hope to share more of our fun research on that front soon!