
Building a Conversational AI Flow with Deepgram

By Shir Goldberg
Published Sep 23, 2022 · Updated Jun 13, 2024

How do you know when someone is finished talking? Before I started working at Deepgram, I hadn’t thought about this question much. When having conversations in person, we humans can use all sorts of contextual cues, body language, and societal norms to figure out when someone has finished their thought and we can jump in with our own opinion. But as we’ve all seen over Zoom during the last few years, figuring out when someone is done talking is a lot harder to do virtually. It’s even harder when the listener isn’t human at all, but a machine learning model transcribing speech!

Business problems that need speech-to-text often also need an understanding of when a speaker has completed their thought. One common use case for this is building conversational AI bots that need to respond to a user’s queries. The bot needs to be careful both not to cut the user off and to respond quickly enough that the conversation feels “real-time”.

Deepgram’s real-time speech-to-text service provides two main mechanisms that can help build a conversational flow. One is interim results, and the other is endpointing. Together, the two can give you information about when a speaker has finished talking, and when your system should respond.

Interim results, which are disabled by default, are sent back every few seconds. These messages, marked with is_final=false, indicate that Deepgram is still gathering more audio and the transcription results may change as additional context is given. Once Deepgram has collected enough audio to make the best possible prediction, it will finalize the prediction and send back a transcript marked with is_final=true.
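For illustration, here’s roughly what the two kinds of messages look like, sketched as Python dicts with most fields omitted (real responses also carry timing, confidence, and other metadata):

```python
# Trimmed, illustrative shapes of live-transcription messages.
interim_message = {
    "channel": {"alternatives": [{"transcript": "how do you know when"}]},
    "is_final": False,  # still gathering audio; this transcript may change
}

final_message = {
    "channel": {"alternatives": [
        {"transcript": "how do you know when someone is finished talking"}
    ]},
    "is_final": True,   # prediction finalized; this transcript won't change
}
```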

Endpointing, which is enabled by default, is an algorithm that detects the end of speech. When endpointing triggers, Deepgram immediately sends back a message. These messages will be marked with speech_final=true to indicate an endpoint was detected and is_final=true to indicate the transcription is finalized.
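In client code, these two flags are all you need to tell the message types apart. Here’s a minimal sketch (the classify helper is purely illustrative, not part of any Deepgram SDK):

```python
# Classifying a live-transcription message by the two flags described above.
def classify(message: dict) -> str:
    if message.get("speech_final"):
        return "endpoint"   # endpoint detected; transcript is finalized
    if message.get("is_final"):
        return "finalized"  # transcript finalized, but speech may continue
    return "interim"        # interim result; transcript may still change
```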

The simplest way to determine when someone is done talking is based on silence. Endpointing can give you almost immediate feedback when silence is detected, which may be useful for applications that prioritize quick processing of results. Here’s a code example that uses your microphone and the Python package beepy to play a notification sound when Deepgram detects an endpoint.

To run the code, first install beepy using pip install beepy, then save the following code as endpointing.py.
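This is a minimal sketch of what endpointing.py could look like, assuming the websockets and pyaudio packages (pip install websockets pyaudio) for streaming raw microphone audio to the wss://api.deepgram.com/v1/listen endpoint; the 16 kHz linear16 settings and helper names are illustrative choices, not requirements of the API:

```python
# endpointing.py -- sketch: stream microphone audio to Deepgram and beep
# whenever an endpoint (speech_final=true) is detected.
import argparse
import asyncio
import json

import beepy        # pip install beepy
import pyaudio      # pip install pyaudio
import websockets   # pip install websockets

RATE = 16000   # sample rate we tell Deepgram to expect
CHUNK = 4000   # 0.25 s of 16-bit mono audio per read

async def run(api_key: str) -> None:
    url = ("wss://api.deepgram.com/v1/listen"
           "?encoding=linear16&sample_rate=16000&channels=1")
    async with websockets.connect(
        url, extra_headers={"Authorization": f"Token {api_key}"}
    ) as ws:

        async def send_microphone_audio():
            stream = pyaudio.PyAudio().open(
                format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
            loop = asyncio.get_running_loop()
            while True:
                # Read off the event loop so receiving isn't blocked.
                data = await loop.run_in_executor(
                    None, lambda: stream.read(CHUNK, exception_on_overflow=False))
                await ws.send(data)

        async def receive_transcripts():
            async for message in ws:
                res = json.loads(message)
                if "channel" not in res:   # skip metadata messages
                    continue
                transcript = res["channel"]["alternatives"][0]["transcript"]
                if res.get("speech_final"):
                    print(f"Endpoint detected: {transcript}")
                    beepy.beep(sound=1)    # play the notification sound
                elif transcript:
                    print(f"Transcript so far: {transcript}")

        await asyncio.gather(send_microphone_audio(), receive_transcripts())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", "--key", required=True, help="Deepgram API key")
    args = parser.parse_args()
    asyncio.run(run(args.key))
```

Turn your volume up so you’ll be able to hear the beep sound, and run the code: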

python endpointing.py -k 'YOUR_DG_API_KEY'

As you may notice when using endpointing.py, Deepgram flags the end of speech as soon as you pause, meaning that in a conversational flow, this logic can easily cut you off mid-sentence every time you make even a minor pause. Rather than responding immediately, many applications will want to wait a few seconds after a speaker finishes talking. This can be especially effective in conversational AI, where users may speak for long stretches and occasionally pause mid-thought; waiting a few seconds to respond can produce a more natural conversational flow. A combination of endpointing and interim results can be used to determine when a desired duration of silence has passed.

Here’s a code example that uses your microphone and the Python package beepy to play a notification sound once a configurable SILENCE_INTERVAL of silence has elapsed. (The default is 2.0 seconds, but this can be overridden when running the script.)

To run the code, install beepy using pip install beepy (if you haven’t already), then save the following code as silence_interval.py.
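Here’s a minimal sketch of what silence_interval.py could look like, under the same assumptions as the previous example (websockets and pyaudio for streaming, illustrative audio settings and helper names). The idea is to enable interim_results, treat any message containing words as the most recent speech, and beep once the configured interval passes without new words:

```python
# silence_interval.py -- sketch: beep only after a configurable interval
# of silence, combining endpointing with interim results.
import argparse
import asyncio
import json
import time

import beepy
import pyaudio
import websockets

RATE = 16000
CHUNK = 4000   # 0.25 s of 16-bit mono audio per read

async def run(api_key: str, silence_interval: float) -> None:
    url = ("wss://api.deepgram.com/v1/listen"
           "?encoding=linear16&sample_rate=16000&channels=1"
           "&interim_results=true")
    async with websockets.connect(
        url, extra_headers={"Authorization": f"Token {api_key}"}
    ) as ws:
        # last_speech: when we last saw words; pending: speech awaiting a beep
        state = {"last_speech": None, "pending": False}

        async def send_microphone_audio():
            stream = pyaudio.PyAudio().open(
                format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
            loop = asyncio.get_running_loop()
            while True:
                data = await loop.run_in_executor(
                    None, lambda: stream.read(CHUNK, exception_on_overflow=False))
                await ws.send(data)

        async def receive_transcripts():
            async for message in ws:
                res = json.loads(message)
                if "channel" not in res:   # skip metadata messages
                    continue
                transcript = res["channel"]["alternatives"][0]["transcript"]
                if transcript:
                    # Words in any message (interim or final) reset the clock.
                    state["last_speech"] = time.monotonic()
                    state["pending"] = True
                if res.get("is_final") and transcript:
                    print(f"Finalized: {transcript}")

        async def watch_for_silence():
            while True:
                await asyncio.sleep(0.05)
                if (state["pending"] and state["last_speech"] is not None and
                        time.monotonic() - state["last_speech"] >= silence_interval):
                    state["pending"] = False
                    beepy.beep(sound=1)   # blocks briefly while the sound plays

        await asyncio.gather(send_microphone_audio(),
                             receive_transcripts(),
                             watch_for_silence())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", "--key", required=True, help="Deepgram API key")
    parser.add_argument("-s", "--silence", type=float, default=2.0,
                        help="seconds of silence before the beep")
    args = parser.parse_args()
    asyncio.run(run(args.key, args.silence))
```

Turn your volume up so you’ll be able to hear the beep sound, and run the code: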

python silence_interval.py -k 'YOUR_DG_API_KEY' [-s SILENCE_INTERVAL_SECONDS_FLOAT]

These two examples should give you an idea of how different conversational flow mechanisms feel, and how they can be incorporated into different types of real-time speech-to-text applications. Both can be found in this GitHub repo.

We hope these examples help as you decide how to best utilize Deepgram's functionality. Happy building!

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub Discussions.
