Article · AI Engineering & Research · Sep 22, 2025
15 min read

Working with Timestamps, Utterances, and Speaker Diarization in Deepgram

Speech-to-text (STT) is the baseline; the value is in the metadata. Deepgram returns timestamped transcripts with utterances and speaker diarization so you can reliably answer who said what, and when. Follow this in-depth tutorial to make the most of your STT models!
By Stephen Oladele

šŸŽ„ Video Overview

⩠TL;DR

  • STT is the baseline; the value is in the metadata. Deepgram returns timestamped transcripts with utterances and speaker diarization so you can reliably answer who said what, and when.

  • Turn on utterances=true, diarize=true, and smart_format=true in Deepgram's STT API to get structured, context-aware results out of the box.

  • With this, you can power search, segmentation, QA checks, and analytics (talk time, interruptions, and silence).

  • We'll normalize Deepgram's response into a reusable "turns" schema and use it to build a simple speaker-aware subtitles demo.

  • Follow the guide to learn more and build voice AI applications on Deepgram’s feature-rich STT API.

Introduction

Speech-to-text (STT) isn't just about words; it's about context. In real conversations, who spoke, when they spoke, and how long they spoke often matter as much as what they said. A flat, speakerless transcript throws away that structure.

Deepgram's speaker diarization and utterance segmentation give you timestamped transcripts that preserve conversational context:

  • Utterances: semantically coherent segments with start/end times.

  • Speaker labels: identify who spoke in each segment.

  • Word-level timing (optional): fine-grained alignment when you need it.

In this guide, you'll:

  1. Call Deepgram with utterances, diarize, and smart_format enabled.

  2. Understand the response (words → utterances → speakers).

  3. Normalize it into a "turns" data model you can store, index, and analyze.

  4. Build a speaker-aware subtitles demo application and outline downstream recipes for search, QA, and analytics.

We'll also cover production realities such as overlapping speech, short back-channels ("mm-hmm"), and multi-channel recordings, so your pipeline is robust, not just a demo.

šŸ‘€ Preview: here's how the same clip becomes useful data. Speaker A (00:04–00:12) asks a question; Speaker B (00:12–00:18) answers; you can jump playback by utterance, compute talk-time, and flag policy terms.

Speech Is Made of Contextual Segments, and Text Should Be Too!

Human speech isn't a neat paragraph; it comes in segments like questions, answers, interjections, and hand-offs. To keep that structure, your transcript needs three things:

  • Utterances: coherent segments of speech with a single intent.

  • Timestamps: start/end times for each segment (and words) so you can align UI, video, or actions.

  • Speaker diarization: labels that tell you who spoke in each segment.

Together, these turn a wall of text into a timestamped transcript you can navigate, search, QA, and analyze.

How Deepgram exposes this (out of the box):

  • Enable utterances=true to receive results.utterances[] with { start, end, transcript, confidence, words[] }.

  • Enable diarize=true to add speaker labels per utterance (and per word).

  • Enable smart_format=true for readable punctuation/casing and normalized entities (numbers, dates, etc.).

  • If your source is multi-channel (e.g., agent/customer on separate channels), you can process per channel; if it's single-channel with multiple speakers, use speaker diarization.

What to use when:

  • Subtitles and media controls: utterance timestamps for segment jumps; word timestamps for precise cueing.

  • Search and retrieval: index utterance text with (start, end, speaker) so results jump to the right moment.

  • QA and compliance: compute talk time, interruptions, and policy coverage from utterance boundaries.

  • Analytics: aggregate per-speaker metrics (duration, WPM, silence between turns).

We'll normalize Deepgram's response into a reusable turns schema { id, start, end, speaker, text, words[], channel } and use it to power subtitles, search, QA checks, and dashboards.
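
For concreteness, here's a minimal sketch of that normalization in Python. The field names follow the schema above; the Turn class and the turns_from_utterances helper are illustrative names of our own, not part of Deepgram's SDK.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One attributed, timestamped segment of speech (a 'turn')."""
    id: str
    start: float          # seconds from the start of the audio
    end: float            # seconds from the start of the audio
    speaker: int          # diarization label (0, 1, ...)
    text: str             # smart-formatted transcript for the segment
    words: list = field(default_factory=list)  # optional word-level timings
    channel: int = 0      # audio channel, for multi-channel sources

def turns_from_utterances(utterances):
    """Map Deepgram's results.utterances[] (as dicts) into Turn records."""
    return [
        Turn(
            id=f"turn-{i}",
            start=u["start"],
            end=u["end"],
            speaker=u.get("speaker", 0),
            text=u["transcript"],
            words=u.get("words", []),
            channel=u.get("channel", 0),
        )
        for i, u in enumerate(utterances)
    ]
```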

In the next section, you'll call the STT API with utterances, diarize, and smart_format enabled, inspect the JSON, and map it into the turns schema, covering edge cases like short back-channels and overlapping speech along the way.

Working with Utterances, Timestamps, and Diarization in Deepgram

With Nova-2 and Nova-3, Deepgram can return timestamped transcripts complete with utterances and speaker diarization, all in one call. To enable these features, you just pass the corresponding parameters in your API request.

Here's a simple curl example that transcribes a prerecorded audio file using the nova-3 model with utterances, speaker diarization, and smart formatting enabled:
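
Something like this (a sketch assuming a local audio.wav and your key exported as DEEPGRAM_API_KEY; adjust the file path and Content-Type to match your audio):

```bash
curl --request POST \
  --header "Authorization: Token $DEEPGRAM_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary @audio.wav \
  "https://api.deepgram.com/v1/listen?model=nova-3&utterances=true&diarize=true&smart_format=true"
```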

What those flags do
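
  • utterances=true: groups words into results.utterances[] segments, each with start, end, transcript, confidence, and words[].

  • diarize=true: adds a speaker label to each utterance (and to each word).

  • smart_format=true: applies punctuation, casing, and entity formatting (numbers, dates, etc.) for readable output.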

šŸ’” Tip: Dual-channel call recordings? Skip diarize and map by channel instead.

Here's a single utterance (truncated):
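
(The exact fields can vary with model and options; the values below are illustrative rather than real output.)

```json
{
  "start": 4.08,
  "end": 11.89,
  "confidence": 0.97,
  "channel": 0,
  "speaker": 0,
  "transcript": "Thanks for calling. How can I help you today?",
  "words": [
    {
      "word": "thanks",
      "start": 4.08,
      "end": 4.31,
      "confidence": 0.99,
      "speaker": 0,
      "punctuated_word": "Thanks"
    }
  ]
}
```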

Python SDK (prerecorded)
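
A minimal sketch with the Deepgram Python SDK (v3+; assumes DEEPGRAM_API_KEY is set in your environment and a local audio.wav. Method names can differ slightly between SDK versions, so check the SDK docs for yours):

```python
from deepgram import DeepgramClient, PrerecordedOptions

# Reads DEEPGRAM_API_KEY from the environment
deepgram = DeepgramClient()

options = PrerecordedOptions(
    model="nova-3",
    utterances=True,
    diarize=True,
    smart_format=True,
)

with open("audio.wav", "rb") as audio:
    payload = {"buffer": audio.read()}

response = deepgram.listen.rest.v("1").transcribe_file(payload, options)

# Each utterance carries timing, a speaker label, and the formatted text
for utt in response.results.utterances:
    print(f"[{utt.start:7.2f}-{utt.end:7.2f}] Speaker {utt.speaker}: {utt.transcript}")
```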

Running this prints a fully enriched transcript, which is the foundation for the speaker-aware subtitles you'll build next. In production you'd:

  1. Store the utterances[] array as a "turns" table.

  2. Convert it to WebVTT/SRT for media players.

  3. Index (start, end, speaker, text) for search and QA analytics.

šŸ‘‰ Heads-up on accuracy: Short back-channels ("uh-huh") or overlapping speech can yield lower diarization confidence. We'll show simple post-processing fixes later.

šŸ’” Tips:

  • Want fewer micro-segments? Tune utterance boundary sensitivity with utt_split (e.g., utt_split=1.0).

  • See model options for nova-3 and variants.

Adding Captions to Audio with Deepgram

Deepgram's timestamped transcripts (with utterances=true and optional diarize=true) map cleanly to standard caption formats that browsers and media players understand: SRT (SubRip Text) and WebVTT (Web Video Text Tracks).

Rather than hand-coding time math and line-wrapping, you can use Deepgram's caption helpers to emit valid files in one step.

(If you just need a file on disk, check out the one-liner helper in deepgram-captions.)

Option A: Use the Caption Helpers

Turning the timestamped transcript JSON into usable captions is a two-step job with deepgram-captions:

  1. Get an enriched transcript (utterances=true&diarize=true&smart_format=true).

  2. Serialize each utterance into SRT or WebVTT cues.

This handles the WEBVTT header, millisecond separators, line breaks, and sane cue splitting.
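
A sketch of those two steps with the deepgram-captions package (here dg_response is the JSON dict returned by the transcription call above; check the package README for the exact API in your version):

```python
from deepgram_captions import DeepgramConverter, srt, webvtt

# dg_response: the enriched JSON (utterances + diarization) from the earlier request
converter = DeepgramConverter(dg_response)

with open("captions.srt", "w") as f:
    f.write(srt(converter))

with open("captions.vtt", "w") as f:
    f.write(webvtt(converter))
```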

Option B: DIY formatter (if you need full control)

Now that you've used the helper, if you need more control over subtitle generation you can write a Python script to convert Deepgram transcripts into either SRT or WebVTT. Each utterance from the transcript is matched to a numbered time block for SRT or a cue for WebVTT.

First, make sure your DEEPGRAM_API_KEY is set as an environment variable. Then import the required modules:
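
For example (reusing the SDK client from earlier; the assertion is just a convenience check):

```python
import os

from deepgram import DeepgramClient, PrerecordedOptions

# Fail early with a clear message if the API key isn't set
assert os.environ.get("DEEPGRAM_API_KEY"), "Set DEEPGRAM_API_KEY before running"
```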

Set the path to your audio or video file:
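
For instance (the filename here is just a placeholder):

```python
# Audio or video file to caption
AUDIO_PATH = "interview.wav"
```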

Next, create a function to format Deepgram's timestamps. Since timestamps are returned in seconds, you need to convert them into the correct format for each subtitle type.
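
One way to write it; SRT separates milliseconds with a comma, WebVTT with a period:

```python
def format_timestamp(seconds: float, fmt: str = "vtt") -> str:
    """Convert seconds (e.g. 63.25) to 'HH:MM:SS,mmm' for SRT or 'HH:MM:SS.mmm' for WebVTT."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"
```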

Write a function to convert utterances into WebVTT cues. Each cue contains a start and end time, the transcript text, and a speaker label.
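
A sketch of that function, reusing format_timestamp from above (it expects utterances as plain dicts):

```python
def to_webvtt(utterances) -> str:
    """Build a WebVTT document from Deepgram utterances (start, end, speaker, transcript)."""
    lines = ["WEBVTT", ""]
    for utt in utterances:
        start = format_timestamp(utt["start"], "vtt")
        end = format_timestamp(utt["end"], "vtt")
        speaker = utt.get("speaker", 0)
        lines.append(f"{start} --> {end}")
        # <v Speaker N> is WebVTT's voice tag, which players can style per speaker
        lines.append(f"<v Speaker {speaker}>{utt['transcript']}")
        lines.append("")
    return "\n".join(lines)
```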

You'll also write a similar function to generate an SRT file. Each block will be numbered and contain a time range and the spoken text.
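
For example:

```python
def to_srt(utterances) -> str:
    """Build an SRT document: numbered blocks with a time range and the spoken text."""
    blocks = []
    for i, utt in enumerate(utterances, start=1):
        start = format_timestamp(utt["start"], "srt")
        end = format_timestamp(utt["end"], "srt")
        speaker = utt.get("speaker", 0)
        blocks.append(f"{i}\n{start} --> {end}\nSpeaker {speaker}: {utt['transcript']}\n")
    return "\n".join(blocks)
```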

With your custom helper functions ready, write the main function that uses Deepgram to transcribe an audio file, then convert the results into the format you choose.
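
A sketch of that main function (it assumes the v3+ SDK, where responses expose to_dict(); adjust the attribute access if your SDK version differs):

```python
def main(output_format: str = "vtt") -> None:
    deepgram = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment

    options = PrerecordedOptions(
        model="nova-3", utterances=True, diarize=True, smart_format=True
    )
    with open(AUDIO_PATH, "rb") as audio:
        response = deepgram.listen.rest.v("1").transcribe_file(
            {"buffer": audio.read()}, options
        )

    # Work with the plain-dict form of the response
    utterances = response.to_dict()["results"]["utterances"]

    if output_format == "srt":
        path, content = "captions.srt", to_srt(utterances)
    else:
        path, content = "captions.vtt", to_webvtt(utterances)

    with open(path, "w") as f:
        f.write(content)
    print(f"Wrote {len(utterances)} cues to {path}")

if __name__ == "__main__":
    main()
```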

And with that, you've completed one half of the project. You can get the full app.py script here: SRT and WebVTT Subtitle generator.

Production Pointers

  • Line length and duration: keep cues ≤ 42 chars and ≤ 6 s. Split long utterances if needed.

  • Overlap: if back-channels ("uh-huh") overlap, shift them ±100 ms or merge.

  • Async transcription: use transcribe_file_async + webhook for large batches.

What's Next?

Now that we can export captions, you’ll build a speaker-aware video player that:

  1. Loads the WebVTT file.

  2. Highlights the active speaker cue.

  3. Lets users click an utterance to seek.

āž”ļø Jump to the next section!

Building a Speaker-Aware Captioning Application

In the previous section, you generated subtitle files, such as WebVTT, using Deepgram's transcription output. These files contain both the text and metadata, such as timestamps and speaker labels.

Once you have a subtitle file, you can use it directly in popular video players like VLC or build a custom HTML5 video player that colors each speaker's lines in real time.

Why Not the Default <track>?

You could simply attach the VTT file:
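
For example (the filenames are placeholders):

```html
<video controls>
  <source src="interview.mp4" type="video/mp4">
  <track kind="captions" src="captions.vtt" srclang="en" label="English" default>
</video>
```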

Browsers will render captions automatically, and you can even style them per speaker with ::cue(v[voice="Speaker 1"]) { color: coral; }.

However, if you need full control (custom fonts, live transcript pane, analytics hooks), you'll want your own overlay. You'll build a simple custom web video player that:

  • Loads a video and a corresponding WebVTT file

  • Parses the cues and displays them as styled captions

  • Assigns each speaker a unique color

  • Updates captions live as the video plays

HTML: Basic Structure

We'll start by creating the HTML layout for our player (see repo).

This layout includes two file inputs for uploading a video and a .vtt file, a video player element, and a box where you will display the active caption.
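
A minimal sketch of that layout (the element IDs are placeholders the script below refers to; the full markup lives in the repo):

```html
<input type="file" id="videoInput" accept="video/*">
<input type="file" id="vttInput" accept=".vtt">

<video id="player" controls></video>

<!-- The currently active caption is rendered here, colored per speaker -->
<div id="captionBox"></div>
```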

JavaScript: Core Logic

Let's move to JavaScript (see repo). This is where we:

  • Load and parse the VTT file

  • Track video time updates

  • Display cues with speaker-specific colors

Each cue corresponds to a WebVTT cue, which includes start, end, text, and speaker.
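
A sketch of that logic (element IDs match the layout above; the parser only handles the simple cues our exporter writes, and colorFor is defined in the next snippet):

```javascript
const player = document.getElementById("player");
const captionBox = document.getElementById("captionBox");
let cues = [];

// Load the selected video into the player
document.getElementById("videoInput").addEventListener("change", (e) => {
  player.src = URL.createObjectURL(e.target.files[0]);
});

// Load and parse the selected .vtt file into { start, end, speaker, text } cues
document.getElementById("vttInput").addEventListener("change", async (e) => {
  cues = parseVtt(await e.target.files[0].text());
});

// "00:00:04.080" -> seconds as a number
function toSeconds(ts) {
  const [h, m, s] = ts.split(":");
  return Number(h) * 3600 + Number(m) * 60 + Number(s.replace(",", "."));
}

function parseVtt(text) {
  const parsed = [];
  for (const block of text.split(/\n\n+/)) {
    const lines = block.trim().split("\n");
    const timing = lines.find((l) => l.includes("-->"));
    if (!timing) continue; // skips the WEBVTT header block
    const [start, end] = timing.split("-->").map((t) => toSeconds(t.trim()));
    const raw = lines.slice(lines.indexOf(timing) + 1).join(" ");
    const match = raw.match(/^<v ([^>]+)>(.*)$/); // e.g. "<v Speaker 0>Hello there"
    parsed.push({
      start,
      end,
      speaker: match ? match[1] : "Speaker",
      text: match ? match[2] : raw,
    });
  }
  return parsed;
}

// Show whichever cue covers the current playback time
player.addEventListener("timeupdate", () => {
  const cue = cues.find((c) => player.currentTime >= c.start && player.currentTime <= c.end);
  if (!cue) {
    captionBox.textContent = "";
    return;
  }
  captionBox.textContent = `${cue.speaker}: ${cue.text}`;
  captionBox.style.color = colorFor(cue.speaker); // see the next snippet
});
```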

Assign unique colors to speakers
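
One simple approach, used by the timeupdate handler above:

```javascript
// Assign each speaker label a stable, visually distinct color
const speakerColors = new Map();

function colorFor(speaker) {
  if (!speakerColors.has(speaker)) {
    const hue = (speakerColors.size * 67) % 360; // step around the color wheel
    speakerColors.set(speaker, `hsl(${hue}, 70%, 60%)`);
  }
  return speakerColors.get(speaker);
}
```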

See an example of how the interface might look:

Result: Load any video + Deepgram-generated WebVTT and watch speaker-colored captions update live.

šŸ‘‰ Heads-up: WebVTT cues shorter than ~500 ms can flash. Feel free to merge adjacent cues or tweak utt_split before exporting. For production, use the browser's native TextTrack API and ::cue CSS for simpler styling if you don't need the overlay.

With this setup, you now have a fully functional and customizable caption viewer. It reads Deepgram's diarized WebVTT output, assigns colors to speakers, and renders the captions in sync with your video.

Next Steps → Analytics

Now that you have a working player, you can extend it to:

  • Log per-speaker talk time and interruption counts.

  • Jump to a cue when a keyword search hits.

  • Export usage metrics to your QA dashboard.

We'll outline one of those pipelines in the conclusion.

Conclusion: Transcription Is Just the Beginning

Treating speech as structured data, not just words, unlocks real product value. With Deepgram you get timestamped transcripts, utterances, and speaker diarization out of the box, so you can answer who said what, and when, and wire that context into your UI and pipelines.

Key takeaways

  • Utterances give you segment-level meaning for navigation, search, and analytics.

  • Timestamps align text to media and actions (seek, highlight, chaptering).

  • Speaker diarization turns multi-party audio into attributed turns for QA and insights.

Go Further With Built-in Options

  • Readability and normalization: smart_format.

  • Languages: language or detect_language (pick the right model for your domain).

  • Safety & compliance: profanity_filter, redact (e.g., PII patterns).

Production Checklist

  • Choose single-channel + diarization vs multi-channel input based on your source.

  • Smooth the stream: merge ultra-short back-channels and handle overlaps before exporting captions.

  • Log and monitor: talk-time, interruptions, silence, WPM; add alerts for policy/PII terms.

  • Evaluate regularly: sample reviews plus diarization metrics (e.g., DER) on a held-out set.

  • Store a normalized turns schema (start, end, speaker, text, words[], channel) for reuse across search, QA, and analytics.

What to Build Next?

šŸ” Sign up at Deepgram → Get $200 in free credits to jumpstart your voice AI apps:

  • Searchable player: index utterances, jump to hits, highlight speakers (see an example).

  • Meeting assistant: per-speaker summaries + action items with timecodes.

  • Compliance/QA: policy coverage checks and redaction audits with review queues.

  • Analytics dashboard: talk-time by speaker, interruption rate, average pause, topic spans.

And when you're ready to scale, plug the same turns data into your retrieval or analytics stack. You'll ship features faster, and your users will get transcripts that behave like conversations, not walls of text.
