Article · AI Engineering & Research · Sep 22, 2025
15 min read

Working with Timestamps, Utterances, and Speaker Diarization in Deepgram

Speech-to-text (STT) is the baseline; the value is in the metadata. Deepgram returns timestamped transcripts with utterances and speaker diarization so you can reliably answer who said what, and when. Follow this in-depth tutorial to make the most of your STT models!
By Stephen Oladele

šŸŽ„ Video Overview

⩠TL;DR

  • STT is the baseline; the value is in the metadata. Deepgram returns timestamped transcripts with utterances and speaker diarization so you can reliably answer who said what, and when.

  • Turn on utterances=true, diarize=true, and smart_format=true in Deepgram's STT API to get structured, context-aware results out of the box.

  • With this, you can power search, segmentation, QA checks, and analytics (talk time, interruptions, and silence).

  • We'll normalize Deepgram's response into a reusable "turns" schema and use it to build a simple speaker-aware subtitles demo.

  • Follow the guide to learn more and build voice AI applications on Deepgram’s feature-rich STT API.

Introduction

Speech-to-text (STT) isn't just about words; it's about context. In real conversations, who spoke, when they spoke, and how long they spoke often matter as much as what they said. A flat, speakerless transcript throws away that structure.

Deepgram's speaker diarization and utterance segmentation give you timestamped transcripts that preserve conversational context:

  • Utterances: semantically coherent segments with start/end times.

  • Speaker labels: identify who spoke in each segment.

  • Word-level timing (optional): fine-grained alignment when you need it.

In this guide, you'll:

  1. Call Deepgram with utterances, diarize, and smart_format enabled.

  2. Understand the response (words → utterances → speakers).

  3. Normalize it into a "turns" data model you can store, index, and analyze.

  4. Build a speaker-aware subtitles demo application and outline downstream recipes for search, QA, and analytics.

We'll also cover production realities such as overlapping speech, short back-channels ("mm-hmm"), and multi-channel recordings, so your pipeline is robust, not just a demo.

šŸ‘€ Preview: here's how the same clip becomes useful data. Speaker A (00:04–00:12) asks a question; Speaker B (00:12–00:18) answers; you can jump playback by utterance, compute talk-time, and flag policy terms.

Speech Is Made of Contextual Segments, and Text Should Be Too!

Human speech isn't a neat paragraph; it comes in segments like questions, answers, interjections, and hand-offs. To keep that structure, your transcript needs three things:

  • Utterances: coherent segments of speech with a single intent.

  • Timestamps: start/end times for each segment (and words) so you can align UI, video, or actions.

  • Speaker diarization: labels that tell you who spoke in each segment.

Together, these turn a wall of text into a timestamped transcript you can navigate, search, QA, and analyze.

How Deepgram exposes this (out of the box):

  • Enable utterances=true to receive results.utterances[] with { start, end, transcript, confidence, words[] }.

  • Enable diarize=true to add speaker labels per utterance (and per word).

  • Enable smart_format=true for readable punctuation/casing and normalized entities (numbers, dates, etc.).

  • If your source is multi-channel (e.g., agent/customer on separate channels), you can process per channel; if it's single-channel with multiple speakers, use speaker diarization.

What to use when:

  • Subtitles and media controls: utterance timestamps for segment jumps; word timestamps for precise cueing.

  • Search and retrieval: index utterance text with (start, end, speaker) so results jump to the right moment.

  • QA and compliance: compute talk time, interruptions, and policy coverage from utterance boundaries.

  • Analytics: aggregate per-speaker metrics (duration, WPM, silence between turns).

We'll normalize Deepgram's response into a reusable turns schema { id, start, end, speaker, text, words[], channel } and use it to power subtitles, search, QA checks, and dashboards.
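
For concreteness, here's a minimal sketch of that normalization in Python. The field names follow the schema above; the Turn class and the turns_from_utterances helper are illustrative names of our own, not part of Deepgram's SDK.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One attributed, timestamped segment of speech (a 'turn')."""
    id: str
    start: float          # seconds from the start of the audio
    end: float            # seconds from the start of the audio
    speaker: int          # diarization label (0, 1, ...)
    text: str             # smart-formatted transcript for the segment
    words: list = field(default_factory=list)  # optional word-level timings
    channel: int = 0      # audio channel, for multi-channel sources

def turns_from_utterances(utterances):
    """Map Deepgram's results.utterances[] (as dicts) into Turn records."""
    return [
        Turn(
            id=f"turn-{i}",
            start=u["start"],
            end=u["end"],
            speaker=u.get("speaker", 0),
            text=u["transcript"],
            words=u.get("words", []),
            channel=u.get("channel", 0),
        )
        for i, u in enumerate(utterances)
    ]
```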

In the next section, you'll call the STT API with utterances, diarize, and smart_format enabled, inspect the JSON, and map it into the turns schema, covering edge cases like short back-channels and overlapping speech along the way.

Working with Utterances, Timestamps, and Diarization in Deepgram

With Nova-2 and Nova-3, Deepgram can return timestamped transcripts complete with utterances and speaker diarization, all in one call. To enable these features, you just pass the corresponding parameters in your API request.

Here's a simple curl example that transcribes a prerecorded audio file using the nova-3 model with utterances, speaker diarization, and smart formatting enabled:
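
Something like this (a sketch assuming a local audio.wav and your key exported as DEEPGRAM_API_KEY; adjust the file path and Content-Type to match your audio):

```bash
curl --request POST \
  --header "Authorization: Token $DEEPGRAM_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary @audio.wav \
  "https://api.deepgram.com/v1/listen?model=nova-3&utterances=true&diarize=true&smart_format=true"
```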

What those flags do
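
  • utterances=true: groups words into results.utterances[] segments, each with start, end, transcript, confidence, and words[].

  • diarize=true: adds a speaker label to each utterance (and to each word).

  • smart_format=true: applies punctuation, casing, and entity formatting (numbers, dates, etc.) for readable output.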

šŸ’” Tip: Dual-channel call recordings? Skip diarize and map by channel instead.

Here's a single utterance (truncated):
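
(The exact fields can vary with model and options; the values below are illustrative rather than real output.)

```json
{
  "start": 4.08,
  "end": 11.89,
  "confidence": 0.97,
  "channel": 0,
  "speaker": 0,
  "transcript": "Thanks for calling. How can I help you today?",
  "words": [
    {
      "word": "thanks",
      "start": 4.08,
      "end": 4.31,
      "confidence": 0.99,
      "speaker": 0,
      "punctuated_word": "Thanks"
    }
  ]
}
```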

Python SDK (prerecorded)
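
A minimal sketch with the Deepgram Python SDK (v3+; assumes DEEPGRAM_API_KEY is set in your environment and a local audio.wav. Method names can differ slightly between SDK versions, so check the SDK docs for yours):

```python
from deepgram import DeepgramClient, PrerecordedOptions

# Reads DEEPGRAM_API_KEY from the environment
deepgram = DeepgramClient()

options = PrerecordedOptions(
    model="nova-3",
    utterances=True,
    diarize=True,
    smart_format=True,
)

with open("audio.wav", "rb") as audio:
    payload = {"buffer": audio.read()}

response = deepgram.listen.rest.v("1").transcribe_file(payload, options)

# Each utterance carries timing, a speaker label, and the formatted text
for utt in response.results.utterances:
    print(f"[{utt.start:7.2f}-{utt.end:7.2f}] Speaker {utt.speaker}: {utt.transcript}")
```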

Running this prints a fully enriched transcript, which is the foundation for the speaker-aware subtitles you'll build next. In production you'd:

  1. Store the utterances[] array as a "turns" table.

  2. Convert it to WebVTT/SRT for media players.

  3. Index (start, end, speaker, text) for search and QA analytics.

šŸ‘‰ Heads-up on accuracy: Short back-channels ("uh-huh") or overlapping speech can yield lower diarization confidence. We'll show simple post-processing fixes later.

šŸ’” Tips:

  • Want fewer micro-segments? Tune utterance boundary sensitivity with utt_split (e.g., utt_split=1.0).

  • See model options for nova-3 and variants.

Adding Captions to Audio with Deepgram

Deepgram's timestamped transcripts (with utterances=true and optional diarize=true) map cleanly to standard caption formats that browsers and media players understand: SRT (SubRip Text) and WebVTT (Web Video Text Tracks).

Rather than hand-coding time math and line-wrapping, you can use Deepgram's caption helpers to emit valid files in one step.

(If you just need a file on disk, check out the one-liner helper in deepgram-captions.)

Option A: Use the Caption Helpers

Turning the timestamped transcript JSON into usable captions is a two-step job with deepgram-captions:

  1. Get an enriched transcript (utterances=true&diarize=true&smart_format=true).

  2. Serialize each utterance into SRT or WebVTT cues.

This handles the WEBVTT header, millisecond separators, line breaks, and sane cue splitting.
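
A sketch of those two steps with the deepgram-captions package (here dg_response is the JSON dict returned by the transcription call above; check the package README for the exact API in your version):

```python
from deepgram_captions import DeepgramConverter, srt, webvtt

# dg_response: the enriched JSON (utterances + diarization) from the earlier request
converter = DeepgramConverter(dg_response)

with open("captions.srt", "w") as f:
    f.write(srt(converter))

with open("captions.vtt", "w") as f:
    f.write(webvtt(converter))
```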

Option B: DIY formatter (if you need full control)

Now that you've used the helper, if you need more control over subtitle generation you can write a Python script to convert Deepgram transcripts into either SRT or WebVTT. Each utterance from the transcript is matched to a numbered time block for SRT or a cue for WebVTT.

First, make sure your DEEPGRAM_API_KEY is set as an environment variable. Then import the required modules:
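
For example (reusing the SDK client from earlier; the assertion is just a convenience check):

```python
import os

from deepgram import DeepgramClient, PrerecordedOptions

# Fail early with a clear message if the API key isn't set
assert os.environ.get("DEEPGRAM_API_KEY"), "Set DEEPGRAM_API_KEY before running"
```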

Set the path to your audio or video file:
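
For instance (the filename here is just a placeholder):

```python
# Audio or video file to caption
AUDIO_PATH = "interview.wav"
```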

Next, create a function to format Deepgram's timestamps. Since timestamps are returned in seconds, you need to convert them into the correct format for each subtitle type.
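
One way to write it; SRT separates milliseconds with a comma, WebVTT with a period:

```python
def format_timestamp(seconds: float, fmt: str = "vtt") -> str:
    """Convert seconds (e.g. 63.25) to 'HH:MM:SS,mmm' for SRT or 'HH:MM:SS.mmm' for WebVTT."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"
```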

Write a function to convert utterances into WebVTT cues. Each cue contains a start and end time, the transcript text, and a speaker label.
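
A sketch of that function, reusing format_timestamp from above (it expects utterances as plain dicts):

```python
def to_webvtt(utterances) -> str:
    """Build a WebVTT document from Deepgram utterances (start, end, speaker, transcript)."""
    lines = ["WEBVTT", ""]
    for utt in utterances:
        start = format_timestamp(utt["start"], "vtt")
        end = format_timestamp(utt["end"], "vtt")
        speaker = utt.get("speaker", 0)
        lines.append(f"{start} --> {end}")
        # <v Speaker N> is WebVTT's voice tag, which players can style per speaker
        lines.append(f"<v Speaker {speaker}>{utt['transcript']}")
        lines.append("")
    return "\n".join(lines)
```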

You'll also write a similar function to generate an SRT file. Each block will be numbered and contain a time range and the spoken text.
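
For example:

```python
def to_srt(utterances) -> str:
    """Build an SRT document: numbered blocks with a time range and the spoken text."""
    blocks = []
    for i, utt in enumerate(utterances, start=1):
        start = format_timestamp(utt["start"], "srt")
        end = format_timestamp(utt["end"], "srt")
        speaker = utt.get("speaker", 0)
        blocks.append(f"{i}\n{start} --> {end}\nSpeaker {speaker}: {utt['transcript']}\n")
    return "\n".join(blocks)
```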

With your custom helper functions ready, write the main function that uses Deepgram to transcribe an audio file, then convert the results into the format you choose.
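
A sketch of that main function (it assumes the v3+ SDK, where responses expose to_dict(); adjust the attribute access if your SDK version differs):

```python
def main(output_format: str = "vtt") -> None:
    deepgram = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment

    options = PrerecordedOptions(
        model="nova-3", utterances=True, diarize=True, smart_format=True
    )
    with open(AUDIO_PATH, "rb") as audio:
        response = deepgram.listen.rest.v("1").transcribe_file(
            {"buffer": audio.read()}, options
        )

    # Work with the plain-dict form of the response
    utterances = response.to_dict()["results"]["utterances"]

    if output_format == "srt":
        path, content = "captions.srt", to_srt(utterances)
    else:
        path, content = "captions.vtt", to_webvtt(utterances)

    with open(path, "w") as f:
        f.write(content)
    print(f"Wrote {len(utterances)} cues to {path}")

if __name__ == "__main__":
    main()
```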

And with that, you've completed one half of the project. You can get the full app.py script here: SRT and WebVTT Subtitle generator.

Production Pointers

  • Line length and duration: keep cues ≤ 42 chars and ≤ 6 s. Split long utterances if needed.

  • Overlap: if back-channels ("uh-huh") overlap, shift them ±100 ms or merge.

  • Async transcription: use transcribe_file_async + webhook for large batches.

What's Next?

Now that we can export captions, you’ll build a speaker-aware video player that:

  1. Loads the WebVTT file.

  2. Highlights the active speaker cue.

  3. Lets users click an utterance to seek.

āž”ļø Jump to the next section!

Building a Speaker-Aware Captioning Application

In the previous section, you generated subtitle files, such as WebVTT, using Deepgram's transcription output. These files contain both the text and metadata, such as timestamps and speaker labels.

Once you have a subtitle file, you can use it directly in popular video players like VLC or build a custom HTML5 video player that colors each speaker's lines in real time.

Why Not the Default <track>?

You could simply attach the VTT file:
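
For example (the filenames are placeholders):

```html
<video controls>
  <source src="interview.mp4" type="video/mp4">
  <track kind="captions" src="captions.vtt" srclang="en" label="English" default>
</video>
```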

Browsers will render captions automatically, and you can even style them per speaker with ::cue(v[voice="Speaker 1"]) { color: coral; }.

However, if you need full control (custom fonts, live transcript pane, analytics hooks), you'll want your own overlay. You'll build a simple custom web video player that:

  • Loads a video and a corresponding WebVTT file

  • Parses the cues and displays them as styled captions

  • Assigns each speaker a unique color

  • Updates captions live as the video plays

HTML: Basic Structure

We'll start by creating the HTML layout for our player (see repo).

This layout includes two file inputs for uploading a video and a .vtt file, a video player element, and a box where you will display the active caption.
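
A minimal sketch of that layout (the element IDs are placeholders the script below refers to; the full markup lives in the repo):

```html
<input type="file" id="videoInput" accept="video/*">
<input type="file" id="vttInput" accept=".vtt">

<video id="player" controls></video>

<!-- The currently active caption is rendered here, colored per speaker -->
<div id="captionBox"></div>
```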

JavaScript: Core Logic

Let's move to JavaScript (see repo). This is where we:

  • Load and parse the VTT file

  • Track video time updates

  • Display cues with speaker-specific colors

Each cue corresponds to a WebVTT cue, which includes start, end, text, and speaker.
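
A sketch of that logic (element IDs match the layout above; the parser only handles the simple cues our exporter writes, and colorFor is defined in the next snippet):

```javascript
const player = document.getElementById("player");
const captionBox = document.getElementById("captionBox");
let cues = [];

// Load the selected video into the player
document.getElementById("videoInput").addEventListener("change", (e) => {
  player.src = URL.createObjectURL(e.target.files[0]);
});

// Load and parse the selected .vtt file into { start, end, speaker, text } cues
document.getElementById("vttInput").addEventListener("change", async (e) => {
  cues = parseVtt(await e.target.files[0].text());
});

// "00:00:04.080" -> seconds as a number
function toSeconds(ts) {
  const [h, m, s] = ts.split(":");
  return Number(h) * 3600 + Number(m) * 60 + Number(s.replace(",", "."));
}

function parseVtt(text) {
  const parsed = [];
  for (const block of text.split(/\n\n+/)) {
    const lines = block.trim().split("\n");
    const timing = lines.find((l) => l.includes("-->"));
    if (!timing) continue; // skips the WEBVTT header block
    const [start, end] = timing.split("-->").map((t) => toSeconds(t.trim()));
    const raw = lines.slice(lines.indexOf(timing) + 1).join(" ");
    const match = raw.match(/^<v ([^>]+)>(.*)$/); // e.g. "<v Speaker 0>Hello there"
    parsed.push({
      start,
      end,
      speaker: match ? match[1] : "Speaker",
      text: match ? match[2] : raw,
    });
  }
  return parsed;
}

// Show whichever cue covers the current playback time
player.addEventListener("timeupdate", () => {
  const cue = cues.find((c) => player.currentTime >= c.start && player.currentTime <= c.end);
  if (!cue) {
    captionBox.textContent = "";
    return;
  }
  captionBox.textContent = `${cue.speaker}: ${cue.text}`;
  captionBox.style.color = colorFor(cue.speaker); // see the next snippet
});
```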

Assign unique colors to speakers
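
One simple approach, used by the timeupdate handler above:

```javascript
// Assign each speaker label a stable, visually distinct color
const speakerColors = new Map();

function colorFor(speaker) {
  if (!speakerColors.has(speaker)) {
    const hue = (speakerColors.size * 67) % 360; // step around the color wheel
    speakerColors.set(speaker, `hsl(${hue}, 70%, 60%)`);
  }
  return speakerColors.get(speaker);
}
```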

See an example of how the interface might look:

Result: Load any video + Deepgram-generated WebVTT and watch speaker-colored captions update live.

šŸ‘‰ Heads-up: WebVTT cues shorter than ~500 ms can flash. Feel free to merge adjacent cues or tweak utt_split before exporting. For production, use the browser's native TextTrack API and ::cue CSS for simpler styling if you don't need the overlay.

With this setup, you now have a fully functional and customizable caption viewer. It reads Deepgram's diarized WebVTT output, assigns colors to speakers, and renders the captions in sync with your video.

Next Steps → Analytics

Now that you have a working player, you can extend it to:

  • Log per-speaker talk time and interruption counts.

  • Jump to a cue when a keyword search hits.

  • Export usage metrics to your QA dashboard.

We'll outline one of those pipelines in the conclusion.

Conclusion: Transcription Is Just the Beginning

Treating speech as structured data, not just words, unlocks real product value. With Deepgram you get timestamped transcripts, utterances, and speaker diarization out of the box, so you can answer who said what, and when, and wire that context into your UI and pipelines.

Key takeaways

  • Utterances give you segment-level meaning for navigation, search, and analytics.

  • Timestamps align text to media and actions (seek, highlight, chaptering).

  • Speaker diarization turns multi-party audio into attributed turns for QA and insights.

Go Further With Built-in Options

  • Readability and normalization: smart_format.

  • Languages: language or detect_language (pick the right model for your domain).

  • Safety & compliance: profanity_filter, redact (e.g., PII patterns).

Production Checklist

  • Choose single-channel + diarization vs multi-channel input based on your source.

  • Smooth the stream: merge ultra-short back-channels and handle overlaps before exporting captions.

  • Log and monitor: talk-time, interruptions, silence, WPM; add alerts for policy/PII terms.

  • Evaluate regularly: sample reviews plus diarization metrics (e.g., DER) on a held-out set.

  • Store a normalized turns schema (start, end, speaker, text, words[], channel) for reuse across search, QA, and analytics.

What to Build Next?

šŸ” Sign up at Deepgram → Get $200 in free credits to jumpstart your voice AI apps:

  • Searchable player: index utterances, jump to hits, highlight speakers (see an example).

  • Meeting assistant: per-speaker summaries + action items with timecodes.

  • Compliance/QA: policy coverage checks and redaction audits with review queues.

  • Analytics dashboard: talk-time by speaker, interruption rate, average pause, topic spans.

And when you're ready to scale, plug the same turns data into your retrieval or analytics stack. You'll ship features faster, and your users will get transcripts that behave like conversations, not walls of text.
