Working with Timestamps, Utterances, and Speaker Diarization in Deepgram

🎥 Video Overview
⏩ TL;DR
STT is the baseline; the value is in the metadata. Deepgram returns timestamped transcripts with utterances and speaker diarization so you can reliably answer who said what, and when.
Turn on utterances=true, diarize=true, and smart_format=true in Deepgram's STT API to get structured, context-aware results out of the box.
With this, you can power search, segmentation, QA checks, and analytics (talk time, interruptions, and silence).
We'll normalize Deepgram's response into a reusable "turns" schema and use it to build a simple speaker-aware subtitles demo.
Follow the guide to learn more and build voice AI applications on Deepgram's feature-rich STT API.
Introduction
Speech-to-text (STT) isn't just about words; it's about context. In real conversations, who spoke, when they spoke, and how long they spoke often matter as much as what they said. A flat, speakerless transcript throws away that structure.
Deepgram's speaker diarization and utterance segmentation give you timestamped transcripts that preserve conversational context:
Utterances: semantically coherent segments with start/end times.
Speaker labels: identify who spoke in each segment.
Word-level timing (optional): fine-grained alignment when you need it.
In this guide, you'll:
Call Deepgram with utterances, diarize, and smart_format enabled.
Understand the response (words → utterances → speakers).
Normalize it into a "turns" data model you can store, index, and analyze.
Build a speaker-aware subtitles demo application and outline downstream recipes for search, QA, and analytics.
We'll also cover production realities (overlapping speech, short back-channels like "mm-hmm", and multi-channel recordings) so your pipeline is robust, not just a demo.
Preview: here's how the same clip becomes useful data: Speaker A (00:04–00:12) asks a question; Speaker B (00:12–00:18) answers; you can jump playback by utterance, compute talk time, and flag policy terms.
Speech Is Made of Contextual Segments: Text Should Be Too!
Human speech isn't a neat paragraph; it comes in segments like questions, answers, interjections, and hand-offs. To keep that structure, your transcript needs three things:
Utterances: coherent segments of speech with a single intent.
Timestamps: start/end times for each segment (and words) so you can align UI, video, or actions.
Speaker diarization: labels that answer who spoke each segment.
Together, these turn a wall of text into a timestamped transcript you can navigate, search, QA, and analyze.
How Deepgram exposes this (out of the box):
Enable utterances=true to receive results.utterances[] with { start, end, transcript, confidence, words[] }.
Enable diarize=true to add speaker labels per utterance (and per word).
Enable smart_format=true for readable punctuation/casing and normalized entities (numbers, dates, etc.).
If your source is multi-channel (e.g., agent/customer on separate channels), you can process per channel; if it's single-channel with multiple speakers, use speaker diarization.
What to use when:
Subtitles and media controls: utterance timestamps for segment jumps; word timestamps for precise cueing.
Search and retrieval: index utterance text with (start, end, speaker) so results jump to the right moment.
QA and compliance: compute talk time, interruptions, and policy coverage from utterance boundaries.
Analytics: aggregate per-speaker metrics (duration, WPM, silence between turns).
We'll normalize Deepgram's response into a reusable turns schema { id, start, end, speaker, text, words[], channel } and use it to power subtitles, search, QA checks, and dashboards.
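For reference, here's a minimal sketch of that schema as Python dataclasses (names like Turn and WordTiming are illustrative, not part of Deepgram's SDK):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordTiming:
    word: str
    start: float        # seconds from the beginning of the audio
    end: float
    confidence: float


@dataclass
class Turn:
    id: str
    start: float                    # utterance start, in seconds
    end: float                      # utterance end, in seconds
    speaker: int                    # diarization label: 0, 1, 2, ...
    text: str                       # the utterance transcript
    words: List[WordTiming] = field(default_factory=list)
    channel: Optional[int] = None   # set when the source audio is multi-channel
```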
In the next section, you'll call the STT API with utterances, diarize, and smart_format enabled, inspect the JSON, and map it into the turns schema, covering edge cases like short back-channels and overlapping speech along the way.
Working with Utterances, Timestamps, and Diarization in Deepgram
With Nova-2 and Nova-3, Deepgram can return timestamped transcripts complete with utterances and speaker diarization, all in one call. To enable these features, pass the corresponding parameters in your API request.
Here's a simple curl example that transcribes a prerecorded audio file using the nova-3 model with utterances, speaker diarization, and smart formatting turned on:
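A minimal sketch, assuming a local audio.wav and a DEEPGRAM_API_KEY environment variable (adjust the file path and Content-Type to your media):

```bash
curl --request POST \
  --header "Authorization: Token $DEEPGRAM_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary @audio.wav \
  "https://api.deepgram.com/v1/listen?model=nova-3&utterances=true&diarize=true&smart_format=true"
```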
What those flags do:
utterances=true groups words into results.utterances[] segments with start/end times and confidence.
diarize=true attaches a speaker label to each utterance (and each word).
smart_format=true applies punctuation, casing, and entity formatting (numbers, dates, and so on).
💡 Tip: Dual-channel call recordings? Skip diarize and map by channel instead.
Here's a single utterance (truncated):
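The shape looks roughly like this (field names follow the response structure described above; the values are invented for illustration, and the words[] array is trimmed to one entry):

```json
{
  "start": 4.08,
  "end": 11.92,
  "confidence": 0.97,
  "channel": 0,
  "speaker": 0,
  "transcript": "Thanks for calling. How can I help you today?",
  "words": [
    {
      "word": "thanks",
      "start": 4.08,
      "end": 4.31,
      "confidence": 0.99,
      "speaker": 0,
      "punctuated_word": "Thanks"
    }
  ]
}
```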
Python SDK (prerecorded)
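A minimal sketch using the v3 Python SDK (assuming deepgram-sdk >= 3; in older releases the REST client is exposed as deepgram.listen.prerecorded rather than deepgram.listen.rest):

```python
import os

from deepgram import DeepgramClient, FileSource, PrerecordedOptions

# Build a client from the DEEPGRAM_API_KEY environment variable
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

# Read the local file into a buffer payload
with open("audio.wav", "rb") as audio:
    payload: FileSource = {"buffer": audio.read()}

options = PrerecordedOptions(
    model="nova-3",
    utterances=True,
    diarize=True,
    smart_format=True,
)

response = deepgram.listen.rest.v("1").transcribe_file(payload, options)

# Print each utterance as "[speaker] start-end  text"
for utt in response.results.utterances:
    print(f"[{utt.speaker}] {utt.start:.2f}-{utt.end:.2f}  {utt.transcript}")
```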
Running this prints a fully enriched transcript, which is the foundation for the speaker-aware subtitles you'll build next. In production you'd:
Store the utterances[] array as a "turns" table.
Convert it to WebVTT/SRT for media players.
Index (start, end, speaker, text) for search and QA analytics.
Heads-up on accuracy: Short back-channels ("uh-huh") or overlapping speech can yield lower diarization confidence. We'll show simple post-processing fixes later.
💡 Tips:
Want fewer micro-segments? Tune utterance boundary sensitivity with utt_split (e.g., utt_split=1.0).
See model options for nova-3 and variants.
Adding Captions to Audio with Deepgram
Deepgram's timestamped transcripts (with utterances=true and optional diarize=true) map cleanly to standard caption formats that browsers and media players understand: SRT (SubRip Text) and WebVTT (Web Video Text Tracks).
Rather than hand-coding time math and line-wrapping, you can use Deepgram's caption helpers to emit valid files in one step.
(If you just need a file on disk, check out the one-liner helper in deepgram-captions.)
Option A: Use the Caption Helpers
Turning the timestamped transcript JSON into usable captions is a two-step job with deepgram-captions:
Get an enriched transcript (utterances=true&diarize=true&smart_format=true).
Serialize each utterance into SRT or WebVTT cues.
This handles the WEBVTT header, millisecond separators, line breaks, and sane cue splitting.
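A minimal sketch with the deepgram-captions package (assuming its DeepgramConverter, srt, and webvtt helpers, and that dg_response is the parsed JSON dict returned by the transcription call):

```python
from deepgram_captions import DeepgramConverter, srt, webvtt

converter = DeepgramConverter(dg_response)  # dg_response: raw transcription response dict

# Write WebVTT cues
with open("captions.vtt", "w") as f:
    f.write(webvtt(converter))

# Write SRT blocks
with open("captions.srt", "w") as f:
    f.write(srt(converter))
```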
Option B: DIY formatter (if you need full control)
Now that you have used the helper, if you need more control over subtitle generation, you can write a Python script that converts Deepgram transcripts into either SRT or WebVTT. Each utterance from the transcript is mapped to a timed block for SRT or a cue for WebVTT.
First, make sure your DEEPGRAM_API_KEY is set as an environment variable. Then import the required modules:
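For example (assuming the v3 Python SDK):

```python
import os

from deepgram import DeepgramClient, FileSource, PrerecordedOptions
```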
Set the path to your audio or video file:
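The filename below is a placeholder; point it at your own media:

```python
AUDIO_FILE = "interview.mp4"  # audio or video file to transcribe
```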
Next, create a function to format Deepgram's timestamps. Since timestamps are returned in seconds, you need to convert them into the correct format for each subtitle type.
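One way to write it (SRT separates milliseconds with a comma, WebVTT with a period):

```python
def format_timestamp(seconds: float, fmt: str = "vtt") -> str:
    """Convert seconds (e.g. 12.345) to 00:00:12.345 (WebVTT) or 00:00:12,345 (SRT)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02}:{minutes:02}:{secs:02}{separator}{millis:03}"
```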
Write a function to convert utterances into WebVTT cues. Each cue contains a start and end time, and a speaker label.
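A sketch that emits one cue per utterance and inlines the speaker as a WebVTT voice tag (so players, and the custom viewer later in this guide, can tell speakers apart):

```python
def to_webvtt(utterances) -> str:
    """Serialize Deepgram utterances into a WebVTT document."""
    lines = ["WEBVTT", ""]
    for utt in utterances:
        start = format_timestamp(utt["start"], "vtt")
        end = format_timestamp(utt["end"], "vtt")
        lines.append(f"{start} --> {end}")
        # <v Speaker N> is the standard WebVTT voice tag
        lines.append(f"<v Speaker {utt['speaker']}>{utt['transcript']}")
        lines.append("")
    return "\n".join(lines)
```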
You'll also write a similar function to generate an SRT file. Each block will be numbered and contain a time range and the spoken text.
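And a matching SRT version, with numbered blocks and a "Speaker N:" prefix on each line:

```python
def to_srt(utterances) -> str:
    """Serialize Deepgram utterances into an SRT document."""
    blocks = []
    for i, utt in enumerate(utterances, start=1):
        start = format_timestamp(utt["start"], "srt")
        end = format_timestamp(utt["end"], "srt")
        text = f"Speaker {utt['speaker']}: {utt['transcript']}"
        blocks.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```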
With your custom helper functions ready, write the main function that uses Deepgram to transcribe an audio file, then convert the results into the format you choose.
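A sketch of that main function, tying the SDK call to the formatters above (the client and option names follow the v3 Python SDK; to_dict() converts the response object to plain dicts and may differ in other SDK versions):

```python
def main(output_format: str = "vtt") -> None:
    deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

    with open(AUDIO_FILE, "rb") as audio:
        payload: FileSource = {"buffer": audio.read()}

    options = PrerecordedOptions(
        model="nova-3",
        utterances=True,
        diarize=True,
        smart_format=True,
    )

    response = deepgram.listen.rest.v("1").transcribe_file(payload, options)
    utterances = response.to_dict()["results"]["utterances"]

    if output_format == "srt":
        captions, path = to_srt(utterances), "captions.srt"
    else:
        captions, path = to_webvtt(utterances), "captions.vtt"

    with open(path, "w") as f:
        f.write(captions)
    print(f"Wrote {len(utterances)} cues to {path}")


if __name__ == "__main__":
    main()
```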
And with that, you've completed one half of the project. You can get the full app.py script here: SRT and WebVTT Subtitle generator.
Production Pointers
Line length and duration: keep cues ≤ 42 characters and ≤ 6 seconds. Split long utterances if needed.
Overlap: if back-channels ("uh-huh") overlap, shift them ±100 ms or merge.
Async transcription: use transcribe_file_async + webhook for large batches.
What's Next?
Now that you can export captions, you'll build a speaker-aware video player that:
Loads the WebVTT file.
Highlights the active speaker cue.
Lets users click an utterance to seek.
➡️ Jump to the next section!
Building a Speaker-Aware Captioning Application
In the previous section, you generated subtitle files, such as WebVTT, using Deepgram's transcription output. These files contain both the text and metadata, such as timestamps and speaker labels.
Once you have a subtitle file, you can use it directly in popular video players like VLC or build a custom HTML5 video player that colors each speaker's lines in real time.
Why Not the Default <track>?
You could simply attach the VTT file:
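Something like this (the file names are placeholders):

```html
<video controls width="720">
  <source src="interview.mp4" type="video/mp4" />
  <track src="captions.vtt" kind="captions" srclang="en" label="English" default />
</video>
```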
Browsers will render captions automatically, and you can even style them per speaker with ::cue(v[voice="Speaker 1"]) { color: coral; }.
However, if you need full control (custom fonts, a live transcript pane, analytics hooks), you'll want your own overlay. You'll build a simple custom web video player that:
Loads a video and a corresponding WebVTT file
Parses the cues and displays them as styled captions
Assigns each speaker a unique color
Updates captions live as the video plays


HTML: Basic Structure
We'll start by creating the HTML layout for our player (see repo).
This layout includes two file inputs for uploading a video and a .vtt file, a video player element, and a box where you will display the active caption.
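A sketch of that layout (the element ids are assumptions; keep them in sync with the JavaScript in the next section):

```html
<body>
  <h1>Speaker-Aware Caption Player</h1>

  <!-- Pick a local video and its Deepgram-generated .vtt file -->
  <input type="file" id="videoInput" accept="video/*" />
  <input type="file" id="vttInput" accept=".vtt" />

  <video id="player" controls width="720"></video>

  <!-- The active caption is rendered here, colored by speaker -->
  <div id="captionBox"></div>

  <script src="player.js"></script>
</body>
```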
JavaScript: Core Logic
Let's move to JavaScript (see repo). This is where we:
Load and parse the VTT file
Track video time updates
Display cues with speaker-specific colors
Each parsed entry corresponds to a WebVTT cue, with start, end, text, and speaker fields.
You'll also assign each speaker a unique color, as shown in the sketch below.
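A condensed sketch of that logic. It assumes the videoInput, vttInput, player, and captionBox ids from the layout above and cues written as <v Speaker N>text (as produced by the DIY formatter earlier); the real script in the repo may differ:

```javascript
const player = document.getElementById("player");
const captionBox = document.getElementById("captionBox");
const palette = ["coral", "deepskyblue", "mediumseagreen", "gold"];
let cues = [];

// Parse "HH:MM:SS.mmm" into seconds
const toSeconds = (t) => {
  const [h, m, s] = t.split(":");
  return Number(h) * 3600 + Number(m) * 60 + Number(s);
};

// Load and parse the WebVTT file into { start, end, speaker, text } cues
document.getElementById("vttInput").addEventListener("change", async (e) => {
  const text = await e.target.files[0].text();
  cues = [...text.matchAll(
    /(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\r?\n<v Speaker (\d+)>(.*)/g
  )].map((m) => ({
    start: toSeconds(m[1]),
    end: toSeconds(m[2]),
    speaker: Number(m[3]),
    text: m[4],
  }));
});

// Load the selected video file into the player
document.getElementById("videoInput").addEventListener("change", (e) => {
  player.src = URL.createObjectURL(e.target.files[0]);
});

// On every time update, show the active cue in its speaker's color
player.addEventListener("timeupdate", () => {
  const cue = cues.find((c) => player.currentTime >= c.start && player.currentTime <= c.end);
  if (!cue) {
    captionBox.textContent = "";
    return;
  }
  captionBox.textContent = `Speaker ${cue.speaker}: ${cue.text}`;
  captionBox.style.color = palette[cue.speaker % palette.length];
});
```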
Here's an example of what the interface might look like:
Result: Load any video + Deepgram-generated WebVTT and watch speaker-colored captions update live.
Heads-up: WebVTT cues shorter than ~500 ms can flash. Feel free to merge adjacent cues or tweak utt_split before exporting. For production, use the browser's native TextTrack API and ::cue CSS for simpler styling if you don't need the overlay.
With this setup, you now have a fully functional and customizable caption viewer. It reads Deepgramās diarized WebVTT output, assigns colors to speakers, and renders them in sync with your video.
Next Steps: Analytics
Now that you have a working player, you can extend it to:
Log per-speaker talk time and interruption counts.
Jump to a cue when a keyword search hits.
Export usage metrics to your QA dashboard.
We'll outline one of those pipelines in the conclusion.
Conclusion: Transcription Is Just the Beginning
Treating speech as structured data, not just words, unlocks real product value. With Deepgram you get timestamped transcripts, utterances, and speaker diarization out of the box, so you can answer who said what and when, and wire that context into your UI and pipelines.
Key takeaways
Utterances give you segment-level meaning for navigation, search, and analytics.
Timestamps align text to media and actions (seek, highlight, chaptering).
Speaker diarization turns multi-party audio into attributed turns for QA and insights.
Go Further With Built-In Options
Readability and normalization: smart_format.
Languages: language or detect_language (pick the right model for your domain).
Safety & compliance: profanity_filter, redact (e.g., PII patterns).
Production Checklist
Choose single-channel + diarization vs multi-channel input based on your source.
Smooth the stream: merge ultra-short back-channels and handle overlaps before exporting captions.
Log and monitor: talk-time, interruptions, silence, WPM; add alerts for policy/PII terms.
Evaluate regularly: sample reviews plus diarization metrics (e.g., DER) on a held-out set.
Store a normalized turns schema (start, end, speaker, text, words[], channel) for reuse across search, QA, and analytics.
What to Build Next?
Sign up at Deepgram to get $200 in free credits and jumpstart your voice AI apps:
Searchable player: index utterances, jump to hits, highlight speakers (see an example).
Meeting assistant: per-speaker summaries + action items with timecodes.
Compliance/QA: policy coverage checks and redaction audits with review queues.
Analytics dashboard: talk-time by speaker, interruption rate, average pause, topic spans.
And when you're ready to scale, plug the same turns data into your retrieval or analytics stack. You'll ship features faster, and your users will get transcripts that behave like conversations, not walls of text.
Other Resources
Read the API docs: Deepgram STT endpoints.
Book a 30-min consult: need help scaling? Our engineers will review your stack and share best practices.
Check out the Deepgram community to join other voice AI app builders.