Providing captions for audio and video isn't just a nice-to-have - it's critical for accessibility. While this isn't specifically an accessibility post, I wanted to start by sharing Microsoft's Inclusive Toolkit. Something I hadn't considered before reading this was the impact of situational limitations. To learn more, jump to Section 3 of the toolkit - "Solve for one, extend to many". Having a young (read "loud") child, I've become even more aware of where captions are available, and if they aren't, I simply can't watch something with her around.
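There are two common and similar caption formats we are going to generate today - WebVTT and SRT. A WebVTT file starts with a WEBVTT header line, followed by one block per caption: an optional identifier, a time range written as start --> end, and the caption text.
An SRT file has no header line. Each block starts with a sequential number, followed by the time range and the caption text.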
Both are very similar in their basic forms, except for the millisecond separator, which is a period (.) in WebVTT and a comma (,) in SRT. In this post, we will generate them manually from a Deepgram transcription result to see the technique, and then use the brand new Node.js SDK methods (available from v1.1.0) to make it even easier.
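To make the separator difference concrete, here is a minimal sketch that turns a WebVTT-style timestamp into its SRT equivalent. The function name webVttToSrtTimestamp is made up for illustration, and the sample cues in the comments are invented rather than taken from a real transcript.

```javascript
// A WebVTT cue (after the "WEBVTT" header line) might look like:
//   00:00:00.000 --> 00:00:02.500
//   - Hello there
// The same cue in SRT carries a sequence number and comma separators:
//   1
//   00:00:00,000 --> 00:00:02,500
//   Hello there

// Hypothetical helper: swap the millisecond separator in one timestamp.
function webVttToSrtTimestamp(timestamp) {
  // Only the final "." (the one before the three millisecond digits) changes.
  return timestamp.replace(/\.(\d{3})$/, ',$1')
}

console.log(webVttToSrtTimestamp('00:00:02.500')) // 00:00:02,500
```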
Before We Start
You will need:
Node.js installed on your machine - download it here.
A Deepgram API Key - get one here.
A hosted audio file URL to transcribe - you can use https://static.deepgram.com/examples/deep-learning-podcast-clip.wav if you don't have one.
Create a new directory and navigate to it with your terminal. Run npm init -y to create a package.json file and then install the Deepgram Node.js SDK with npm install @deepgram/sdk.
Set Up Dependencies
Create an index.js file, open it in your code editor, and require then initialize the dependencies:
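As a rough sketch of that setup, assuming the v1.x SDK (where the package exports a Deepgram class that is constructed with your API key), something like the following should work. The createDeepgramClient helper and the DEEPGRAM_API_KEY environment variable name are our own choices for illustration, not part of the SDK.

```javascript
const fs = require('fs') // built into Node.js; used later to write the caption file

// Hypothetical helper that builds a v1-style Deepgram client.
// The SDK is required inside the function so the check on the key runs first.
function createDeepgramClient(apiKey) {
  if (!apiKey) {
    throw new Error('A Deepgram API key is required')
  }
  const { Deepgram } = require('@deepgram/sdk')
  return new Deepgram(apiKey)
}

// Usage (assumes the key is stored in an environment variable):
// const deepgram = createDeepgramClient(process.env.DEEPGRAM_API_KEY)
```

Keeping the key out of the source file means index.js can be committed without leaking credentials.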
Get Transcript
To be given timestamps of phrases to include in our caption files, you need to ask Deepgram to include utterances (a chain of words or, more simply, a phrase).
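Here is a minimal sketch of that request, again assuming the v1.x SDK, where transcription.preRecorded() takes a source object and an options object and returns a promise. The transcribeWithUtterances name is ours, and punctuate is an optional extra we have switched on for nicer caption text.

```javascript
// `deepgram` is assumed to be an initialized v1.x SDK client.
function transcribeWithUtterances(deepgram, url) {
  return deepgram.transcription.preRecorded(
    { url },
    // utterances: true asks for phrase-level results with start and end times
    { punctuate: true, utterances: true }
  )
}

// Usage:
// transcribeWithUtterances(deepgram, 'https://static.deepgram.com/examples/deep-learning-podcast-clip.wav')
//   .then((response) => {
//     // response.results.utterances is an array of { start, end, transcript, ... }
//   })
//   .catch((error) => console.error(error))
```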
Create a Write Stream
Once you open a writable stream, you can insert text directly into your file. When you open it, pass in the 'a' (append) flag, and any time you write data to the stream, it will be appended to the end of the file. Inside of the .then() block:
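A sketch of opening such a stream with Node's built-in fs module might look like this. The openCaptionStream helper and the output.vtt file name are placeholders of our own.

```javascript
const fs = require('fs')

// Hypothetical helper: open a file for appending.
// With flags: 'a', every write lands at the end of the file,
// and the file is created if it does not exist yet.
function openCaptionStream(filePath) {
  return fs.createWriteStream(filePath, { flags: 'a' })
}

// Usage, inside the .then() block:
// const stream = openCaptionStream('output.vtt')
// stream.write('WEBVTT\n\n')
```

One thing to watch for: because the stream appends, running the script twice against the same file gives you two copies of the captions, so delete the old file between runs.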
Write Captions
The WebVTT and SRT formats are very similar, and each requires a block of text per utterance.
WebVTT
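Deepgram provides seconds back as a number (15.4 means 15.4 seconds), but both formats require times as HH:MM:SS plus milliseconds. Multiplying the seconds by 1000, passing the result to new Date(), and taking the time portion at the end of .toISOString() will achieve this for us, provided the audio is shorter than 24 hours.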
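Put together, the manual approach can be sketched as below. It assumes each utterance has numeric start and end fields and a transcript string. The formatTimestamp and buildWebVTT names are ours, and the numbered identifier on each block is optional in WebVTT (we keep it so the output mirrors SRT).

```javascript
// Convert seconds (e.g. 15.4) into "HH:MM:SS.mmm".
// Only valid for times under 24 hours, since it reads the
// time-of-day part of an ISO date string.
function formatTimestamp(seconds) {
  // Math.round avoids losing a millisecond to floating point error.
  return new Date(Math.round(seconds * 1000)).toISOString().slice(11, 23)
}

// Build the full WebVTT text from an array of utterances.
function buildWebVTT(utterances) {
  let output = 'WEBVTT\n\n'
  utterances.forEach((utterance, index) => {
    const start = formatTimestamp(utterance.start)
    const end = formatTimestamp(utterance.end)
    output += `${index + 1}\n${start} --> ${end}\n- ${utterance.transcript}\n\n`
  })
  return output
}

console.log(formatTimestamp(15.4)) // 00:00:15.400

// Usage, with the stream opened earlier:
// stream.write(buildWebVTT(response.results.utterances))
```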
Using the SDK
Replace the above code with this single line:
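Going by the .toWebVTT() method this post introduces, that line would look something like the sketch below. We have wrapped it in a small function (writeWebVTT, our own name) so it stands alone. It assumes response is the object the SDK's transcription call resolved with, on SDK v1.1.0 or later within the v1 line.

```javascript
// `response` is assumed to be the SDK's transcription response and
// `stream` the writable stream opened earlier.
function writeWebVTT(response, stream) {
  stream.write(response.toWebVTT())
}
```

Remember that the request still needs utterances enabled, since the captions are built from them.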
SRT
Differences? No WEBVTT line at the top, the millisecond separator is a comma (,) instead of a period, and there is no - before the utterance.
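Applying those three differences, a manual SRT version can be sketched like this. As before, it assumes utterances with start, end, and transcript fields, and the function names are our own.

```javascript
// Convert seconds into SRT's "HH:MM:SS,mmm" (comma before milliseconds).
// Only valid for times under 24 hours.
function formatSrtTimestamp(seconds) {
  return new Date(Math.round(seconds * 1000))
    .toISOString()
    .slice(11, 23)
    .replace('.', ',')
}

// Build the full SRT text: no header, numbered blocks, no leading "- ".
function buildSRT(utterances) {
  return utterances
    .map((utterance, index) => {
      const start = formatSrtTimestamp(utterance.start)
      const end = formatSrtTimestamp(utterance.end)
      return `${index + 1}\n${start} --> ${end}\n${utterance.transcript}\n\n`
    })
    .join('')
}

// Usage, with the stream opened earlier:
// stream.write(buildSRT(response.results.utterances))
```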
Using the SDK
Replace the above code with this single line:
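With the .toSRT() method this post introduces, the SRT equivalent would be along these lines. The writeSRT wrapper is our own, and response and stream are assumed to be the SDK's transcription response and the writable stream from earlier.

```javascript
// Requires SDK v1.1.0 or later in the v1 line, and a request made with utterances enabled.
function writeSRT(response, stream) {
  stream.write(response.toSRT())
}
```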
One Line to Captions
We actually implemented .toWebVTT() and .toSRT() straight into the Node.js SDK while writing this post. Now, it's easier than ever to create valid caption files automatically with Deepgram. If you have any questions, please feel free to reach out on Twitter - we're @DeepgramAI.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.