So, you want to run the OpenAI Whisper tool on your machine? You can install it from the OpenAI GitHub repository to get up and running!
Setup
You'll need Python on your machine, at least version 3.7. If you want to isolate these experiments from other work, set up a virtual environment with venv (or conda or the like).
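For example, a minimal setup with venv might look like the following (the environment name whisper-env is just an illustrative choice):

```bash
# Create and activate a virtual environment
# (on Windows, activate with whisper-env\Scripts\activate instead)
python3 -m venv whisper-env
source whisper-env/bin/activate
```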
Next, install the Whisper package from its repository, along with its dependencies (torch, numpy, transformers, tqdm, more-itertools, and ffmpeg-python), into your Python environment.
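At the time of writing, the Whisper README suggests installing straight from GitHub with pip, along these lines:

```bash
# Install Whisper and its Python dependencies from the GitHub repository
pip install git+https://github.com/openai/whisper.git
```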
This may take a little while, especially if it's pulling torch for the first time. The repository documentation advises that if you get errors building the wheel for tokenizers, you may also need to install Rust. You'll also need ffmpeg, whose installation depends on your platform. Here are a few examples, assuming one of the usual package managers is available on your system:
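```bash
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg

# macOS with Homebrew
brew install ffmpeg

# Windows with Chocolatey
choco install ffmpeg
```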
Using the Tool
Great! You're ready to transcribe! In this example, we're working with Nikola Tesla's vision of a wireless future - if you don't have something queued up and ready to go, you can grab this audio file from the LibriVox archive of public-domain audiobooks and bring it to your local machine.
The OpenAI Whisper tool offers a variety of models, both English-only and multilingual, in a range of sizes that trade off speed against accuracy. You can learn more about this here. We, the researchers at Deepgram, have found that the small model provides a good balance.
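For instance, once the install above succeeds, a single CLI call runs the small model (the audio file name here is just a placeholder for whatever you downloaded):

```bash
# Transcribe with the small model; transcript files are written to the current directory
whisper tesla_wireless.mp3 --model small
```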
Deepgram's Whisper API Endpoint
Getting the Whisper tool working on your machine may require some fiddly work with dependencies - especially around torch and whatever software is already driving your GPU. Our OpenAI Whisper API endpoint is easy to work with from the command line - you can use curl to quickly send audio to our API.
This call will send your file to the API and save the response to a local JSON file called n_tesla.json:
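A call along these lines should work; YOUR_DEEPGRAM_API_KEY and the audio file name are placeholders, and the model=whisper-small parameter is our assumption about how to select the Whisper small tier, so check the Deepgram API docs for the exact value you want:

```bash
# Send the local audio file to Deepgram's /v1/listen endpoint and
# write the JSON response to n_tesla.json
curl -X POST \
  -H "Authorization: Token YOUR_DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/mp3" \
  --data-binary @tesla_wireless.mp3 \
  -o n_tesla.json \
  "https://api.deepgram.com/v1/listen?model=whisper-small"
```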
The response is in Deepgram's JSON format, which includes the transcript as well as metadata about your transcription request. A quick way to view just the transcript is with the jq tool:
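Assuming Deepgram's usual response shape (results.channels[].alternatives[].transcript), something like this prints only the transcript text:

```bash
# Pull just the transcript out of the Deepgram response
jq -r '.results.channels[0].alternatives[0].transcript' n_tesla.json
```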
...and here's your transcript!
But Wait. Those Transcripts Aren't the Same.
Excellent observation! The local run was able to transcribe "LibriVox," while the API call returned "LeapRvox." This is an artifact of this kind of model: its results are not deterministic. That is, some optimizations for working with large quantities of audio depend on overall system state and do not produce precisely the same output between runs. In our observations, the resulting differences typically amount to fluctuations on the order of 1% (absolute) in word error rate.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub Discussions.