
How to Build an OpenAI Whisper API

In this blog, learn step-by-step how to build an API for OpenAI Whisper, an open-source automatic speech recognition model.

By Adam Sypniewski, CTO

Last Updated: Jun 13, 2024

So, you've probably heard about OpenAI's Whisper model; if not, it's an open-source automatic speech recognition (ASR) model – a fancy way of saying "speech-to-text" or just "speech recognition." What makes Whisper particularly interesting is that it works with multiple languages (at the time of writing, it supports 99 languages) and also supports translation into English. It also has a surprisingly low word error rate (WER) out-of-the-box.

Whisper is pretty easy to invoke at the command line: you pass the whisper command the path to an audio file.

And here's an example of its language detection at work:

And if you don't read Spanish, you can use the CLI to translate into English by adding --task translate:

Okay, so maybe that wasn't a very good translation...

CLIs are incredibly useful for getting things working locally fast. But they don't scale well if you want to hook up other software systems. They aren't good for builders.

The Gestalt of APIs

The moment you start thinking like a builder, you want things that you can piece together. Things that you can scale. Components that can be combined into more than the sum of their parts. That's where APIs come in: you can build services that provide value to any other piece of your system that you want.

Want to build a notetaking app that joins your Zoom calls, records the audio, and saves the transcript for browsing later? Well, you probably don't want to call whisper at the command line. You want a little service running, just waiting for requests. You want an API.

So, let's build one. Specifically, let's build an HTTP API that we can send HTTP POST requests to with a tool like curl or Postman. And let's do it in the data science language du jour – Python.

The first thing we need to pick out is a web server framework. There are lots available, ranging from full-fledged development platforms like Django, to simple synchronous frameworks like Flask, to pure-Python asynchronous frameworks like Tornado.

For this example, let's stick with Flask. It does everything we need without bringing too much extra support to the table, and is one of the simplest and easiest web frameworks to get started with. Let's install it with pip install flask.

Let's look at what a "Hello, World!" application looks like in Flask:
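Here's a minimal sketch of such an app. The root route and the function name are assumptions on my part; only the greeting comes from the text above.

```python
from flask import Flask

# Flask's CLI looks for an object named `app` (in app.py) by default.
app = Flask(__name__)


@app.route("/")  # the root path is an assumption; any URL rule would do
def hello():
    # The returned string becomes the HTTP response body.
    return "Hello, World!"
```

Because the object is called app and the file is app.py, Flask's command-line runner can find it without any extra configuration.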

Well, that looks simple. Does it run? First, save your file as app.py. Then run it with flask run from the same directory.

By default, Flask listens on port 5000. So let's try hitting our hello-world API endpoint:

Awesome! It's working! But how do we get our user's or client's data into Flask? That example curl command didn't send any file to our Flask server. In fact, our Flask app above only handled HTTP GET requests, and GET requests aren't meant to carry data (or "bodies," in HTTP parlance). But don't worry! We just need to change our Flask app to handle POST requests and the data that comes attached to them. This isn't hard, either: we just need to tell Flask that our handler will accept POST requests:
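A sketch of that one-line change, assuming the same root route as before (the response text is just a placeholder):

```python
from flask import Flask

app = Flask(__name__)


# Without `methods`, a route answers only GET (plus HEAD and OPTIONS).
@app.route("/", methods=["POST"])
def handler():
    # Placeholder response; the upload handling comes next.
    return "Got a POST!\n"
```

With methods set this way, a plain GET to the same URL now gets a 405 Method Not Allowed response.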

Okay, yeah, that was easy. Now let's add the actual logic for handling an uploaded file (i.e., the "body"). But we need a place to put it. Let's create a temporary file to hold the upload.
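One way this could look, as a sketch rather than the definitive handler: it loops over the uploaded files, copies each into a temporary file, and reports what it received. The response fields (field, filename, bytes) and the 400 error are illustrative choices of mine.

```python
import tempfile

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def handler():
    if not request.files:
        # The request carried no multipart file fields.
        return {"error": "no files uploaded"}, 400

    results = []
    # request.files maps each form field name to an uploaded file
    # (only the first file is seen if a field name is repeated).
    for field, upload in request.files.items():
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile() as temp:
            upload.save(temp)  # copy the uploaded bytes into the temp file
            temp.flush()
            results.append(
                {"field": field, "filename": upload.filename, "bytes": temp.tell()}
            )
    # Flask turns a returned dict into a JSON response.
    return {"results": results}
```

Echoing the byte count back is only there so you can confirm the upload arrived intact before any speech recognition is involved.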

Let's try running it (if you named your file app.py, this is just flask run). Can we send a file to it? Let's try the snippet we downloaded earlier:

Perfect. Now we need to connect it to Whisper.

Making It Real

At this point, it's time to get Whisper installed. The package on PyPI is called openai-whisper, so that's pip install -U openai-whisper.

Whisper also requires ffmpeg to be installed. Use your system package manager to get it installed (apt, pacman, brew, choco, etc.) - the package is usually just called ffmpeg.

Now, what does a minimal code snippet look like to get Whisper running using Python? Well, something like this:
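A minimal sketch using the load_model and transcribe calls from the openai-whisper package. The "base" model size and the function wrapper are my assumptions, and the import sits inside the function only so the snippet can be loaded before the package is installed; a module-level import is more usual.

```python
def transcribe_file(path, model_name="base"):
    """Return the transcript of one audio file as plain text."""
    # Lazy import: an assumption of this sketch, not a Whisper requirement.
    import whisper

    model = whisper.load_model(model_name)  # fetches the weights on first use
    result = model.transcribe(path)  # dict with "text", "segments", "language"
    return result["text"].strip()
```

Calling transcribe_file("audio.mp3") on a file of your own (the name here is hypothetical) returns the recognized text as a string.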

Okay. So we load a model and then give it a file to transcribe. That should be easy to add to our Flask app. We only need to load the model once, so we can do that at the top of our app. And we are already writing uploaded data to a temporary file, so it is extra easy. Let's modify the Flask app:
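Putting the two pieces together might look like the sketch below. It departs from the description above in one labeled way: instead of loading the model at import time, it loads it lazily on the first request and caches it, which still means the model is loaded only once. The route, the "base" model size, and the response shape are assumptions.

```python
import functools
import tempfile

from flask import Flask, request

app = Flask(__name__)


@functools.lru_cache(maxsize=1)
def get_model():
    # Loaded on first use, then reused for every later request.
    import whisper  # lazy import is an assumption of this sketch

    return whisper.load_model("base")


@app.route("/", methods=["POST"])
def handler():
    if not request.files:
        return {"error": "no files uploaded"}, 400

    results = []
    for field, upload in request.files.items():
        # Note: reopening a NamedTemporaryFile by name works on Linux and
        # macOS, but not on Windows while the file is still open.
        with tempfile.NamedTemporaryFile() as temp:
            upload.save(temp)
            temp.flush()  # make sure the bytes are on disk before they are read
            result = get_model().transcribe(temp.name)
        results.append({"filename": upload.filename, "transcript": result["text"]})
    return {"results": results}
```

The first request pays the model-loading cost and later ones reuse the cached model. Loading at the top of the module works just as well if you'd rather pay that cost at startup.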

Okay, everyone. Drumroll, please!

Testing the API

Run the Flask app, just like before: flask run. And now let's submit our file:

HOLY CRAP IT WORKED!

And because we wrote the Flask app to loop over all submitted files, we can submit multiple files at once:

Okay, that's seriously cool. If you have jq installed, you can pipe the output of curl into it for easier reading; otherwise, you can use python -m json.tool as a poor man's jq for pretty printing:
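If the client calling your API is itself written in Python, the same pretty printing is a couple of lines with the standard json module. This is a sketch; the sample response body is hypothetical, and ensure_ascii=False is my choice so that non-English transcripts stay readable.

```python
import json


def pretty(raw):
    """Re-indent a JSON string using four-space indentation."""
    return json.dumps(json.loads(raw), indent=4, ensure_ascii=False)


# Hypothetical response body; the real shape depends on your handler.
print(pretty('{"results": [{"filename": "a.wav", "transcript": "hello"}]}'))
```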

Beautiful.

To the Moon!

Congratulations! You now have a full-fledged HTTP API at your fingertips. What will you build now?

Here are some ideas for your speech recognition server:

What features can you add to the API output? Take a look at the Deepgram documentation for some inspiration.
Hook up to an RSS feed to automatically transcribe your favorite podcasts.
Monitor a local directory and automatically transcribe any audio files that land there.
Build a voice-controlled car.
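As a starting point for the first idea, here is a sketch of a helper that adds the detected language to the output and optionally translates into English. It assumes an already-loaded Whisper model is passed in; the helper name and the returned fields are hypothetical.

```python
def describe_audio(model, path, translate=False):
    """Return the detected language plus a transcript or English translation.

    `model` is an already-loaded Whisper model, for example the result of
    whisper.load_model("base").
    """
    task = "translate" if translate else "transcribe"
    # Whisper's transcribe() accepts the task as a decoding option and
    # reports the language it detected in the result dict.
    result = model.transcribe(path, task=task)
    return {
        "language": result["language"],
        "task": task,
        "text": result["text"].strip(),
    }
```

Passing the model in, rather than loading it inside the helper, keeps the expensive load in one place so an API handler can reuse it across requests.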

Happy building!

Shortcut: If you've skipped to the bottom and decided you don't want to build an API yourself, you're in luck. Deepgram hosts Whisper on its API. Check it out.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
