
Transcribing Twilio Voice Calls in Real-Time with Deepgram

In this tutorial, learn how to transcribe Twilio Voice calls in real time using Deepgram's streaming speech-to-text.

Pre-requisites
Setting Up A TwiML Bin
The Twilio Proxy Server
Running the Server and Testing with WebSocat
Further Development


By Nikola Whallon, Software Engineer

Last Updated: Jun 13, 2024


Twilio is a very popular voice platform, and Deepgram is a great automatic speech recognition (ASR) solution, so there is great value in integrating the two. This tutorial will guide you through building an integration that allows multiple client subscribers to watch live transcripts from ongoing Twilio calls. The code for this tutorial is located here.

Pre-requisites

You will need:

A Twilio account with a Twilio number (the free tier will work).
A Deepgram API Key - get an API Key here.
(Optional) ngrok to let Twilio access a local server.

Setting Up A TwiML Bin

We need to tell Twilio to fork audio data from calls going to your Twilio number to the server we are going to write. This is done with a TwiML Bin. In the Twilio Console, search for TwiML Bin and click "Create TwiML Bin."

Give the TwiML Bin a "Friendly Name" - something like "Streaming" or "Deepgram Streaming," and then make the contents of the TwiML Bin the following:
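A TwiML Bin matching this description might look like the sketch below. It assumes Twilio's Start and Stream verbs with both call tracks forked to the server's /twilio route; the phone number is a placeholder, and INSERT_YOUR_SERVER_URL is filled in as described next.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the number below is a placeholder, not a real destination. -->
<Response>
  <Start>
    <!-- Fork both the inbound and outbound audio tracks to our server. -->
    <Stream url="wss://INSERT_YOUR_SERVER_URL/twilio" track="both_tracks" />
  </Start>
  <!-- Forward the call itself to a real phone. -->
  <Dial>+11231231234</Dial>
</Response>
```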

Replace the number in the Dial section with the phone number you want incoming calls to be forwarded to (this should not be your Twilio number; it should be the number of a real phone in your possession). Then, where it says INSERT_YOUR_SERVER_URL, type in the URL where you will be running your server. If you don't want to set up an AWS or DigitalOcean server, you can use ngrok to expose a local server. To expose port 5000 on your computer, run ngrok http 5000 in a terminal.

ngrok will then tell you the public URL which points to your localhost:5000. Your URL may end up looking something like: c52e-71-212-124-133.ngrok.io.

Now, we need to connect this TwiML Bin to your Twilio phone number. Go to the "Develop" tab on the left side of the Twilio Console, navigate to Phone Numbers -> Manage -> Active numbers, and click on your Twilio number in the list. Then, under the field "A Call Comes In," click the drop-down and select "TwiML Bin"; for the field directly next to this one, click the drop-down and select "Streaming" (or whatever your TwiML Bin's "Friendly Name" is). Finally, click "Save" at the bottom of the Twilio Console. Everything on Twilio's side should now be set up, and we are ready to move on to the Deepgram integration server!

The Twilio Proxy Server

Let's take a look at the system we are building here:

We have pairs of callers, inbound and outbound, and, for each call passing through Twilio's servers, Twilio is able to fork the audio from the call to our proxy server via websockets. Our server then has to do some light processing of that audio, forward it on to Deepgram, receive transcripts back from Deepgram, and forward those transcripts on to potentially multiple clients who are subscribed to watch the call's transcripts. So in order to view real-time transcripts in a client application, our backend server must maintain at least three websocket connections: one to Twilio, one to Deepgram, and one per subscribed client. This can get complicated, especially when dealing with many concurrent Twilio calls and subscribed clients!

Download the code from this repository. It contains a single file, twilio.py!

Let's look at the code (make sure to replace INSERT_YOUR_DEEPGRAM_API_KEY with your Deepgram API Key):

This server uses the Python websockets library to connect to Twilio, Deepgram, and client applications, and the asyncio library to handle concurrent connections. The server has two routes: /twilio and /client. As we have configured in our TwiML Bin, Twilio will connect to the /twilio endpoint and send audio data to it, and we will use the /client endpoint for client applications that watch the streaming transcripts.
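The two-route layout can be sketched as a small dispatcher. This is an illustration rather than the repository's code: the handler bodies are placeholders, and the path is passed in explicitly because how a handler learns the request path depends on the websockets version in use.

```python
import asyncio


async def twilio_handler(websocket):
    """Placeholder: the real handler would receive call audio from Twilio."""
    return "twilio"


async def client_handler(websocket):
    """Placeholder: the real handler would stream transcripts to a subscriber."""
    return "client"


# Hypothetical routing table: request path -> coroutine that serves it.
ROUTES = {"/twilio": twilio_handler, "/client": client_handler}


async def router(websocket, path):
    """Dispatch a new connection to the handler registered for its path."""
    handler = ROUTES.get(path)
    if handler is None:
        # Unknown route: there is nothing to serve on this connection.
        return None
    return await handler(websocket)
```

Keeping the routes in a dictionary makes it easy to add further endpoints later without touching the dispatch logic.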

The server uses a dictionary, called subscribers, to handle concurrent connected clients. Specifically, subscribers is a dictionary whose keys are Twilio callSids, which uniquely identify calls, and whose values are lists of queues, one for each client "subscribed" to that call (i.e. watching for streaming transcripts from it).
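As a rough sketch of that data structure, assuming one asyncio.Queue per client (the helper names register_call, subscribe, unsubscribe, and active_calls are illustrative and not taken from the repository):

```python
import asyncio

# callSid -> list of queues, one queue per subscribed client.
subscribers = {}


def register_call(call_sid):
    """Make a call visible to clients as soon as its callSid is known."""
    subscribers.setdefault(call_sid, [])


def subscribe(call_sid):
    """Create a queue for a new client and attach it to an ongoing call."""
    if call_sid not in subscribers:
        raise KeyError(f"no ongoing call with callSid {call_sid!r}")
    queue = asyncio.Queue()
    subscribers[call_sid].append(queue)
    return queue


def unsubscribe(call_sid, queue):
    """Detach a client's queue, tolerating calls that have already ended."""
    queues = subscribers.get(call_sid, [])
    if queue in queues:
        queues.remove(queue)


def active_calls():
    """Return the callSids of all currently streaming calls."""
    return list(subscribers)
```

A list of queues per call means each client consumes transcripts at its own pace, and a slow client does not block the others.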

To dive into the code, let's look at the client_handler function. When a client connects to the /client endpoint, the client_handler function will first send a websocket message to the client listing the callSids of all currently streaming calls. The function then waits to receive a websocket message which it expects to be the callSid of the call that the client wants to view live transcripts for (and if the function does not receive a valid callSid, it will bail). Having received a valid callSid, the function then inserts this client's queue into the subscribers dictionary and starts an async task which reads from this queue, sending transcription results back to the client via websocket messages, or gracefully closing the websocket connection if the message "close" was received on the queue.
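The flow just described can be sketched as follows. This is a simplified stand-in, not the repository's function: it assumes the list of callSids is sent as JSON, it awaits the forwarding loop directly instead of starting a separate task, and it works with any object that offers async send, recv, and close methods.

```python
import asyncio
import json

# callSid -> list of asyncio.Queue objects, as described in the text.
subscribers = {}


async def client_sender(websocket, queue):
    """Forward queued transcripts until the "close" sentinel arrives."""
    while True:
        message = await queue.get()
        if message == "close":
            break
        await websocket.send(message)
    await websocket.close()


async def client_handler(websocket):
    # 1. Tell the client which calls are currently streaming.
    await websocket.send(json.dumps(list(subscribers.keys())))

    # 2. Wait for the client to choose one; bail on anything invalid.
    call_sid = (await websocket.recv()).strip()
    if call_sid not in subscribers:
        await websocket.close()
        return

    # 3. Register this client's queue and forward transcripts from it.
    queue = asyncio.Queue()
    subscribers[call_sid].append(queue)
    try:
        await client_sender(websocket, queue)
    finally:
        # The call may already have been removed when it ended.
        queues = subscribers.get(call_sid, [])
        if queue in queues:
            queues.remove(queue)
```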

Now let's jump into the more involved twilio_handler function. This function handles incoming websocket connections from Twilio, and begins by setting up a queue for audio data, and a queue to handle passing the incoming callSid between async tasks. It then connects to Deepgram and sets up three async tasks: deepgram_receiver, deepgram_sender, and twilio_receiver (we will never send websocket messages back to Twilio, hence no "twilio_sender" task).
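When connecting to Deepgram, the streaming parameters need to match the audio we send. Twilio media streams carry 8 kHz mu-law audio, and the mixed stream has two channels, so a connection request might be assembled as in the sketch below. The helper name deepgram_request is hypothetical, and the exact parameters used by the repository may differ.

```python
from urllib.parse import urlencode


def deepgram_request(api_key):
    """Build the streaming URL and auth header for a stereo Twilio call."""
    params = {
        "encoding": "mulaw",     # Twilio media streams are mu-law encoded
        "sample_rate": 8000,     # telephone-quality audio, 8 kHz
        "channels": 2,           # inbound and outbound tracks mixed to stereo
        "multichannel": "true",  # transcribe each channel separately
    }
    url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
    headers = {"Authorization": f"Token {api_key}"}
    return url, headers
```

The returned URL and headers would then be handed to whichever websocket client opens the Deepgram connection.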

The twilio_receiver task handles incoming websocket messages from Twilio. Before Twilio sends audio data, it will send some metadata as part of a start event. One of these pieces of metadata is the callSid of the call, and we will pass that on to the deepgram_receiver task via a queue. Then, when Twilio starts streaming media (i.e. audio) events, we will perform some logic to buffer and mix this audio. In particular, Twilio will stream audio in via separate inbound and outbound audio tracks; we must make sure we mix these two audio tracks together as correct stereo audio to pass on to Deepgram. Some issues arise if call packets are dropped from one of these tracks, and logic is implemented with ample comments to deal with this without having the two channels in the mixed stereo audio get out of sync. Finally, with correctly mixed audio buffers prepared, twilio_receiver will pass this audio on to the deepgram_sender task via a queue. The deepgram_sender task then simply passes this audio on to Deepgram via the Deepgram websocket handle.
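To make the message handling and mixing concrete, here is a deliberately simplified sketch. It parses Twilio's start, media, and stop events and interleaves two 8-bit mu-law buffers into stereo, padding the shorter one with silence. The repository's logic for dropped packets is more careful than this, and the choice of inbound as the first channel is an assumption.

```python
import base64
import json

MULAW_SILENCE = b"\xff"  # mu-law encoding of a zero-amplitude sample


def parse_twilio_message(raw):
    """Classify one Twilio websocket message.

    Returns ("start", callSid), ("media", (track, audio_bytes)),
    ("stop", None), or ("other", None).
    """
    data = json.loads(raw)
    event = data.get("event")
    if event == "start":
        return "start", data["start"]["callSid"]
    if event == "media":
        media = data["media"]
        # Audio arrives base64-encoded, tagged with its track name.
        return "media", (media["track"], base64.b64decode(media["payload"]))
    if event == "stop":
        return "stop", None
    return "other", None


def mix_stereo(inbound, outbound):
    """Interleave two mono 8-bit mu-law buffers into one stereo buffer.

    Each sample is a single byte, so stereo is just alternating bytes.
    The shorter buffer is padded with silence to keep channels aligned.
    """
    length = max(len(inbound), len(outbound))
    inbound = inbound.ljust(length, MULAW_SILENCE)
    outbound = outbound.ljust(length, MULAW_SILENCE)
    mixed = bytearray(2 * length)
    mixed[0::2] = inbound   # channel 0 (assumed: inbound track)
    mixed[1::2] = outbound  # channel 1 (assumed: outbound track)
    return bytes(mixed)
```

Padding at the end of a buffer only illustrates the idea; a production version would need to decide where in the stream the missing audio belongs.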

Finally, we get to the deepgram_receiver task. In order to pass transcripts from Deepgram on to subscribed clients, we must first know the callSid of the call, so the first thing deepgram_receiver does is wait to obtain this from the twilio_receiver via a queue. Once the callSid is obtained, the deepgram_receiver is then able to forward on all transcription results from Deepgram to all clients subscribed to that callSid. It does this via another queue, which is handled by the async task defined in client_handler, and thus we come full circle.
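The fan-out step might look roughly like this. The sketch assumes the Deepgram connection can be consumed as an async iterable of messages, and that the receiver is the one to register the call and to send "close" to every subscriber when the stream ends; the repository may organize this differently.

```python
import asyncio

# callSid -> list of asyncio.Queue objects, one per subscribed client.
subscribers = {}


async def deepgram_receiver(deepgram_messages, callsid_queue):
    """Forward every Deepgram message to all clients subscribed to the call."""
    # Nothing can be routed until twilio_receiver tells us which call this is.
    call_sid = await callsid_queue.get()
    subscribers[call_sid] = []
    try:
        async for message in deepgram_messages:
            for queue in subscribers[call_sid]:
                queue.put_nowait(message)
    finally:
        # The call is over: drop it and tell each client to disconnect.
        for queue in subscribers.pop(call_sid, []):
            queue.put_nowait("close")
```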

Running the Server and Testing with WebSocat

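To run the server, first pip3 install the websockets and pydub libraries (asyncio is part of Python's standard library and needs no separate install), and then run twilio.py with Python 3 (for example, python3 twilio.py).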

If you are running this on your own cloud server, make sure port 5000 is accessible. If you followed the optional suggestion of using ngrok, running ngrok http 5000 in a separate terminal is all the setup you need.

To quickly test the integration, start a call to your Twilio number - this call will be forwarded to the phone number in the Dial section of your TwiML Bin, so you will need two phones (so feel free to grab a friend, or set up a Google Voice account or something similar!).

After the phone call has started, use a tool like websocat to connect to ws://localhost:5000/client. Upon connecting, the server should output a list of the callSids of ongoing calls (it should be a list of exactly one call at this point); reply to the server with one of these callSids and watch the Deepgram transcription responses roll in! You can start multiple clients and have them all subscribe to the same callSid to see how a concurrent system could work.

Further Development

The Deepgram-Twilio integration design presented here is slightly opinionated, in the interest of getting a reasonably complete demo up and running. You may want to factor in authentication, as the /client endpoint explained here is completely unauthenticated. You also may want to find an alternate way of labelling calls to subscribe to - instead of grabbing callSids, one could subscribe directly to Twilio numbers, but this would require extra Twilio API integration to look up the status of calls to your Twilio numbers.

Another clear next step would be to develop a proper client application. Programs like websocat are fantastic for testing, but you will likely want to design a front-end application which handles selecting callSids to subscribe to, parses and formats the Deepgram transcription response, and possibly other features.
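As a starting point for the parsing step, a client could reduce each Deepgram message to a single display line. The sketch below assumes the standard shape of Deepgram's streaming results (a channel object with alternatives, a channel_index list, and an is_final flag); the function name and output format are illustrative.

```python
import json


def format_transcript(raw):
    """Turn one Deepgram streaming message into a display line, or None."""
    data = json.loads(raw)
    alternatives = data.get("channel", {}).get("alternatives", [])
    if not alternatives:
        # Not a transcription result (for example, a metadata message).
        return None
    transcript = alternatives[0].get("transcript", "")
    if not transcript:
        # Silence produces empty transcripts; skip them.
        return None
    # The first entry of channel_index is the channel the result belongs to.
    channel = data.get("channel_index", [0])[0]
    status = "final" if data.get("is_final") else "interim"
    return f"[channel {channel}] ({status}) {transcript}"
```

Because the two call legs are mixed into separate stereo channels, the channel number is enough to tell the two speakers apart in the output.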

If you have any questions, please feel free to reach out on Twitter - we're @DeepgramAI.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
