
Transcribe Twilio Voice Calls in Real-Time with Rust and Deepgram

Learn how to transcribe Twilio Voice calls with Deepgram using real-time speech-to-text in Rust.



By Nikola Whallon, Software Engineer

Last Updated: Jun 13, 2024

In a previous blog post, we showed how to build an integration between Deepgram and Twilio for real-time, live transcription using Python. In this post, we will revisit this integration and implement it in Rust. The Rust programming language is a favorite among Deepgram engineers, and is known for its type safety, performance, and powerful memory management achieved via a strict ownership system which eliminates entire categories of bugs!

We will be building our Twilio streaming app using the Axum web framework which is built on top of the powerful and popular asynchronous Tokio crate. Using Rust with an efficient asynchronous runtime like Tokio is a good choice for reliable and performant web app backends.

Pre-requisites

You will need:

a Twilio account with a Twilio number (the free tier will work)
a Deepgram API Key - get an API Key here
Rust installed
(optional) ngrok to let Twilio access a local server

Setting Up a TwiML Bin

We will use TwiML Bins to make Twilio fork audio data from phone calls to a server that we will write. In the Twilio Console, search for TwiML Bin, and click "Create TwiML Bin."

Give the TwiML Bin a "Friendly Name" and enter the following as the contents of the TwiML Bin:

In the Dial section, enter your phone number. Where it says INSERT_YOUR_SERVER_URL insert the URL where you will be hosting the server. Without having to spin up and configure a cloud instance, you can use ngrok to expose a port on localhost. To do this for, say, port 5000, run:

ngrok will then generate a public URL which forwards requests to your computer at localhost:5000. This URL may look something like: c52e-71-212-124-133.ngrok.io - enter this URL in your TwiML Bin.
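A TwiML Bin for this kind of setup typically starts a media stream to the server and then dials a number. As a minimal sketch of that shape, the Rust helper below renders such a document. The `<Start>`, `<Stream>`, and `<Dial>` elements and the `track="both_tracks"` attribute come from Twilio's TwiML vocabulary, and the `/twilio` path matches the endpoint we define later in this post. The helper name `render_twiml` and the example phone number are hypothetical.

```rust
/// Render a TwiML document that forks both call tracks to `server_host`
/// over a secure websocket and then dials `phone_number`.
/// `render_twiml` is a hypothetical helper, not part of the project code.
fn render_twiml(server_host: &str, phone_number: &str) -> String {
    format!(
        r#"<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Stream url="wss://{}/twilio" track="both_tracks" />
  </Start>
  <Dial>{}</Dial>
</Response>"#,
        server_host, phone_number
    )
}

fn main() {
    // Example values only: an ngrok hostname and a placeholder phone number.
    println!("{}", render_twiml("c52e-71-212-124-133.ngrok.io", "+15555550123"));
}
```

The printed document is what you would paste into the TwiML Bin, with your own server URL and phone number substituted.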

Now the last thing to do on the Twilio Console before hopping over to write our server code is to hook up one of your Twilio numbers to this TwiML Bin. Go to the "Develop" tab on the left side of the Twilio Console, navigate to Phone Numbers -> Manage -> Active numbers, and click on your Twilio number in the list. Then, under the field "A Call Comes In", click the drop-down and select "TwiML Bin"; for the field directly next to this one, click the drop-down and select the TwiML Bin you just created. Click "Save" at the bottom of the Twilio Console.

The Twilio Proxy Server

The system that we will be building is illustrated here:

We want audio from phone calls going through Twilio's server to be forked to the proxy server we will be writing. The proxy server then buffers and processes the audio, sends it to Deepgram, and receives transcripts back from Deepgram. The proxy server also accepts client connections which subscribe to ongoing calls, and whenever the server receives transcripts from Deepgram for those calls, it broadcasts those transcripts to all subscribers. This will all be done via WebSockets in near real time! Typical latencies for this system hover around 500 ms.

Download the code from this repository.

Below we will go through creating this project from scratch, but this will also act as a comprehensive code-tour of the repository. If you are keen on trying the server out right away and perusing the code more at your leisure, feel free to skip to the Running the Server and Testing with websocat section!

Setup the Rust Project and main.rs

Create a new Rust project using cargo new:

Go into the project directory and edit the Cargo.toml file, giving it the following contents:

Now let's modify src/main.rs. Let's begin by adding the use statements we will need, and defining some modules:

The modules we declared are: audio, handlers, message, state, and twilio_response. We will go over each one, but briefly these will be for the following:

audio: handle processing of audio data from Twilio
handlers: handlers for the websocket endpoints /twilio and /client
message: a helper module to convert between axum and tungstenite websocket messages
state: will contain the definition for the global state of the server
twilio_response: will contain definitions for Twilio's websocket message shape

Now, let's start defining our main function and set up the state to be shared among the handlers:

Our main function is set up to be asynchronous via the use of the #[tokio::main] macro. main and every async function that main then calls will be executed by the Tokio runtime. Inside main we grab the following environment variables:

PROXY_URL: the URL that this server will run on - by default it will use localhost and port 5000
DEEPGRAM_URL: the URL of Deepgram's streaming endpoint, including query parameters (Twilio audio uses the mulaw encoding with a sample rate of 8000 Hz, and we will be streaming stereo, i.e. 2-channel, audio)
DEEPGRAM_API_KEY: your Deepgram API Key
CERT_PEM: an optional environment variable pointing to a cert.pem file used for TLS
KEY_PEM: an optional environment variable pointing to a key.pem file used for TLS

We use these environment variables to construct an Arc<State> object to store the global server state.
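To make the startup logic concrete, here is a minimal sketch of reading these variables with defaults. The `Config` struct, the `load_config` function, and the exact default strings are assumptions made for illustration; in particular, the default Deepgram URL shown is just one plausible way to request mulaw, 8000 Hz, two-channel audio.

```rust
use std::env;

/// Settings gathered at startup. Names here are illustrative only.
#[derive(Debug, PartialEq)]
struct Config {
    proxy_url: String,
    deepgram_url: String,
    api_key: String,
    /// Certificate and key paths; TLS needs both to be present.
    tls: Option<(String, String)>,
}

/// Build a `Config` from any key lookup, so the logic can be exercised
/// without touching the real process environment.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let proxy_url = get("PROXY_URL").unwrap_or_else(|| "127.0.0.1:5000".to_string());
    let deepgram_url = get("DEEPGRAM_URL").unwrap_or_else(|| {
        // Assumed default: mulaw, 8000 Hz, two channels transcribed separately.
        "wss://api.deepgram.com/v1/listen?encoding=mulaw&sample_rate=8000&channels=2&multichannel=true"
            .to_string()
    });
    let api_key = get("DEEPGRAM_API_KEY")
        .ok_or_else(|| "DEEPGRAM_API_KEY must be set".to_string())?;
    let tls = match (get("CERT_PEM"), get("KEY_PEM")) {
        (Some(cert), Some(key)) => Some((cert, key)),
        _ => None,
    };
    Ok(Config { proxy_url, deepgram_url, api_key, tls })
}

fn main() {
    match load_config(|key| env::var(key).ok()) {
        Ok(config) => println!(
            "serving on {} (TLS: {}), key of {} characters, forwarding to {}",
            config.proxy_url,
            config.tls.is_some(),
            config.api_key.len(),
            config.deepgram_url
        ),
        Err(message) => eprintln!("configuration error: {}", message),
    }
}
```

Passing the lookup as a closure is a design choice of this sketch: `main` hands in `env::var`, while a test can hand in a fixed table.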

Now, let's finish filling in our main function by configuring our routes and spinning up the axum server to serve these routes:

The axum server is spun up with or without TLS support depending on whether or not the CERT_PEM and KEY_PEM environment variables are set.

That's all there is to main.rs! The bulk of the application logic will live in the websocket endpoint handlers, but before diving into them let's go over some of the objects the server will use.

state.rs, twilio_response.rs, and message.rs

Create the file src/state.rs and give it the following contents:

This will represent the global state of the server. The server will need to know the URL of Deepgram's streaming endpoint and a Deepgram API Key to use as authentication when connecting to this endpoint. Additionally, the server will hold a HashMap of subscriber websocket handles, with one entry for each incoming connection from Twilio. These handles are looked up by the callsid of the Twilio call, and the map is wrapped in a Mutex so that concurrent handlers can access it safely.
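As a rough picture of such a state object, here is a standard-library-only sketch. It is not the project's state.rs: a `std::sync::mpsc::Sender<String>` stands in for a real websocket handle, threads stand in for async handlers, and the helpers `register_call` and `is_active` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a subscriber's websocket handle (an assumption of this sketch).
type SubscriberHandle = Sender<String>;

/// Global server state, shared between handlers behind an `Arc`.
struct State {
    deepgram_url: String,
    api_key: String,
    /// Subscriber handles for each ongoing call, keyed by callsid.
    subscribers: Mutex<HashMap<String, Vec<SubscriberHandle>>>,
}

impl State {
    /// Start tracking a call with no subscribers yet (hypothetical helper).
    fn register_call(&self, callsid: &str) {
        let mut subscribers = self.subscribers.lock().unwrap();
        subscribers.insert(callsid.to_string(), Vec::new());
    }

    /// Report whether a call is currently tracked (hypothetical helper).
    fn is_active(&self, callsid: &str) -> bool {
        self.subscribers.lock().unwrap().contains_key(callsid)
    }
}

fn main() {
    let state = Arc::new(State {
        deepgram_url: "wss://api.deepgram.com/v1/listen".to_string(),
        api_key: "INSERT_YOUR_DEEPGRAM_API_KEY".to_string(),
        subscribers: Mutex::new(HashMap::new()),
    });
    println!("endpoint {} with a {}-character key", state.deepgram_url, state.api_key.len());

    // Each "handler" gets its own clone of the `Arc`; the `Mutex` serializes
    // access to the map itself.
    let workers: Vec<_> = vec!["CA111", "CA222"]
        .into_iter()
        .map(|callsid| {
            let state = Arc::clone(&state);
            thread::spawn(move || state.register_call(callsid))
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
    println!("CA111 active: {}", state.is_active("CA111"));
}
```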

Next, create the file src/twilio_response.rs and give it the following contents:

These are just basic structs defining the shape of the messages Twilio will send our server. Feel free to check out Twilio's documentation for more details.
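For a feel of what those shapes look like, here is a simplified model of the Twilio events this server cares about. The JSON field names in the comments follow Twilio's Media Streams messages, but the Rust type names, the reduced field set, and the `call_sid` helper are assumptions of this sketch, and JSON deserialization is left out to keep it dependency-free.

```rust
/// Payload of a `start` event; comments give the JSON field names.
#[derive(Debug)]
struct Start {
    stream_sid: String,  // "streamSid"
    call_sid: String,    // "callSid"
    tracks: Vec<String>, // "tracks", e.g. ["inbound", "outbound"]
}

/// Payload of a `media` event. Twilio sends the numeric values as strings.
#[derive(Debug)]
struct Media {
    track: String,     // "track": "inbound" or "outbound"
    timestamp: String, // "timestamp": milliseconds since the stream started
    payload: String,   // "payload": base64-encoded audio
}

/// The top-level "event" field of each message selects the variant.
#[derive(Debug)]
enum Event {
    Connected,
    Start(Start),
    Media(Media),
    Stop,
}

/// The callsid is only available on the start event (hypothetical helper).
fn call_sid(event: &Event) -> Option<&str> {
    match event {
        Event::Start(start) => Some(start.call_sid.as_str()),
        _ => None,
    }
}

fn main() {
    let events = vec![
        Event::Connected,
        Event::Start(Start {
            stream_sid: "MZ0001".to_string(),
            call_sid: "CA0001".to_string(),
            tracks: vec!["inbound".to_string(), "outbound".to_string()],
        }),
        Event::Media(Media {
            track: "inbound".to_string(),
            timestamp: "20".to_string(),
            payload: "/////w==".to_string(),
        }),
        Event::Stop,
    ];
    for event in &events {
        if let Some(sid) = call_sid(event) {
            println!("learned callsid {}", sid);
        }
        match event {
            Event::Start(start) => println!("stream {} carries {:?}", start.stream_sid, start.tracks),
            Event::Media(media) => println!(
                "{} audio at {} ms ({} base64 characters)",
                media.track,
                media.timestamp,
                media.payload.len()
            ),
            other => println!("{:?}", other),
        }
    }
}
```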

Finally, create the file src/message.rs and give it the following contents:

This is also a straightforward module which creates our own websocket Message type which can be used to convert to and from axum websocket messages and tungstenite websocket messages.
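The conversion pattern itself can be sketched with Rust's `From` trait. To stay self-contained, the example below uses two tiny stand-in modules, `axum_like` and `tungstenite_like`, in place of the real crates, and models only text and binary messages; the real message types have more variants (ping, pong, close) that a full implementation would also need to map.

```rust
/// Stand-in for the server-side websocket message type.
mod axum_like {
    #[derive(Debug, PartialEq)]
    pub enum Message {
        Text(String),
        Binary(Vec<u8>),
    }
}

/// Stand-in for the client-side websocket message type.
mod tungstenite_like {
    #[derive(Debug, PartialEq)]
    pub enum Message {
        Text(String),
        Binary(Vec<u8>),
    }
}

/// Our own message type, sitting between the two libraries.
#[derive(Debug, PartialEq)]
enum Message {
    Text(String),
    Binary(Vec<u8>),
}

impl From<axum_like::Message> for Message {
    fn from(message: axum_like::Message) -> Self {
        match message {
            axum_like::Message::Text(text) => Message::Text(text),
            axum_like::Message::Binary(bytes) => Message::Binary(bytes),
        }
    }
}

impl From<Message> for tungstenite_like::Message {
    fn from(message: Message) -> Self {
        match message {
            Message::Text(text) => tungstenite_like::Message::Text(text),
            Message::Binary(bytes) => tungstenite_like::Message::Binary(bytes),
        }
    }
}

fn main() {
    // A message arrives on one side and leaves on the other via two `.into()` calls.
    let incoming = axum_like::Message::Binary(vec![0xFF, 0xFF]);
    let ours: Message = incoming.into();
    let outgoing: tungstenite_like::Message = ours.into();
    println!("{:?}", outgoing);
}
```

With conversions like these in both directions, handler code can move messages between the two websocket connections without matching on variants at every call site.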

The WebSocket Endpoint Handlers

Now let's get into the core logic of the server. We need to define functions to handle client/subscriber connections to /client and Twilio connections to /twilio. Let's start with the client handler.

Start by creating src/handlers/mod.rs with the following contents:

This simply declares the modules we will use to handle the client/subscriber and Twilio websocket connections.

Then, create the file src/handlers/subscriber.rs with the following contents:

As we saw in main.rs, subscriber_handler is the function which will be called when a client tries to connect to the /client endpoint of our server. From there, we perform an upgrade from HTTP to websockets. Then, we try to obtain the subscribers HashMap from our server's global state and send to the client a list of the callsids of all ongoing Twilio calls that the server is handling. The server then waits for a single message back from the client, and it interprets this message as the callsid to subscribe to. If the server receives a valid callsid, it will push the websocket handle into the subscribers HashMap. When the Twilio handler receives a transcript for that callsid, it will broadcast it to all subscribers, including the one we just pushed. That's it for subscriber.rs!
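The two halves of this handshake can be sketched without any websocket machinery. In the example below, a `Sender<String>` stands in for the websocket handle, and the function names `active_calls` and `subscribe` are hypothetical. Trimming whitespace from the client's reply is also an assumption of the sketch.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::Mutex;

/// Subscriber handles per callsid; `Sender<String>` stands in for a websocket.
type Subscribers = Mutex<HashMap<String, Vec<Sender<String>>>>;

/// First half of the handshake: the callsids to offer a new client.
fn active_calls(subscribers: &Subscribers) -> Vec<String> {
    let mut calls: Vec<String> = subscribers.lock().unwrap().keys().cloned().collect();
    calls.sort();
    calls
}

/// Second half: register `handle` if the client replied with a known callsid.
/// Returns false, dropping the handle, when the callsid is not an ongoing call.
fn subscribe(subscribers: &Subscribers, reply: &str, handle: Sender<String>) -> bool {
    let mut map = subscribers.lock().unwrap();
    match map.get_mut(reply.trim()) {
        Some(handles) => {
            handles.push(handle);
            true
        }
        None => false,
    }
}

fn main() {
    let subscribers: Subscribers = Mutex::new(HashMap::new());
    // Pretend the Twilio handler has already registered one call.
    subscribers.lock().unwrap().insert("CA111".to_string(), Vec::new());

    println!("offering calls: {:?}", active_calls(&subscribers));
    let (handle, inbox) = channel();
    if subscribe(&subscribers, "CA111\n", handle) {
        // Later, a transcript for this call reaches the new subscriber.
        let map = subscribers.lock().unwrap();
        map["CA111"][0].send("transcript".to_string()).unwrap();
        println!("{}", inbox.recv().unwrap());
    }
}
```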

Now let's look at the bulkier twilio.rs. Create src/handlers/twilio.rs. Let's build this module piece by piece, starting with some use statements:

Incoming Twilio connections hitting /twilio will first be directed to the function twilio_handler, where the websocket upgrade will be performed. Then handle_socket will split the websocket connection into a receiver and a sender, open up an entirely new websocket connection to Deepgram, split the Deepgram websocket connection into a receiver and a sender, and spawn tasks which call the functions handle_to_subscribers and handle_from_twilio, which take these receivers and senders as arguments. A oneshot channel is also set up so that handle_from_twilio can send the callsid of the Twilio call to handle_to_subscribers in a thread-safe manner. The callsid is not yet known when these initial websocket connections are established; it only becomes available when Twilio sends this information in a Twilio start event websocket message.
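The callsid handoff between the two tasks is the subtle part, so here is a small standard-library analogy. Threads stand in for the async tasks and a one-slot `sync_channel` stands in for the oneshot channel; the function `relay` and the bracketed output format are hypothetical and exist only to make the ordering visible.

```rust
use std::sync::mpsc;
use std::thread;

/// Run two workers that mimic the two tasks: one learns the callsid and hands
/// it over exactly once, the other waits for it before handling transcripts.
fn relay(callsid: &str, transcripts: Vec<String>) -> Vec<String> {
    // One-slot channel standing in for a oneshot channel.
    let (callsid_tx, callsid_rx) = mpsc::sync_channel::<String>(1);
    let callsid = callsid.to_string();

    // Stand-in for handle_from_twilio: the callsid arrives with the start event.
    let from_twilio = thread::spawn(move || {
        callsid_tx.send(callsid).unwrap();
    });

    // Stand-in for handle_to_subscribers: block until the callsid is known,
    // then tag every transcript with it.
    let to_subscribers = thread::spawn(move || {
        let callsid = callsid_rx.recv().unwrap();
        transcripts
            .into_iter()
            .map(|transcript| format!("[{}] {}", callsid, transcript))
            .collect::<Vec<String>>()
    });

    from_twilio.join().unwrap();
    to_subscribers.join().unwrap()
}

fn main() {
    let tagged = relay("CA111", vec!["hello".to_string(), "world".to_string()]);
    for line in tagged {
        println!("{}", line);
    }
}
```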

Let's now define the handle_to_subscribers function:

This function first waits to receive the callsid from handle_from_twilio and then proceeds to read messages off the Deepgram websocket receiver, broadcasting all messages that it obtains to all subscribers to that callsid.

Now let's define handle_from_twilio as follows:

This function begins by setting up an object to help handle the audio buffers from the inbound and outbound callers. We then start reading websocket messages from the Twilio websocket receiver. When we obtain the Twilio start event message, we can grab the callsid, use it to set up subscribers to this call, and send it off to the handle_to_subscribers task via the oneshot channel we set up earlier. Subsequent Twilio media events are then processed via audio::process_twilio_media, and when a buffer of mixed stereo audio is ready, we send it to Deepgram via the Deepgram websocket sender.

Finally, when Twilio closes the connection to our server (or some error occurs), we must remember to remove all subscribers from the subscriber HashMap and close the connections to those subscribers.
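Broadcasting and cleanup can be illustrated in the same stand-in style. Below, a `Sender<String>` again plays the role of a subscriber's websocket, the Mutex around the map is omitted for brevity, and the functions `broadcast` and `end_call` are hypothetical. Pruning subscribers whose connection has gone away during a broadcast is a choice made for this sketch.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

/// Subscriber handles per callsid; `Sender<String>` stands in for a websocket.
type SubscriberMap = HashMap<String, Vec<Sender<String>>>;

/// Send `transcript` to every subscriber of `callsid`, dropping handles whose
/// receiving side has gone away. Returns how many subscribers received it.
fn broadcast(subscribers: &mut SubscriberMap, callsid: &str, transcript: &str) -> usize {
    match subscribers.get_mut(callsid) {
        Some(handles) => {
            handles.retain(|handle| handle.send(transcript.to_string()).is_ok());
            handles.len()
        }
        None => 0,
    }
}

/// Forget a finished call. Dropping the handles closes each subscriber's
/// channel. Returns how many subscribers were disconnected.
fn end_call(subscribers: &mut SubscriberMap, callsid: &str) -> usize {
    subscribers.remove(callsid).map_or(0, |handles| handles.len())
}

fn main() {
    let mut subscribers = SubscriberMap::new();
    let (handle, inbox) = channel();
    subscribers.insert("CA111".to_string(), vec![handle]);

    let delivered = broadcast(&mut subscribers, "CA111", "{\"transcript\":\"hello\"}");
    println!("delivered to {} subscriber(s): {}", delivered, inbox.recv().unwrap());

    let disconnected = end_call(&mut subscribers, "CA111");
    // With the call removed, the subscriber's channel is closed.
    println!("disconnected {}; channel closed: {}", disconnected, inbox.recv().is_err());
}
```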

Processing the Audio in audio.rs

When discussing the Twilio websocket handler, the processing of Twilio media events was delegated to audio::process_twilio_media. We will define this function in src/audio.rs. Make src/audio.rs with the following contents:

Twilio sends its audio data as 8000 Hz mulaw data, independently for inbound and outbound callers. Additionally, sometimes Twilio (or the phones which use Twilio) will drop packets of audio. The function process_twilio_media, then, handles inserting silence should there be dropped packets or timing issues, and mixes together the inbound and outbound audio into a valid stereo audio stream which we can then send to Deepgram. Several of the finer details are explained in the comments in this file.
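To show the idea of silence insertion and stereo mixing, here is a simplified sketch. It is not the project's audio.rs: the `Track` type, its `push` method, and the `mix` function are hypothetical, base64 decoding of the payload is skipped, and it assumes each timestamp marks the start of its chunk, that 0xFF is the mulaw byte for silence, and that inbound audio goes on the first channel.

```rust
/// Mulaw byte that decodes to zero amplitude (silence).
const MULAW_SILENCE: u8 = 0xFF;
/// 8000 samples per second at one byte per sample is 8 bytes per millisecond.
const BYTES_PER_MS: usize = 8;

/// Audio received so far for one side of the call.
#[derive(Default)]
struct Track {
    /// Bytes waiting to be mixed into the stereo stream.
    pending: Vec<u8>,
    /// Total bytes ever accepted on this track, including inserted silence.
    total: usize,
}

impl Track {
    /// Append a chunk that belongs at `timestamp_ms`, filling any gap left by
    /// dropped packets with silence. Late or overlapping chunks are simply
    /// appended in this sketch.
    fn push(&mut self, timestamp_ms: usize, chunk: &[u8]) {
        let expected_start = timestamp_ms * BYTES_PER_MS;
        if expected_start > self.total {
            let gap = expected_start - self.total;
            self.pending.resize(self.pending.len() + gap, MULAW_SILENCE);
            self.total += gap;
        }
        self.pending.extend_from_slice(chunk);
        self.total += chunk.len();
    }
}

/// Interleave as many samples as both tracks have available:
/// inbound on the first channel, outbound on the second.
fn mix(inbound: &mut Track, outbound: &mut Track) -> Vec<u8> {
    let n = inbound.pending.len().min(outbound.pending.len());
    let mut stereo = Vec::with_capacity(n * 2);
    for (left, right) in inbound.pending.drain(..n).zip(outbound.pending.drain(..n)) {
        stereo.push(left);
        stereo.push(right);
    }
    stereo
}

fn main() {
    let (mut inbound, mut outbound) = (Track::default(), Track::default());
    // Two inbound chunks with a 1 ms hole between them, and one outbound chunk.
    inbound.push(0, &[0x11; 8]);
    inbound.push(2, &[0x22; 8]);
    outbound.push(0, &[0x33; 8]);

    let stereo = mix(&mut inbound, &mut outbound);
    println!(
        "{} stereo bytes ready, {} inbound bytes still waiting",
        stereo.len(),
        inbound.pending.len()
    );
}
```

Only the samples that both sides can supply are mixed; the remainder stays buffered until the other track catches up, which keeps the two channels aligned in time.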

Running the Server and Testing with websocat

Let's use websocat to quickly test our server.

Run the server with the following:

replacing INSERT_YOUR_DEEPGRAM_API_KEY with your Deepgram API Key.

This server will need to be accessible to Twilio, and this is set up in the TwiML Bin you created in the previous Setting Up a TwiML Bin section. If you are using ngrok, make sure your TwiML Bin is updated with the current ngrok URL.

Now, call your Twilio number with one phone, and answer the call on the phone your Twilio number forwards to. Then, latch onto the proxy server via the client endpoint with websocat:

Websocat should immediately show a message from the server containing a list of the callsids of all active calls (in this case there should be just one). Reply with that callsid by copying and pasting it and hitting enter:

You should start to see transcription results appear in your websocat session in real time:

Feel free to try setting up multiple Twilio numbers, and multiple client sessions!

Making a Docker Image for the Server

Let's go through the process of building a Docker image so that this server can be portably deployed. We'll start by making a rust-toolchain file with the following contents:

(quite the simple file!). This will ensure that when you run cargo build (either manually, or as part of building a Docker image), the same version of Rust will be used every time.

Now, let's create a Dockerfile called Dockerfile and give it the following contents:

Replace YOUR_INFO with your name and email address (for me, for example, this would be Nikola Whallon <nikola@deepgram.com>). The key bits to take away are:

we start with an Ubuntu 22.04 image
we install several dependencies via apt
we use the rust-toolchain and build+install our executable with cargo install
we set the ENTRYPOINT to /bin/deepgram-twilio-streaming-rust, with no command-line arguments (CMD)

Now with the Dockerfile written, build the Docker image with:

If you will be pushing this image to Docker Hub so that the image can be pulled from a remote server (like an AWS instance), replace your-docker-hub-account with your Docker Hub account. For local testing, simply using the image name deepgram-twilio-streaming-rust:0.1.0 (or whatever you would like) will work. You are also free to pull and use deepgram/deepgram-twilio-treaming-rust:0.1.0!

Now you can run the Docker image in a container locally via:

replacing INSERT_YOUR_DEEPGRAM_API_KEY with your Deepgram API Key, and making sure the Docker image name matches what you built. This will run the image in a container in your current terminal, but you can include -d to detach the process and run it in the background. If you do this, you will need to keep track of whether or not it is running with docker ps and similar commands.

Refer to the Docker CLI documentation for more info.

Now that the Twilio proxy server should be running in a Docker container, feel free to give your Twilio number a call, and subscribe to the call with websocat by doing:

and replying to the server with the callsid it sends you.

You should be all set to push this Docker image to your Docker Hub (or use ours: deepgram/deepgram-twilio-treaming-rust:0.1.0), and pull and use it on your cloud server! You will need to provide the additional environment variables CERT_PEM and KEY_PEM to do this, making sure those files are accessible to the Docker container by using -v, and you may need to specify the port as 443 in the PROXY_URL and use -p 443:443 among other subtle changes. You should refer to your cloud server provider's documentation on setting up an https/wss enabled server with certificates. As an example, here's how I spun up the server app on an AWS Ubuntu 20.04 instance:

Further Development

This should get you up and running with an almost-production-ready Twilio-Deepgram proxy server, written in Rust. There are a few pieces that have been left out, for the sake of brevity and for the sake of being agnostic to the needs of your desired system. For example, calls to the /client endpoint are currently entirely unauthenticated, and indeed calls to /twilio are also unauthenticated (see these Twilio docs for more details). For a fully-production-ready service, you should take authentication into consideration. Also, no logging or telemetry is presented in the proxy server.

Finally, you will very likely need to build a front-end to interact with the server and properly parse the JSON messages being streamed. websocat is great for testing, but is not a reasonable final solution for subscribing to calls!

If you have any questions, please feel free to reach out on Twitter - we're @DeepgramAI.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
