
Sci-Fi Tech Come to Life: Building the Universal Translator from Star Trek with AI and LLMs

By Jose Nicholas Francisco
Published Jun 30, 2023
Updated Jun 13, 2024

I’ll admit, I’m a bit of a sci-fi nerd. Since 5th grade, I’ve wanted to build the Iron Man suit. And the moment my family finished watching Back to the Future II, I immediately started creating blueprints for a Hoverboard. I was 9.

Now older and wiser—and closer to Tony Stark’s level than I was in elementary school—I’ve decided to use the technology at my disposal to bring my favorite sci-fi tools to life. And my first venture into the land of gadgets-and-gizmos-aplenty is the Universal Translator.

You know the one. It’s how Rick and Morty understand beings from other planets. It’s how all the aliens on the Starship Enterprise communicate. It’s how The Doctor can speak with the Daleks.

And, most importantly, it’s how we’re going to translate BBC Radio from English to French.

First off, we need AI

Because natural human language contains numerous inconsistencies and lacks the logical rigor enforced in our personal computers, we’re going to need AI.

But that’s good news!

If you haven’t yet heard, artificial intelligence is 📈 booming 📈 right now. As a result, computer translations have vastly improved. We simply need to leverage already-existing tech to build our Universal Translator.

Think of it like this: every AI model, API service, and coding platform out there is a Lego block. We simply have to connect those blocks to build something much greater than the sum of its parts.

The Building Blocks

So let’s sketch out a blueprint for our Universal Translator. The first thing we need to do is outline our desired input and output.

Well, that’s simple. Our input is a stream of spoken words. And our output is a stream of translated, spoken words.

All that’s left to do is figure out exactly what Lego blocks go in between that input and output:

Here’s the pipeline I propose:

Most translation AI right now (circa 2023) is text-to-text. Meaning, these AI models are built to receive text in a given language as input, and then they return text as output as well. Nowhere does the translation AI handle audio. So if we’re going to make an audio-to-audio translator, we’re going to have to convert the audio into text first.

And that’s how we arrive at the following pipeline:

  • We use Deepgram’s Speech-to-Text AI to transcribe the spoken audio input into text. This text will be written in the speaker’s original language. That is, if the speaker says something in English, the AI will write those words down in English.

  • We use a Translation AI to convert this transcription into our target language. For our sake, let’s use French.

  • Finally, we use a Text-to-Speech application to say the translated words aloud.

Overall, our Lego-block pipeline should look like this:

And if we were to break down the “Transcribe & Translate” box, we’d have something like this:

And just like that, we have our pipeline! We’ve sketched out which technological lego-blocks we want and where to put them. All that’s left is to get our hands dirty and actually code this thing up.
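Before wiring up real services, the pipeline above can be sanity-checked as three composed functions. Everything in this sketch is a placeholder of my own invention—the stub bodies stand in for Deepgram, Google Translate, and a text-to-speech tool—but it shows the shape of the data flowing through each Lego block:

```python
# A minimal sketch of the Universal Translator pipeline.
# All three stage functions are placeholders, not real APIs.

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for speech-to-text (Deepgram in the real build)."""
    return audio_chunk.decode("utf-8")  # pretend the audio is already words

def translate(text: str, src: str = "en", dest: str = "fr") -> str:
    """Placeholder for a translation AI (Google Translate in the real build)."""
    toy_dictionary = {"hello world": "bonjour le monde"}
    return toy_dictionary.get(text.lower(), text)

def speak(text: str) -> bytes:
    """Placeholder for text-to-speech, which turns the translation into audio."""
    return text.encode("utf-8")

def universal_translator(audio_chunk: bytes) -> bytes:
    # input audio -> transcript -> translation -> output audio
    return speak(translate(transcribe(audio_chunk)))

print(universal_translator(b"Hello world"))  # → b'bonjour le monde'
```

Each placeholder gets swapped for a real service in the sections that follow, but the composition stays the same.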

Getting our hands dirty

Alright, here are the exact steps and technology we’ll be using to build our universal translator. The code appears below as well, but if you want to see everything at once, check out this notebook!

Below, you’ll see the code for the following components of our Translator:

  1. Turning input audio into a transcript.

  2. Turning a transcript into translated text.

  3. Turning translated text into output audio.

Got it? Let’s go!

Turning input audio into a transcript

While there are many options for speech-to-text out there, not many of them are actually good. OpenAI’s Whisper has issues with any audio that isn’t perfectly polished and well-microphoned. Meanwhile, anyone who has seen the shoddy quality of auto-generated YouTube captions knows first-hand the abysmal error rate of Google’s default speech-to-text technology.

As a result, we’re using Deepgram, whose latest Nova model quite literally outshines every other API out there.

Thankfully, using Deepgram’s API is extremely simple! The accompanying SDKs and documentation give you code that you can literally copy and paste into your project. The result looks like this:

from deepgram import Deepgram
import asyncio
import aiohttp
import time

# Your personal API key
DEEPGRAM_API_KEY = '🔑🔑🔑 Your API Key here! 🔑🔑🔑'

# URL for the real-time streaming audio you would like to transcribe
URL = ''

# Fill in these parameters to adjust the output as you wish!
# See our docs for more info:
PARAMS = {"punctuate": True,
          "numerals": True,
          "model": "general",
          "language": "en-US",
          "tier": "nova"}

# The number of *seconds* you wish to transcribe the livestream.
# Set this equal to `float('inf')` if you wish to stream forever.
# (Or at least until you wish to cut off the function manually.)
TIME_LIMIT = 30

# Set this variable to `True` if you wish only to
# see the transcribed words, like closed captions.
# Set it to `False` if you wish to see the raw JSON responses.
TRANSCRIPT_ONLY = True

def print_transcript(json_data):
    """
    Input: JSON data sent by a live transcription instance, which is named
    `deepgramLive` in main().

    Output: The printed transcript obtained from the JSON object
    """
    try:
        print(json_data['channel']['alternatives'][0]['transcript'])
    except KeyError:
        # Non-transcript messages (e.g. metadata) lack this key
        pass

async def main():
    start = time.time()
    # Initializes the Deepgram SDK
    deepgram = Deepgram(DEEPGRAM_API_KEY)
    # Create a websocket connection to Deepgram
    try:
        deepgramLive = await deepgram.transcription.live(PARAMS)
    except Exception as e:
        print(f'Could not open socket: {e}')
        return

    # Listen for the connection to close
    deepgramLive.registerHandler(deepgramLive.event.CLOSE,
                                 lambda _: print('✅ Transcription complete! Connection closed. ✅'))

    # Listen for any transcripts received from Deepgram & write them to the console
    handler = print_transcript if TRANSCRIPT_ONLY else print
    deepgramLive.registerHandler(deepgramLive.event.TRANSCRIPT_RECEIVED, handler)

    # Listen for the connection to open and send streaming audio from the URL to Deepgram
    async with aiohttp.ClientSession() as session:
        async with session.get(URL) as audio:
            while time.time() - start < TIME_LIMIT:
                data = await audio.content.readany()
                if data:
                    deepgramLive.send(data)

    # Indicate that we've finished sending data by sending the customary
    # zero-byte message to the Deepgram streaming endpoint, and wait
    # until we get back the final summary metadata object
    await deepgramLive.finish()

await main()

This code allows you to transcribe the audio from the source specified by `URL`. This program also allows you to customize how long you want it to run. The variable `TIME_LIMIT` specifies exactly how many seconds you wish to run the transcription AI.

If this step succeeds for you, the output should look something like this (with a different radio broadcast, of course):

Turning a transcript into translated text

For simplicity’s sake, we’re just going to showcase what the translation function looks like. For a complete display of the code, take a look at this notebook.

Long story short, we’re using the Google Translate API. All we have to do to fire it up is (1) initialize the translator, and (2) tell the translator what languages we’re working with—both input and output.

import googletrans
from googletrans import Translator

translator = Translator()

def print_translation(json_data):
    try:
        line = json_data['channel']['alternatives'][0]['transcript']
        translation = translator.translate(line, src = 'en', dest = 'fr')
        print(translation.text)
    except KeyError:
        # Non-transcript messages (e.g. metadata) lack this key
        pass

This `print_translation()` function is meant to replace the `print_transcript()` function in the previous block of code. As you can see, instead of simply printing the transcribed `line` from the `json_data`, we pass that line through the translator. We specify that the source language `src` is English (`'en'`) and the destination language `dest` is French (`'fr'`).

We then replace any calls to `print_transcript` with a call to `print_translation` and we’re golden!


Turning translated text into audio output

Sadly, there doesn’t seem to be a stream-to-stream, text-to-speech API just yet. That is, our output from the previous lego block is a stream of translated text. And our desired output is a stream of audio, reading the translation in real-time. So we’re going to have to make a compromise.

We can either (1) print our translated text into a document, and then feed that document into a text-to-speech API, or (2) take our translated text and use a resource like Narakeet to produce audio files as output, instead of a stream.
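Option (1) only takes a few lines to set up. Here’s one way it could look: buffer the translated lines as they arrive and flush them to a text file, which you can then hand to whichever text-to-speech tool you prefer. The filename and the `save_translation()` helper are my own choices for this sketch, not part of any API:

```python
# Collect translated lines into a text file that a text-to-speech
# tool can later read back as audio. The filename is arbitrary.
OUTPUT_FILE = "translation_fr.txt"

translated_lines = []  # filled in as translated lines arrive from the stream

def save_translation(line: str):
    """Append one translated line to the buffer and flush the buffer to disk."""
    translated_lines.append(line)
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        f.write("\n".join(translated_lines))

# Example: two lines arriving from the translator
save_translation("Bonjour et bienvenue sur BBC Radio.")
save_translation("Voici les informations de ce matin.")
```

To wire this into the earlier code, you’d call `save_translation(translation.text)` inside `print_translation()` instead of (or alongside) printing.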

Pick your poison. Either way, however, the results are fun. Check it out: 


And that’s it! For real-time translation, the best we can do for now is to create real-time translated subtitles. But if we want to build a proper Star Trek-esque Universal Translator, we’re going to have to deal with some latency when producing an audio output.

Nevertheless, if there's one takeaway I wish to impart, it’s this: If you wish to become the next Iron Man—or, at least, if you wish to build a couple apps—you’ll often find that the technology already exists! You simply have to figure out how to link those AI-models and APIs together.

I’m already planning out another super-secret AI project. It also involves speech-to-text models (read: AI ears), but if you want a hint, check this out. 😉

And, as always, qatlho’ 🖖
