From Klingon to Simlish: Deepgram's AI Language Detection Adventures
In many automated speech recognition (ASR) use cases, developers have the luxury of being able to anticipate which languages their applications will need to converse in. If you know your core audience and consumers, it’s easy to assume that you’ll be operating in a widely spoken language like English. But what happens when a developer can’t anticipate the language that their application needs to comprehend and communicate in?
The internet has succeeded in connecting a large portion of the world’s population, and even though English is the lingua franca of the web, when it comes to verbal communication, language choice is all over the map. People who may not speak a widely supported language are now empowered to engage more extensively with software that was previously accessible predominantly in more developed regions. The linguistic bandwidth of these ASR applications needs to widen so that any language can be understood and handled. Enter AI language detection!
Language Detection with AI
Language detection is a classical machine learning problem that has been around for a while. It boils down to a classification task, where the objective is to accurately identify the label (in this case, the language) of a given text or audio sample. A collection of labeled samples makes up the model’s training data, and in a process known as supervised learning, the model learns to predict the correct target label from those labeled examples.
Features (measurable aspects of the text or audio) are extracted from this training data and passed to the model for training. In the case of text, these features can include character n-grams, word frequencies, or statistical properties like sentence length or punctuation usage.
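To make that concrete, here’s a minimal sketch of character n-gram feature extraction for text, assuming scikit-learn is installed; the sample sentences and labels are made up purely for illustration and aren’t part of this project.

```python
# Hypothetical example: character n-gram features for text language ID.
from sklearn.feature_extraction.text import CountVectorizer

samples = [
    "the cat sat on the mat",        # English
    "le chat est sur le tapis",      # French
    "el gato está en la alfombra",   # Spanish
]

# Character n-grams (1- to 3-character sequences within word boundaries)
# capture spelling patterns that differ between languages.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
features = vectorizer.fit_transform(samples)

print(features.shape)  # (3 samples, one column per observed character n-gram)
```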
For audio-based language detection, features such as Mel-frequency cepstral coefficients (MFCCs), which represent the spectral content of the audio signal, are commonly extracted. MFCCs are computed over a mel-scale filter bank that weights frequencies much the way the human cochlea does. These features help differentiate linguistic nuances across different languages, making it possible for a model to discern what language it's seeing or hearing.
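As a rough illustration of what that extraction looks like, here’s a sketch using the librosa library (not part of this project’s pipeline; the file path is a placeholder).

```python
# Hypothetical example: extracting MFCC features from an audio clip.
import librosa

# Load the audio; librosa resamples to 22,050 Hz by default.
signal, sample_rate = librosa.load("sample.mp3")

# Compute 13 MFCCs per frame, a common choice for speech features.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13 coefficients, number of frames)
```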
Once the features are extracted, they are used to train the model, which is usually some sort of deep neural network. The neural network learns to identify underlying patterns and correlations between the features and the corresponding language labels. It adjusts its parameters during iterations of training to minimize the difference between its predictions and the correct outputs. This way, the model becomes capable of generalizing its knowledge to accurately classify unheard audio samples.
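Here’s a toy sketch of that training step, assuming scikit-learn and random stand-in feature vectors; a real system would use pooled MFCCs or learned embeddings, far more data, and a much deeper network.

```python
# Hypothetical example: fitting a small feed-forward classifier on
# MFCC-style feature vectors labeled with their language.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))           # one 13-dim feature vector per clip (stand-in data)
y = rng.choice(["en", "ja", "ko"], 200)  # language label for each clip (stand-in data)

# The network adjusts its weights over training iterations to minimize
# the gap between its predictions and the true labels.
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
model.fit(X, y)

print(model.predict(X[:3]))  # predicted language for the first three clips
```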
Language detection can also be performed by large language models (LLMs) because they are trained on vast amounts of text data from multiple languages. This training process equips LLMs with a broad understanding of language structures and characteristics, enabling them to identify the language of a given text.
To demonstrate how this works in practice, let’s take a look at how the Deepgram API can perform language detection. However, we’re going to have a little bit of fun and see how the model reacts to a curveball. Instead of passing actual languages to the API, we’re going to use conlangs, or constructed languages. These are fictitious languages that you may have encountered in various pop culture media such as film, television, and video games. We’re going to feed samples of these conlangs to the Deepgram API’s language detection feature and see what language it thinks it's hearing. If we’re lucky, the results should be pretty interesting!
Now For Some Code
Deepgram’s API comes with the ability to perform language detection on audio samples to determine the dominant language spoken in the audio. It transcribes the audio in the identified language and returns the language code it classified the sample as in a JSON response. You can call the API via curl or the Python SDK. We’ll be using the Python SDK to try to classify our conlangs.
We obtained audio samples from YouTube of six fictitious languages: Animalese from Animal Crossing, Simlish from The Sims, Hylian from The Legend of Zelda, Valyrian from Game of Thrones, Klingon from Star Trek, and Mordor Black Speech from The Lord of the Rings. Those samples were then split into 15-second slices of audio to observe any changes the language detection mechanism might make over the entirety of each sample. If you’re trying to replicate this project, feel free to use the same samples in the Colab notebook linked below, or obtain your own samples and split them using an online audio splitter. We used a YouTube-to-MP3 online file converter to acquire our MP3 samples, of which there are plenty on the internet.
First, we’ll open up a terminal window and make sure the following dependencies are installed using pip so we can use the Python SDK to call the API.
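The snippets that follow assume the v2-era Deepgram Python SDK, so pinning the major version keeps the client interface consistent with the code below (the exact pin is our choice, not a requirement from the original setup):

```bash
pip install "deepgram-sdk==2.*"
```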
Next, we’ll make a project directory with the following structure and files (don’t worry about the files inside the “fake_language_json” directory; those will get populated upon running the script):
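Roughly, the layout we’ll assume looks like this (the `conlang_audio` directory name is our own placeholder; use whatever folder holds your audio slices):

```
conlangs/
├── fake_language.py        # the language detection script
├── secret_keys.py          # holds the Deepgram API key
├── conlang_audio/          # the 15-second audio slices
└── fake_language_json/     # JSON responses get written here by the script
```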
The API returns a JSON response with the transcription of the audio along with other relevant information that can be requested via parameters to the API. This information includes metadata, the duration of the sample, model information (we’ll be using Deepgram’s new state-of-the-art foundation model, Nova), confidence scores for all the words, and, most pertinent for this project, the detected language of the audio.
Finally, make sure you sign up for Deepgram’s API and generate your API key. Store it somewhere safe, and if you’re pushing code to any public repositories, never commit your key, for security purposes. Instead, store it in a `secret_keys.py` file or as an environment variable. We’ll go ahead and open our `secret_keys.py` file in the conlangs project directory and add the API key so that we can reference it in the script.
Now, we can get started on writing our language detection script! Open up `fake_language.py`. We’ll start by importing the required packages into the file. Here, `secret_keys` is just the separate file holding the API key.
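A minimal version of those imports might look like this, assuming a `DEEPGRAM_API_KEY` constant in `secret_keys.py` (the constant name is our assumption):

```python
import asyncio  # the v2 SDK's pre-recorded transcription call is async
import json
import os

from deepgram import Deepgram  # v2-style Deepgram client

from secret_keys import DEEPGRAM_API_KEY  # local file holding the API key
```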
Next, we’ll do a little bit of initialization so we can call the API. We’ll set up the Deepgram Python client using the API key and declare some top-level scope variables to be used by the API and the script.
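Here’s a sketch of that setup; the directory names mirror the project structure above, and the request options use Deepgram’s documented `model`, `detect_language`, and `punctuate` parameters:

```python
# Initialize the Deepgram client with our API key.
deepgram = Deepgram(DEEPGRAM_API_KEY)

# Directory holding the 15-second conlang slices (adjust to your own path).
AUDIO_DIRECTORY = "conlang_audio"

# Directory where the JSON responses will be written.
OUTPUT_DIRECTORY = "fake_language_json"

# Request parameters: the Nova model with language detection enabled
# (punctuation is on by default when detecting language).
PARAMS = {"model": "nova", "detect_language": True, "punctuate": True}

# Mimetype of the local audio files we'll be sending.
MIMETYPE = "audio/mp3"
```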
You can change the directory to point to wherever your audio samples live; here, it points to the directory containing the split audio samples of each of our conlangs. The API parameters select the Nova model and make sure the language detection feature is enabled during the request (the `punctuate` parameter is enabled by default during language detection, which is why it’s included here).
Next, we’ll go ahead and create the output directory that will hold all of our JSON responses. If the directory already exists, then there is no need to replace it.
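One call covers that, using `exist_ok` so an existing directory is left untouched:

```python
# Create the output directory for the JSON responses if it doesn't already exist.
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)
```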
Now, we can finally write our API call. We’ll loop through the audio samples directory to collect our inputs, pass them to the API, and store each result as a JSON file in our output directory. The Deepgram transcription API handles both pre-recorded and streaming audio. We’ll obviously be using the pre-recorded functionality, and the API can accept either a buffer or a URL as the audio source. Since we have downloaded our conlang samples to our machine, we’ll pass each file to the API as a buffer.
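A sketch of that loop is below, assuming the v2 SDK’s async `transcription.prerecorded(source, options)` method:

```python
async def transcribe_conlangs():
    """Send each audio slice to the Deepgram API and save the JSON response."""
    for filename in sorted(os.listdir(AUDIO_DIRECTORY)):
        if not filename.endswith(".mp3"):
            continue

        audio_path = os.path.join(AUDIO_DIRECTORY, filename)
        with open(audio_path, "rb") as audio:
            # The samples are on disk, so we pass them as a buffer source.
            source = {"buffer": audio, "mimetype": MIMETYPE}
            response = await deepgram.transcription.prerecorded(source, PARAMS)

        # Store the full response for later inspection.
        output_path = os.path.join(OUTPUT_DIRECTORY, filename.replace(".mp3", ".json"))
        with open(output_path, "w") as json_file:
            json.dump(response, json_file, indent=4)
```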
So now we have written the function that will call the API and store the response in a JSON file we can access to see the detected language. We’ll also write up a helper function to extract the language of each sample and nest it in a dictionary.
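A sketch of that helper is below; the `detected_language` field sits on the channel object in Deepgram’s documented response shape, so adjust the path if your response differs:

```python
def collect_detected_languages():
    """Map each saved JSON response to the language the API detected."""
    detected = {}
    for filename in sorted(os.listdir(OUTPUT_DIRECTORY)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(OUTPUT_DIRECTORY, filename)) as json_file:
            response = json.load(json_file)
        # The detected language is reported per channel; our samples are mono.
        channel = response["results"]["channels"][0]
        detected[filename] = channel.get("detected_language", "unknown")
    return detected
```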
Finally, in our script, we’ll go ahead and call our API-requesting function and display the language detected for each of the audio samples.
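Something like the following works as the driver, assuming Python 3.7+ for `asyncio.run`:

```python
if __name__ == "__main__":
    # Transcribe every sample, then print the detected language per slice.
    asyncio.run(transcribe_conlangs())
    for sample, language in collect_detected_languages().items():
        print(f"{sample}: {language}")
```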
And that’s the full script, assembled piece by piece above.
Let’s run it and see what we get for our results! All language detection results are BCP-47 language tags.
So, it seems like Hylian sounds like a mixture of Chinese, Korean, and Japanese. Simlish was detected as a purely English-sounding conlang, as was the Black Speech of Mordor. Valyrian and Klingon had some English in there, as well as other languages like Hindi, Italian, and Russian. Animalese was detected as a mix of mostly Japanese and Korean. Unsurprisingly, the video games developed in East Asia have East Asian-sounding conlangs! Comically interesting results aside, what did we observe while building out our conlang detection script?
We saw exactly how to use the Deepgram API, specifically how to enable language detection and parse the API’s response. We also talked a bit about how language detection works under the hood, and we observed the comical results of what happens when an ASR language detection API comes in contact with a completely fictitious language composed of gibberish. Goofy use cases aside, if you have an idea for an application that needs to perform language detection or even just speech transcription, sign up to get your Deepgram API key and start building!