
Topic Detection in Podcast Episodes with Python

By Tonya Sims
Published Aug 23, 2022
Updated Jun 13, 2024

Imagine you’re a Python Machine Learning Engineer. Your work day is about to start with the dreaded stand-up meeting, but what you're really looking forward to is diving deep into topic detection algorithms.

Important Note

If you want to see the whole Python code snippet for topic detection, please scroll to the bottom of this post.

You step out to get coffee down the street. A black SUV pulls up next to you, the door opens, and someone tells you to get in the truck.

They explain that your Machine Learning Python prowess is needed badly.

Why?

They urgently need you to transcribe a podcast from speech to text. But not just any podcast. It’s Team Coco’s podcast, hosted by the legendary Conan O’Brien. Not only do they need it transcribed using AI speech recognition, but they also require a topic analysis to quickly discover what the podcast is about.

They can’t say too much about the underground Operation Machine Learning Topic Detection, other than that if you fail to deliver the topic modeling results, or if you tell anyone, something terrible may happen.

Weird. Ironic but weird. Yesterday, you learned about the TF-IDF (Term Frequency - Inverse Document Frequency) topic detection algorithm.

You should feel confident in your Python and Machine Learning abilities, but you have some reservations.

You think about telling your manager but remember what they said about something terrible that may happen.

You’re full of self-doubt, and most importantly, you’re not even sure where to start with transcribing audio to text in Python.

What if something bad does happen if you don’t complete the topic detection request?

You decide to put on your superhero cape and take on the challenge because your life could depend on it.

Discovery of Deepgram AI Speech-to-Text

You’re back at your home office and not sure where to start with finding a Python speech-to-text audio transcription provider.

You try using Company A’s transcription with Python, but it takes a long time to get back a transcript. Besides, the file you need to transcribe is over an hour long, and you don’t have time to waste.

Next, you try Company B’s transcription with Python. This time, the transcription comes back faster, but there's one big problem: accuracy. Many of the words in the transcript you’re getting back are wrong.

You want to give up because you don’t think you’ll be able to find a superior company with an API that provides transcription.

Then you discover Deepgram, and everything changes.

Deepgram is an AI speech recognition company whose API lets you build applications that transcribe speech to text.

You love how effortless it is to sign up for Deepgram and quickly grab a Deepgram API Key from their website. You also get hands-on experience right after signing up by trying out their console missions, which have you transcribing prerecorded audio within a few minutes.

There’s even better news!

Deepgram has much higher transcription accuracy than the other providers you tried, and you receive a transcript back super fast. You also discover they have a Python SDK that you can use.

It’s do-or-(maybe)-die time.

You hear a tornado warning siren, but disregard it and start coding.

You won’t let anything get in your way, not even a twister.

Python Code for AI Machine Learning Topic Detection

You first create a virtual environment to install your Python packages inside.

Next, from the command line, you pip install the following Python packages inside of the virtual environment:

Then you create a .env file inside your project directory to hold your Deepgram API Key, so it’s not exposed to the whole world. Inside of your .env file, you assign your API Key from Deepgram to a variable `DEEPGRAM_API_KEY`, like so:
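A minimal sketch of that step, assuming the .env file holds a single line such as `DEEPGRAM_API_KEY=your-key-here` (a placeholder, not a real key). A package like python-dotenv is the usual way to load such a file; the standard-library stand-in below is an assumption made so the example runs on its own, and the helper names `load_env_file` and `get_deepgram_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Values already present in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_deepgram_key():
    """Return the API key, failing loudly if the .env entry is missing."""
    key = os.environ.get("DEEPGRAM_API_KEY")
    if not key:
        raise RuntimeError("DEEPGRAM_API_KEY is not set; add it to your .env file")
    return key
```

Calling `load_env_file()` once at the top of the script and then `get_deepgram_key()` keeps the key out of your source code. Remember to keep the .env file out of version control too.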

Next, you create a new file called `python_topic_detection.py`. You write the following code that imports Python libraries and handles the Deepgram prerecorded audio speech-to-text transcription:

The transcribe_with_deepgram() function comes from our Deepgram Python SDK, which is hosted on GitHub.

In this function, you initialize the Deepgram client and open the .mp3 podcast file to read it as audio. Then you use the prerecorded transcription option to transcribe the recorded file to text.
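A hedged sketch of what such a function might look like. It assumes the older (pre-v3) Deepgram Python SDK, which exposes a `Deepgram` client class with an async `transcription.prerecorded()` call; newer SDK releases use a different client, so check the version you installed. The file name `podcast.mp3` and the `extract_transcript` helper are hypothetical.

```python
import asyncio
import os


def extract_transcript(response):
    """Pull the first channel's transcript text out of a prerecorded response."""
    return response["results"]["channels"][0]["alternatives"][0]["transcript"]


async def transcribe_with_deepgram(audio_path, api_key):
    """Send a local .mp3 file to Deepgram and return the transcript text."""
    # Imported lazily so extract_transcript() works without the SDK installed.
    # Assumption: a pre-v3 SDK release that provides the `Deepgram` class.
    from deepgram import Deepgram

    deepgram = Deepgram(api_key)
    with open(audio_path, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/mp3"}
        response = await deepgram.transcription.prerecorded(
            source, {"punctuate": True}
        )
    return extract_transcript(response)


# Example call (needs a real key and audio file, so it is left commented out):
# transcript = asyncio.run(
#     transcribe_with_deepgram("podcast.mp3", os.environ["DEEPGRAM_API_KEY"])
# )
```

Keeping the response parsing in its own small function makes it easy to check against a saved response without spending API credits.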

You’re on a roll!

Next, you start writing the code for the TF-IDF Machine Learning algorithm to handle the topic detection. The tornado knocks out your power, and you realize you only have 20% laptop battery life.

You need to hurry and continue writing the following code in the same file:

In this code, you create a new function called cleaned_docs_to_vectorize(), which takes the transcript returned by the previous function and removes any stop words. Stop words are common words that carry little meaning on their own, such as a, the, and, and this.
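To make the stop-word step concrete, here is a small sketch. It is not the post's cleaned_docs_to_vectorize() function: the name `remove_stop_words` is hypothetical, and the tiny hand-picked `STOP_WORDS` set is an assumption standing in for a fuller list such as the English stop words shipped with NLTK.

```python
import re

# Assumption: a tiny illustrative list; a real project would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "this", "that", "is", "it",
    "of", "to", "in", "on", "was", "i", "you",
}


def remove_stop_words(transcript):
    """Split a transcript into sentences and drop stop words from each one."""
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    cleaned = []
    for sentence in sentences:
        # Lowercase and keep only alphabetic tokens (apostrophes allowed).
        words = re.findall(r"[a-z']+", sentence.lower())
        kept = [word for word in words if word not in STOP_WORDS]
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned
```

Each cleaned sentence then acts as one "document" for the vectorizer, which is why the function returns a list rather than a single string.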

The algorithm will then perform the TF-IDF vectorization using these lines of code:

You quickly read about the options passed into the vectorizer, like max_features and max_df, in the scikit-learn documentation.
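If you want to see what those two options do, the sketch below computes plain TF-IDF with the standard library only. It is an illustration, not scikit-learn's TfidfVectorizer, which by default also smooths the inverse document frequency and L2-normalizes each row. The function name `tfidf` is hypothetical; `max_df` and `max_features` mimic the scikit-learn options of the same name.

```python
import math
from collections import Counter


def tfidf(docs, max_df=1.0, max_features=None):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in doc i."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # max_df: ignore terms found in more than this fraction of documents.
    kept = [term for term in df if df[term] / n_docs <= max_df]
    # max_features: keep only the most frequent terms across the whole corpus.
    totals = Counter(term for tokens in tokenized for term in tokens)
    kept.sort(key=lambda term: (-totals[term], term))
    vocabulary = sorted(kept[:max_features])
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = term count in this doc * log(total docs / docs with term).
        rows.append(
            [counts[term] * math.log(n_docs / df[term]) for term in vocabulary]
        )
    return vocabulary, rows
```

A term that shows up in every document gets a weight of zero here, because log(1) is 0. That is the intuition behind TF-IDF: words that appear everywhere say little about what makes one part of the episode different from another.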

You have a little bit of time with 15% battery life, so you decide to use K-Means to create 10 clusters of topics. This way, they can get a more meaningful sense of how the podcast's content is structured. You write the K-Means clusters to a file called results.txt.
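The clustering idea can be sketched with the standard library alone. This is a bare-bones version of Lloyd's algorithm for illustration, not the scikit-learn `KMeans` class (where you would pass `n_clusters=10`). The function names and the layout of the results file are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return one cluster label per point."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(point) for point in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so we have converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def write_clusters(docs, labels, k, path="results.txt"):
    """Write each cluster's documents to a text file (hypothetical layout)."""
    with open(path, "w") as out:
        for c in range(k):
            out.write(f"Cluster {c}\n")
            for doc, label in zip(docs, labels):
                if label == c:
                    out.write(f"  {doc}\n")
```

In the project, `points` would be the TF-IDF rows and `docs` the cleaned sentences, with `k=10` to match the ten clusters described above.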

To run the program, type python3 python_topic_detection.py from the terminal.

When you print the topics, you see a list like the following:

Bingo!

You can now make inferences from the detected topics to determine the subject matter of the podcast episode.

Then you peek at your results.txt file to verify that you received 10 clusters. Here’s an example of four of the ten groups of words using K-Means clustering:

Just before your laptop battery dies, you show them the topics for Team Coco. They are very happy with your results and drive off.

You’re feeling more confident than ever.

You’ll never know why they needed the Machine Learning topic detection or why they chose you, but you’re on top of the world right now.

Conclusion

Congratulations on building the Topic Detection AI Python project with Deepgram. Now that you've made it to the end of this blog post, tweet us at @DeepgramAI if you have any questions or to let us know how you enjoyed this post.

Full Python Code for the AI Machine Learning Podcast Topic Detection Project

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
