
Parse Podcasts With Python: Understanding Lex Fridman’s Podcast With Deepgram ASR And Text Analysis

By Yujian Tang
Published Oct 25, 2022 · Updated Jun 13, 2024

When it comes to podcasting, Lex Fridman is an expert in the craft: delivering extensive, in-depth interviews with fascinating people for a large and engaged audience.

A computer scientist and AI researcher at MIT by day, Fridman also maintains an active online media presence. This includes his YouTube channel, on which he’s racked up nearly 270 million views and over 2.1 million subscribers. And then there’s his podcast, episodes of which can run for over two hours. His interview-based show has featured guests from the computer science field like legendary TAOCP author Donald Knuth, technology entrepreneurs like Anaconda founder Peter Wang, and various media personalities ranging from Vsauce YouTuber Michael Stevens to Joe Rogan.

It’s clear that he knows what he’s talking about, but with deep learning we can get a better understanding of what Fridman says, how he interacts with his interview subjects, and perhaps even what it means to be a good podcaster.

In this article, we show how to use Deepgram's transcription and understanding API to extract and analyze what Fridman and friends say on the podcast, and perform further analysis with The Text API to pull out even more insights. We’ve structured the article to highlight our findings first, but don’t worry: there is plenty of technical explanation (and Python code snippets) in the second half. All the code for this project can be found on GitHub.

Let’s Talk about Lex (and What Makes a Podcast Good)

The Lex Fridman Podcast launched in 2018 with the tagline, “Conversations about the nature of intelligence, consciousness, love, and power.” As mentioned earlier, episodes follow a relatively consistent interview format, the majority of which feature Fridman and one other guest, though sometimes a third person joins in.

Using automatic speech recognition (ASR) and text analysis, we used a set of episodes between #300 (Joe Rogan) and #311 (Magatte Wade) to find out:

  • What Fridman talks about, and his most-used phrases

  • Number of words spoken on an episode-by-episode basis

  • Total talk time per episode and how it is split between Fridman and his guest

Remember, more technical explanations of how we arrived at these findings can be found in the second half of this article. But before getting to the “how” let’s start with what we found when we ran episodes from the Lex Fridman Podcast through Deepgram and The Text API.

People Versus Things

Let’s start with the basics. In the universe of possible things to talk about on a podcast, there are basically two categories: people and things. Analyzing transcripts of the Lex Fridman Podcast shows that conversations about people slightly edge out conversations about things. Here’s a chart generated with Matplotlib.

Given the interview format of Fridman’s podcast, it shouldn’t be too surprising that people are a hot topic of conversation. Although most episodes of Fridman’s podcast veer off into discussion of heady topics like artificial intelligence, robotics, biology, and other thing-related subject matter, he typically leads off the interview with questions to elicit details of his guests’ backgrounds, which contextualize their expertise and serve to make the following discussion all the more engaging.

Favorite Adjectives

Like many of us, Lex Fridman has a set of words that he uses more than others. Lex is known for saying things like “beautiful”, “poetic”, “loving”, and more. Does the transcript agree though?

Looking through Lex Fridman’s most common phrases from this folder on GitHub, we can see that he does use the words beautiful, loving, and poetic often enough that they get picked up! We can also see that a lot of the time, these conversations center around types of organizations and people.

The adjectives that Lex is best known for all show up among his most common phrases.

Is this a particularly consequential finding? Not really, but it is a fun one, since it validates the perception that these are some of Lex’s favorite words. Perhaps it suggests a certain tendency toward awe-struck wonder on Fridman’s part, but we’ll leave further psycholinguistic speculation for another time.

Talk Time and Lexical Density

One of the advantages that podcasting has over conventional radio programming is that one is not constrained by the clock. Radio producers have to time their content perfectly to fit into a specific time slot—44 minutes of content, 12 minutes of sponsorship reads, and 4 minutes of FCC-mandated station IDs to make up one 60-minute program block, for example.

A podcast episode can be as long or as short as its host wants, and data show that installments of the Lex Fridman Podcast tend to be pretty long. The Matplotlib chart below shows talk time, in seconds, across the sample set of podcast episodes we analyzed.

In this sample we can see that most episodes average out at over 2 hours (7,200 seconds) of talk time. Looking at the overall catalog of the Lex Fridman Podcast, some episodes (such as his conversation with linguist Noam Chomsky) are as short as 45 minutes, while others (like the second time computer scientist Steven Wolfram appeared on the show) stretch for over four hours. This demonstrates an ability to hew to particular time constraints as needed, but a preference for extended conversations.

The above plot hints at something that’s made explicit in the next one: as an interviewer, Lex Fridman is good at letting his guests lead the conversation.

His share varies from episode to episode, but at least in this sample set, Fridman accounts for roughly one third of the talk time on average. This, in short, is what makes the interview format work for the Lex Fridman Podcast. Like any decent interviewer, Fridman prompts his guests with open-ended questions and uses follow-ups and transitions to create a cohesive conversational arc where guests’ perspectives are at the forefront.

All this being said, it’s worth noting that although Fridman is generous with giving guests time to speak expansively, when it comes to the sheer number of words spoken, Fridman and his guests are more evenly split.

Some episodes may contain a short novel’s worth of words, and Fridman is usually responsible for saying at least 40 percent of them.

This suggests that Fridman speaks quickly (at least relative to his guests) or uses a lot of shorter words when speaking, and, most likely, both factors influence Fridman’s share of total word count.

Project Walkthrough: How We Analyzed The Lex Fridman Podcast

This was a pretty involved project that relied on several different tools and API endpoints. This half of the article walks through the process of extracting, parsing, and processing podcast audio that we undertook to arrive at the results presented in the first half.

Here’s the technical explanation and Python snippets promised in the introduction.

Step 0: Prepare Dependencies and Prerequisites for Automatic Podcast Transcription

Although his podcast is also distributed via RSS, the original podcasting protocol, Fridman—like many newcomers to the ‘casting game—relies on YouTube as his primary distribution channel.

To extract and process audio from Fridman’s YouTube channel, you’ll need two web API tools (Deepgram and The Text API), the youtube_dl, requests, and matplotlib libraries, and FFMPEG. You can get FFMPEG in multiple ways: if you are on a Mac, you can brew install ffmpeg; if you’re on a Windows or Linux machine, you can find FFMPEG here.

The Python libraries can be installed with pip install youtube_dl requests matplotlib. You can also use conda if that’s your package manager. We also need to get API keys for Deepgram for transcription and The Text API for text analysis. If you plan to call FFMPEG from Python, you can also run pip install ffmpeg-python to get the Python bindings.

Step 1: Download Podcast Audio from Youtube

The first thing we need to do, as with most any ML task, is get the data. In this case, our data is the podcast audio. This is the part of the project that requires the use of youtube_dl and ffmpeg. We could go look for podcasts on another site, but luckily Lex posts all of his to YouTube and we have a YouTube downloader.

We start by importing youtube_dl into our project. Next, we declare the parameters for downloading our YouTube videos. Our main goal here is to get the audio files, so we want the options to reflect that.

We choose bestaudio/best as our format to indicate that we want to download the best available audio. Then, we apply postprocessors, which indicate what to do once the download finishes. In this case, we tell youtube_dl that we want to use FFMPEG to extract the audio into an MP3 file at 192 kbps. Finally, the last option we provide is where to save the output. In this example, I’m saving it to lex/audio.
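Here’s a minimal sketch of those download options. The exact code in the repo may differ; the output path mirrors the lex/audio example above.

```python
import youtube_dl

# download options: grab the best available audio and extract it to MP3 via FFMPEG
ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',  # 192 kbps bitrate
    }],
    'outtmpl': 'lex/audio/%(title)s.%(ext)s',  # save to lex/audio
}
```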

Now all that’s left to do is call the YouTube downloader with the link to the first video in Lex Fridman’s podcast playlist. This will automatically run down the playlist and download the later videos too. We stop after downloading 10 in this example due to the size of the podcasts. We leave analyzing the other 300+ two-hour-long episodes as an exercise for the reader.
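A sketch of that download call follows; the playlist URL is a placeholder, and the playlistend option is one way (an assumption on my part) to stop after the first 10 entries.

```python
# placeholder: link to the first video in the podcast playlist
PLAYLIST_URL = 'https://www.youtube.com/watch?v=...'

ydl_opts['playlistend'] = 10  # stop after the first 10 videos

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([PLAYLIST_URL])
```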

Step 2: Use Deepgram to Get Podcast Transcriptions with Speaker Diarization

Step 1, get the data? Check. Step 2: automatically transcribe podcasts with speaker diarization. We use the Deepgram SDK to run transcription on prerecorded audio. The other libraries we need for this part are the asyncio, json, and os libraries. I’ve stored my API key in a separate config file and loaded it in by importing it.

The first thing we need to do is initialize a Deepgram client with our API key. Next, let’s set up some options. The main options we want here are diarize, punctuate, and paragraphs. The diarize and punctuate arguments help us get back a punctuated podcast transcription with speaker diarization. We use paragraphs to have the transcript split into paragraphs. Click here for more info about transcription options.

model and tier define the model type and tier that we want to use. Deepgram has multiple models, from a general model to ones suited for meetings, phone calls, financial vocabulary, and more. For our purposes, we’ll use the enhanced general model to take advantage of its high accuracy on the long-tail vocabulary we commonly see in Lex’s content. Next, we set up an async function. The Deepgram Python SDK currently uses an async interface for both real-time and pre-recorded audio. A synchronous version for pre-recorded audio is coming.
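A sketch of the client setup and options described above; the config import assumes the API key lives in a local config.py, as mentioned earlier.

```python
import asyncio
import json
import os

from deepgram import Deepgram

from config import DEEPGRAM_API_KEY  # assumption: key stored in a local config.py

# initialize a Deepgram client with our API key
deepgram = Deepgram(DEEPGRAM_API_KEY)

options = {
    'diarize': True,     # label who is speaking
    'punctuate': True,   # add punctuation and capitalization
    'paragraphs': True,  # split the transcript into paragraphs
    'model': 'general',
    'tier': 'enhanced',  # the enhanced general model
}
```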

In our function, we use the os module to get a list of the filenames in the folder we’ve stored the audio in. Next, we loop through that list. For each of those podcasts, we read the podcast in as bytes and pass that to the Deepgram SDK along with the options we created earlier. We await the response and store it in a response variable.

Once we’ve received the response, we dump it into a JSON file and save it. We’ll use these JSON files as our data sources for text analysis. Since we have an async function, we run it with asyncio.run(), which executes the async event loop for us.
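Putting that together, here’s a sketch of the async transcription loop using the SDK’s prerecorded interface; the folder names are assumptions.

```python
AUDIO_DIR = 'lex/audio'             # where Step 1 saved the MP3s
TRANSCRIPT_DIR = 'lex/transcripts'  # assumption: output folder for raw JSON responses

async def transcribe_all():
    for filename in os.listdir(AUDIO_DIR):
        # read the podcast in as bytes and pass it to the SDK with our options
        with open(os.path.join(AUDIO_DIR, filename), 'rb') as audio:
            source = {'buffer': audio, 'mimetype': 'audio/mp3'}
            response = await deepgram.transcription.prerecorded(source, options)
        # dump the response into a JSON file for later text analysis
        out_path = os.path.join(TRANSCRIPT_DIR, filename.replace('.mp3', '.json'))
        with open(out_path, 'w') as f:
            json.dump(response, f)

asyncio.run(transcribe_all())
```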

Step 3: Pretty Print the Lex Fridman Podcast Transcripts

Now that we have the podcast transcripts with speaker diarization, let’s make them look pretty. You can see what the raw transcripts look like in the GitHub folder. For this portion, we need to use the json and os libraries. We create two functions to turn the podcast transcript into a pretty script.

First, we create a function to create the pretty transcripts from the speaker diarized transcript output we got. Second, we create a function that assigns the speaker names to the labeled speakers.

Step 3.1: Extract the Podcast Transcript for Pretty Printing

Let’s create our first function. This function takes no parameters. It acts on all the transcripts in one go. First, we use the os.listdir() function to get all the transcripts. For each of these transcripts, we load the JSON file into a local variable for storage.

Next, we need to parse the transcript. As shown above, the JSON format of the transcript is deeply nested and has more information than we need. All we need from the JSON podcast transcript is the actual words; we can disregard the confidences and such.

An example template of a JSON response from Deepgram with paragraphs and words looks like this:
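Here’s an abridged sketch of that shape (values are placeholders, and some fields are omitted):

```json
{
  "metadata": { "...": "..." },
  "results": {
    "channels": [{
      "alternatives": [{
        "transcript": "...",
        "confidence": 0.99,
        "words": [
          { "word": "the", "start": 0.0, "end": 0.1, "confidence": 0.99, "speaker": 0 }
        ],
        "paragraphs": {
          "transcript": "\nSpeaker 0: ...",
          "paragraphs": [
            { "sentences": [ "..." ], "speaker": 0, "start": 0.0, "end": 12.5 }
          ]
        }
      }]
    }]
  }
}
```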

We dive into the results key, then the first element of channels, then the first element of alternatives. From there, we extract the paragraphs key for the set of paragraphs. Remember, we set the paragraphs = True option earlier.

From the paragraphs, we extract the speaker diarized transcript. Once we have that transcript, we loop through it and write each line to a new file. To create all the pretty transcripts, we just run the create_transcripts function. Another approach would be to have the loop of transcripts outside of the function and have the function just create one pretty script from a file.
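A minimal sketch of create_transcripts under those assumptions (the folder names are mine, not necessarily the repo’s):

```python
import json
import os

TRANSCRIPT_DIR = 'lex/transcripts'  # raw Deepgram JSON responses from Step 2
PRETTY_DIR = 'lex/pretty'           # assumption: output folder for the pretty scripts

def create_transcripts():
    for filename in os.listdir(TRANSCRIPT_DIR):
        with open(os.path.join(TRANSCRIPT_DIR, filename)) as f:
            transcript = json.load(f)
        # dive into results -> channels[0] -> alternatives[0] -> paragraphs
        alternative = transcript['results']['channels'][0]['alternatives'][0]
        diarized = alternative['paragraphs']['transcript']
        # write each line of the diarized transcript to a new file
        out_path = os.path.join(PRETTY_DIR, filename.replace('.json', '.txt'))
        with open(out_path, 'w') as out:
            for line in diarized.split('\n'):
                out.write(line + '\n')

create_transcripts()
```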

Step 3.2: Assign Names to Recognized Speakers from Speaker Diarization

Now that we’ve extracted the pretty version of the podcast transcript, let’s assign speakers. Deepgram’s speaker diarization feature split the speakers out for us, but we have to assign their names. Just like with the last function, this function acts on all the files at once.

First, we go into the directory where we saved all the pretty printed podcast transcripts. For each one of these, we start by printing which file we’re on so we have an idea of the expected speakers. These 10 Lex Fridman podcasts each have two speakers, Lex and the Guest.

We use two lists to keep track of the speakers. One list keeps track of speakers by number (assigned by the speaker diarization feature). The other keeps track of speaker names. Any line in the transcript that starts with “Speaker” marks a diarized speaker.

We prompt the user (us) to manually enter the name of the speaker as we go through all the speakers. If a speaker has already been added to the spoken list, which keeps track of the numbered speakers, we skip it.

When we finish labeling the numbered speakers, we go in and find and replace all the “Speaker” strings with the corresponding name. Finally, we write that to the same document.
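Here’s a sketch of that labeling flow, assuming the diarized lines start with “Speaker 0:”, “Speaker 1:”, and so on:

```python
import os

PRETTY_DIR = 'lex/pretty'  # assumption: folder of pretty transcripts from Step 3.1

def assign_speakers():
    for filename in os.listdir(PRETTY_DIR):
        print(f'Labeling speakers for {filename}')  # so we know which episode we're on
        path = os.path.join(PRETTY_DIR, filename)
        with open(path) as f:
            text = f.read()
        spoken = []  # speaker numbers we've already seen
        names = []   # the name entered for each number
        for line in text.split('\n'):
            if line.startswith('Speaker'):
                number = line.split(':')[0].replace('Speaker ', '')
                if number in spoken:
                    continue  # this speaker is already labeled
                spoken.append(number)
                names.append(input(f'Who is Speaker {number}? '))
        # find and replace each "Speaker <n>" with the entered name
        for number, name in zip(spoken, names):
            text = text.replace(f'Speaker {number}:', f'{name}:')
        with open(path, 'w') as f:
            f.write(text)

assign_speakers()
```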

Step 4: Time Spent Speaking by Lex vs. Guests

Now that we have a pretty podcast transcript with speaker diarization, we can do some more advanced analysis. Let’s take a look at how often Lex speaks versus how often his guests speak. For this analysis, we use the json and os libraries as well. We are going to analyze the length of time spoken and the number of words said.

Step 4.1: Assigning Speaker Time and Words Said from the Podcast Transcript

First, we are going to analyze the length of time spoken by each member. The first part of this doesn’t require the pretty transcripts we made earlier. We need access to the originally returned podcast transcriptions.

The first thing we do in this section is the speaker labeling we did above. The main difference is that we keep track of the start and end time of each paragraph said. Once we have these paragraphs, we add up all the times a speaker spoke and store that as the value in a dictionary with the speaker’s name as the key. Then we dump the resulting dictionary into a JSON file. We use this JSON file to visualize the results later.
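A sketch of that timing pass; for brevity, this version keys the dictionary by diarized speaker number instead of re-prompting for names, and the folder names are assumptions.

```python
import json
import os

TRANSCRIPT_DIR = 'lex/transcripts'
TIMES_DIR = 'lex/times'  # assumption: output folder for the timing dictionaries

def time_spoken():
    for filename in os.listdir(TRANSCRIPT_DIR):
        with open(os.path.join(TRANSCRIPT_DIR, filename)) as f:
            transcript = json.load(f)
        alternative = transcript['results']['channels'][0]['alternatives'][0]
        times = {}
        for paragraph in alternative['paragraphs']['paragraphs']:
            speaker = f"Speaker {paragraph['speaker']}"
            # add this paragraph's duration to the speaker's running total
            times[speaker] = times.get(speaker, 0) + paragraph['end'] - paragraph['start']
        with open(os.path.join(TIMES_DIR, filename), 'w') as out:
            json.dump(times, out)

time_spoken()
```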

Step 4.2: Finding the Number of Words Said by Each Speaker

In the words_said function, we take the pretty podcast transcript and count the words from each speaker. Outside of just counting the words, we also need to keep track of the current speaker. While looping through the transcript, if any line doesn’t start with a speaker name, we add it to the last speaker. Once we’ve separated all the speakers, we dump the resulting dictionary into a JSON file.
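A sketch of words_said; the “Name:” prefix check is a simple heuristic for spotting speaker-labeled lines, not necessarily the exact logic in the repo.

```python
import json
import os

PRETTY_DIR = 'lex/pretty'
WORDS_DIR = 'lex/words'  # assumption: output folder for the word-count dictionaries

def words_said():
    for filename in os.listdir(PRETTY_DIR):
        counts = {}
        current = None  # the current speaker, for continuation lines
        with open(os.path.join(PRETTY_DIR, filename)) as f:
            for line in f:
                name, sep, rest = line.partition(': ')
                if sep and len(name.split()) <= 3:  # heuristic: line starts with "Name: "
                    current = name
                    text = rest
                else:
                    text = line  # no speaker label, so attribute it to the last speaker
                if current is not None:
                    counts[current] = counts.get(current, 0) + len(text.split())
        out_path = os.path.join(WORDS_DIR, filename.replace('.txt', '.json'))
        with open(out_path, 'w') as out:
            json.dump(counts, out)

words_said()
```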

Step 5: Use Matplotlib to Visualize Lex Fridman’s Podcast by Speaker

Now comes the part where we plot the values. For this part, we need the matplotlib and json libraries. As displayed above in the “findings” part of this article, we plotted both the word separation and the time analysis.

Step 5.1: Visualize Time Speaking and Words Said

First, we plot the time speaking graph. We open and load the JSON file in as a dictionary. Next, we split the dictionary into values for time spoken by Lex vs. the guest, and create a list containing the names of the guests. We pass these lists to Matplotlib and give it some options to make the graph pretty and to stack the bars.

Note that in the second ax.bar call, we include the values in the first call to Lex’s speaking time with the parameter bottom so we can stack the guest’s time on top of that in the bar. We set a legend, the bar size, tell the plot to make the bar labels angled, set the title and axes, save the plot to an image, and then show it. The time speaking image is shown above.
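Here’s a sketch of that stacked bar plot. It assumes the per-episode timing dictionaries from Step 4.1 have been merged into one episode-to-speakers JSON file; the filename and structure are assumptions.

```python
import json

import matplotlib.pyplot as plt

# assumed structure: {"episode": {"Lex": seconds, "<Guest Name>": seconds}, ...}
with open('lex/times.json') as f:
    times = json.load(f)

guests, lex_times, guest_times = [], [], []
for episode, speakers in times.items():
    lex_times.append(speakers.pop('Lex'))
    (guest, seconds), = speakers.items()  # the one remaining speaker is the guest
    guests.append(guest)
    guest_times.append(seconds)

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(guests, lex_times, label='Lex')
ax.bar(guests, guest_times, bottom=lex_times, label='Guest')  # stack guest time on Lex's
ax.set_title('Talk Time per Episode of the Lex Fridman Podcast')
ax.set_ylabel('Seconds spoken')
ax.legend()
plt.xticks(rotation=45, ha='right')  # angle the bar labels
plt.tight_layout()
plt.savefig('talk_time.png')
plt.show()
```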

Once we create the time speaking graph, it’s the same process to create the words said graph. We only need to change the labels and title. Everything else remains the same. The words said graph is at the bottom of this section. The images show that Lex is probably a faster speaker than most of his guests: he says a similar number of words in each show but speaks for less time.

Step 5.2: Visualize Time Speaking and Words Said by Percentage

Since Lex’s podcasts are all different lengths, we figured it would be helpful to also check out percentages. So here are the corresponding graphs but with the percentage of time spoken and percentage of words said. As you can see, Lex averages about 40% of the words said, but only 25% for the length of time speaking.

The code to create the visualizations is below. The only functional difference from the earlier visualizations is the three lines to convert the values into percentages. First, we sum the values in the time or word tracking dictionary. Next, we re-assign the value of each key to its value divided by the sum. This gives us a proportion to measure against.
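Those three lines look something like this, applied to each episode’s dictionary before plotting:

```python
# convert raw seconds (or word counts) into proportions of the episode total
total = sum(speakers.values())
for name in speakers:
    speakers[name] = speakers[name] / total
```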

Step 6: Use The Text API to Identify Common Phrases

Next, let’s take a look at the most common phrases mentioned in each podcast. This part is where The Text API comes in. For this section, we need the json and os libraries again. We also need the requests library to send API requests and our Text API key. The first thing that we do is create the headers to send off to the API with our API key.

We create one function. I call it nlp below, but feel free to call it whatever makes sense to you. This function takes no parameters and, just like the functions above, one call loops through all the podcasts we’re working on. It starts by opening up the pretty printed podcast transcripts we created earlier.

We separate the speakers by looping through each line of the podcast. Similar to what we did when we added up the number of words said, we’re going to coalesce all the words into a single blob of text for the speaker. Once we have all those words together, we call the Text API.

To call The Text API, we need a body to send as the JSON argument of the request. We create a body and assign all of a speaker’s text to the text key. The most_common_phrases endpoint also accepts a num_phrases option that tells the API how many phrases to return. The default is three; here we’ll ask for five. Once we get our request back, synchronously this time, we write the results to a text file.
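A sketch of the whole nlp function follows. The endpoint URL and response key reflect my reading of The Text API docs at the time and should be treated as assumptions, as should the folder names.

```python
import os

import requests

from config import TEXT_API_KEY  # assumption: The Text API key stored in config.py

headers = {'Content-Type': 'application/json', 'apikey': TEXT_API_KEY}
MCP_URL = 'https://app.thetextapi.com/text/most_common_phrases'  # endpoint URL assumed

PRETTY_DIR = 'lex/pretty'
PHRASES_DIR = 'lex/phrases'  # assumption: output folder for the phrase files

def nlp():
    for filename in os.listdir(PRETTY_DIR):
        # coalesce each speaker's lines into one blob of text (same heuristic as Step 4.2)
        speakers = {}
        current = None
        with open(os.path.join(PRETTY_DIR, filename)) as f:
            for line in f:
                name, sep, rest = line.partition(': ')
                if sep and len(name.split()) <= 3:
                    current = name
                    speakers[current] = speakers.get(current, '') + rest
                elif current is not None:
                    speakers[current] += line
        for name, text in speakers.items():
            body = {'text': text, 'num_phrases': 5}
            response = requests.post(MCP_URL, headers=headers, json=body)
            phrases = response.json()['most common phrases']  # response key assumed
            out_path = os.path.join(PHRASES_DIR, f'{filename[:-4]}_{name}.txt')
            with open(out_path, 'w') as out:
                out.write('\n'.join(str(p) for p in phrases))

nlp()
```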

Step 6.1: Use Matplotlib to Visualize Phrases and Subject Data

I plotted these using a similar method to the time speaking visualizations. First we import the matplotlib and os libraries. Then we create a list of the files that specifically are Lex Fridman’s most common phrases. From there we create two dictionaries to represent the words we are looking for. In this case, the adjectives “beautiful”, “poetic”, “fascinating”, and “loving” as well as the subjects “things” and “people”.

Next, we establish some constants for what we want to call the graph and how we want to label it. Now, let’s make our graph visualization function. The main difference between this code and the other visualization code above lies in its customizability. Also, since we aren’t stacking bars, we don’t need a second axis to plot a second set of data.
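A sketch of that counting-and-plotting pass; the filename filter for Lex’s phrase files is an assumption about how the Step 6 outputs were named.

```python
import os

import matplotlib.pyplot as plt

PHRASES_DIR = 'lex/phrases'

adjectives = {'beautiful': 0, 'poetic': 0, 'fascinating': 0, 'loving': 0}
subjects = {'things': 0, 'people': 0}

# assumption: Lex's phrase files from Step 6 have "Lex" in their filenames
lex_files = [f for f in os.listdir(PHRASES_DIR) if 'Lex' in f]
for filename in lex_files:
    with open(os.path.join(PHRASES_DIR, filename)) as f:
        text = f.read().lower()
    for word in adjectives:
        adjectives[word] += text.count(word)
    for word in subjects:
        subjects[word] += text.count(word)

fig, ax = plt.subplots()
ax.bar(list(adjectives), list(adjectives.values()))  # no stacking, so one bar call suffices
ax.set_title("Lex Fridman's Favorite Adjectives in His Most Common Phrases")
ax.set_ylabel('Occurrences in most common phrases')
plt.tight_layout()
plt.savefig('adjectives.png')
plt.show()
```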

Step 7: Use The Text API to Extract Named Entities from a Transcribed Podcast

We can do further processing on the text from the transcribed podcast as well. One of the most popular text analysis techniques for understanding a transcription is Named Entity Recognition (NER). Extracting and analyzing the named entities will tell us more about the who, what, when, and where that these podcasts focus on.

Due to the size of these transcripts, we’re going to split them up for faster processing. You can likely process each transcript in one shot, but my internet connection doesn’t stay alive long enough for the response. The Named Entity Recognition function takes two parameters: the text being analyzed and the filename to save to.

We split these texts into (max) 1500 sentence blocks (roughly 2 hours of talking or 10 MB of data). Then we will process each one by sending it to The Text API’s NER endpoint and waiting on the response. We have to run these sequentially because we are sending requests synchronously. If we send them asynchronously, we can run the requests in parallel. Once we get the results back, we save them to a file.
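A sketch of that chunked NER pass, re-using the headers from the Step 6 sketch; the endpoint URL, response key, and sentence-splitting heuristic are assumptions.

```python
import requests

NER_URL = 'https://app.thetextapi.com/text/ner'  # endpoint URL assumed

def ner(text, filename):
    # split into blocks of at most 1500 sentences for faster, more reliable requests
    sentences = text.split('. ')
    chunks = ['. '.join(sentences[i:i + 1500]) for i in range(0, len(sentences), 1500)]
    entities = []
    for chunk in chunks:  # sequential, synchronous requests
        response = requests.post(NER_URL, headers=headers, json={'text': chunk})
        entities += response.json()['ner']  # response key assumed
    with open(filename, 'w') as out:
        for entity in entities:
            out.write(f'{entity}\n')
```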

We use a main() function to run the NER. You can opt to just run the function itself, but this is how I opt to run NER on each of the files. We re-use this same main function below when getting the summaries of each of the podcast transcripts.
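That main() runner might look like this, with the folder names again assumed:

```python
import os

PRETTY_DIR = 'lex/pretty'
NER_DIR = 'lex/ner'  # assumption: output folder for the NER files

def main():
    for filename in os.listdir(PRETTY_DIR):
        with open(os.path.join(PRETTY_DIR, filename)) as f:
            text = f.read()
        ner(text, os.path.join(NER_DIR, filename.replace('.txt', '_ner.txt')))

if __name__ == '__main__':
    main()
```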

Step 8: Use The Text API to Generate Summaries by Episode

The last piece of analysis we do in this post looks at summaries of the podcasts we transcribed. Summaries provide a quick overview of the given text, which is especially nice when we don’t want to read 2+ hours of a podcast transcript. We use The Text API’s AI text summarizer to create a summary of each podcast.

We create a summary in much the same way we extract the named entities. We split the text file up, again into blocks of 1500 sentences max, and then summarize them and smash them together. We run the requests sequentially and then store the returned values into a text file. You can read all the summaries of the transcribed podcasts on GitHub.
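A sketch of the summarizer, following the same chunk-then-combine pattern as the NER function (endpoint URL and response key assumed):

```python
import requests

SUMMARIZE_URL = 'https://app.thetextapi.com/text/summarize'  # endpoint URL assumed

def summarize(text, filename):
    sentences = text.split('. ')
    chunks = ['. '.join(sentences[i:i + 1500]) for i in range(0, len(sentences), 1500)]
    summary = ''
    for chunk in chunks:  # run the requests sequentially
        response = requests.post(SUMMARIZE_URL, headers=headers, json={'text': chunk})
        summary += response.json()['summary'] + ' '  # response key assumed
    with open(filename, 'w') as out:
        out.write(summary.strip())
```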

Summary of Automatic Speech Recognition + NLP for Podcast Analysis

Whew, that’s a lot of podcast analysis. We started this project by using youtube_dl and ffmpeg to get the audio of many episodes of Lex Fridman’s podcast from YouTube. Armed with the audio files, we went to Deepgram to get our podcast transcriptions. After successfully transcribing the episodes, we ran some analysis.

The first thing we did was take a look at talk time and words said. We plotted our findings, which show that Lex lets his guests talk for much more of the time but says a similar number of words. The takeaway? Either Lex says a bunch of small words or talks fast.

Next, we used text analysis with The Text API to analyze the podcast transcripts further. First, we used it to find the most common phrases. This finding led us to confirm that Lex does indeed use the words “beautiful”, “poetic”, and “loving” a lot. After finding the most common phrases, we went and looked at how to extract the named entities and get the summaries.

To recap: our podcast transcription + NLP analysis gave us a glimpse into the structure of the legendary podcast series by Lex Fridman. It showed us that Lex lets his guests talk much more than he does, and that he does use the words he’s known for using a lot.

Feel free to derive your own conclusions from the provided NER files and summaries from the GitHub repo.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub Discussions.
