
Trained on 100,000+ Voices: Deepgram Unveils Next-Gen Speaker Diarization and Language Detection Models

By Josh Fox
Published May 11, 2023
Updated Jun 13, 2024

TL;DR:

  • Deepgram’s latest diarization model offers best-in-class accuracy, matching or beating the best alternatives. In addition to outstanding accuracy, our language-agnostic diarization model processes audio 10 times faster than the nearest competitor.

  • We’ve revamped our automatic language detection feature, resulting in a 43.8% relative error rate improvement across all languages and an average 54.7% relative error rate improvement on high-demand languages (English, Spanish, Hindi, and German).

Reducing confusion where it counts: New Speaker Diarization

Today, we’re releasing a new architecture behind our speaker diarization feature for pre-recorded audio transcription. 

Compared to notable competitors, Deepgram’s diarization delivers:

  • 53.1% improvement in accuracy over the previous version, meeting or exceeding the performance of our best competitors.

  • 10X faster turnaround time (TAT) than the next fastest vendor.

  • Language-agnostic support, unlocking accurate speaker labeling for transcription use cases around the globe. Language-agnostic operation means that no additional training will be required, and no change in performance will occur, as Deepgram expands language support in the future.

Speaker diarization—free with all of our automatic speech recognition (ASR) models, including Nova and Whisper—automatically recognizes speaker changes and assigns a speaker label to each word in the transcript. This greatly improves transcript readability and downstream processing tasks. Reliable speaker labeling enables assignment of action items, analysis of agent or candidate performance, measurement of talk time or fluency, role-oriented summarization, and many additional NLP-powered operations that add substantial value for our customers’ customers. 
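For developers who want to see what that looks like in practice, here is a minimal sketch of a diarized pre-recorded transcription request in Python using the requests library. The endpoint and the diarize parameter follow Deepgram's public API, while the API key, file name, and response handling below are illustrative placeholders; consult the API Documentation for the authoritative details.

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder; use your own key

# Request a pre-recorded transcription with diarization enabled.
# The diarize=true parameter asks Deepgram to attach a speaker label
# to every word in the returned transcript.
with open("meeting.wav", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova", "diarize": "true", "punctuate": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )
response.raise_for_status()

# With diarization on, each word object carries a "speaker" index.
words = response.json()["results"]["channels"][0]["alternatives"][0]["words"]
for w in words:
    print(f"[speaker {w.get('speaker')}] {w['word']}")
```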

Our speaker diarization feature stands out for its ability to work seamlessly across all languages we support, a key differentiator from alternative solutions that is especially valuable in multilingual environments such as contact centers and other global segments where communication frequently occurs in multiple languages.

Today’s performance improvements provide significant benefits to end users in two ways. First, these upgrades significantly reduce costly manual post-processing steps, such as correcting inaccurately formatted transcripts. Second, they increase operational efficiency by accelerating workflows, now enabled by faster turnaround times for diarized ASR results. From improved customer service in call centers to better understanding of medical conversations, this leads to better outcomes and increased ROI for your customers. 

Our Approach: More voices, without compromise

At a high level, our approach to diarization is similar to other cascade-style systems, consisting of three primary modules for segmentation, embeddings, and clustering. We differ, however, in our ability to leverage our core ASR functionality and data-centric training to increase the performance of these functions, leading to better precision, reliability, and consistency overall. 

Speaker diarization is ultimately a clustering problem that requires having enough unique voices in distribution for the embedder model to discriminate between them accurately. If voices are too similar, the embedder may not differentiate between them, resulting in a common failure mode where two distinct speakers are recognized as one. Out-of-distribution errors can occur when the training set was not sufficiently representative of the voice characteristics (e.g., strong accents, dialects, speakers of different ages) encountered during inference. In this failure mode, the embedder may not produce the right number of clusters, and the same speaker may be incorrectly given multiple label assignments.
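To make the segmentation-embedding-clustering cascade concrete, here is a deliberately simplified sketch in Python. The embed function is a stand-in for a trained speaker-embedding network (not Deepgram's model), and the clustering threshold is an arbitrary illustrative value; the point is the shape of the pipeline, where clustering by distance threshold rather than a fixed cluster count means the number of speakers need not be known in advance.

```python
import zlib
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed(segment: np.ndarray) -> np.ndarray:
    """Stand-in speaker embedder. A real system uses a trained neural
    network that maps an audio segment to a voice-characteristic vector."""
    rng = np.random.default_rng(zlib.crc32(segment.tobytes()))
    return rng.standard_normal(192)  # 192 dims is a common embedding size

def diarize(segments: list[np.ndarray]) -> list[int]:
    """Cluster per-segment speaker embeddings into speaker labels."""
    X = np.stack([embed(s) for s in segments])
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # cosine-style normalization
    clustering = AgglomerativeClustering(
        n_clusters=None,          # speaker count is not fixed in advance
        distance_threshold=1.0,   # illustrative; tuned in a real system
        linkage="average",
    )
    return clustering.fit_predict(X).tolist()
```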

Our diarization accurately identifies speakers in complex audio streams, reducing such errors and resulting in more accurate transcripts through a number of intentional design choices in our segmentation, embedding, and clustering functions and our groundbreaking approach to training. 

To overcome even the rarest of failure modes, our embedding models are trained on over 80 languages and 100,000 speakers recorded across a substantial volume of real-world conversational data. This extensive and diverse training results in diarization that is agnostic to the speaker’s language and robust across domains, meaning consistently high accuracy across different use cases (e.g., meetings, podcasts, phone calls). That means our diarization can stand up to the variety of everyday life: noisy acoustic environments, ranges in vocal tenor, nuances in accents, you name it.

Our large-scale multilingual training approach is also instrumental to achieving unprecedented speed, since it allows us to employ fast and lean networks that are dramatically less expensive than current state-of-the-art approaches, while still obtaining world-class accuracy. We coupled this with extensive algorithmic optimizations to achieve maximal speed within each step of the diarization pipeline. As with Deepgram’s ASR models writ large, the value of our diarization lies in both speed and accuracy. There is no limit to the number of speakers that can be recognized (unlike most alternatives), and the system even performs well in a zero-shot mode on languages wholly unseen during training.

Figure 1: Deepgram Diarization System Architecture 

Performance: Benchmarking Deepgram’s Diarization Accuracy and Speed

In typical Deepgram fashion, we tested our own and competitors’ diarization models using a thorough scientific assay. Unlike other benchmarks based on pre-cleaned audio data from a select few sources, our comparisons were made with the help of over 250,000 human-annotated examples of spoken audio pulled from real-life situations across a diverse set of audio lengths, domains, accents, speakers, and environments. This provides a representative real-world evaluation of expected performance in production across audio domains.

Using these datasets, we calculated the time-based Confusion Error Rate (CER)[1] and compared it to other popular models. Our latest diarizer improved median CER by 53.1% over the previous version across all three domains; for each individual domain, it improved by:

  • Meeting: 61.5%

  • Podcast: 72.7%

  • Phone call: 48.5%

Our benchmarking was conducted across three discrete audio domains: meeting, video/media/podcast, and phone call. Our speaker diarization feature outperforms many commercial diarization models and common open-source alternatives like Pyannote when dealing with domain-specific, real-world data. This highlights Deepgram's outstanding accuracy independent of use case and establishes it as a leading choice for diverse speech recognition applications.

Figure 2: Comparison of average time-based confusion error rate (CER) of our new diarization feature with other popular alternatives across three audio domains: meeting, video/media/podcast, and phone call. Each boxplot shows the distribution of a dataset via its five-number summary: the minimum, first quartile (median of the lower half), median, third quartile (median of the upper half), and maximum.

Speed: Undoubtedly the Fastest Solution in the Market

To gauge our new diarization model’s performance in terms of inference speed, we compared the total turnaround time (TAT) for ASR + diarization against leading competitors using repeated ASR requests (with diarization enabled) for each model/vendor in the comparison. Speed tests were performed with the same static 15-minute file.

We compiled aggregate inference speed statistics from the resulting turnaround times and compared how quickly Deepgram ASR with the new diarization completed the same 15-minute diarized transcription task relative to each alternative. Deepgram’s ASR + diarization outperformed all other speech-to-text models, delivering inference times at least an order of magnitude faster than the next fastest competitor and 44 times faster than the slowest competitor in our comparison.
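As an illustration of the methodology (not our internal benchmarking harness), a TAT measurement of this kind can be as simple as timing repeated requests against the same file and taking the median; transcribe below is a hypothetical callable that wraps one vendor's diarized ASR request.

```python
import statistics
import time

def measure_tat(transcribe, audio_path: str, trials: int = 10) -> float:
    """Median wall-clock turnaround time (in seconds) for repeated
    ASR + diarization requests on the same static audio file."""
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        transcribe(audio_path)  # one full diarized transcription request
        times.append(time.perf_counter() - start)
    # The median is robust to occasional network or queueing outliers.
    return statistics.median(times)
```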

Deepgram’s language agnosticism and improvement in accuracy do not come at the expense of speed, as demonstrated by the TAT comparison shown in Fig. 3 below.

Figure 3: Normalized turnaround time (TAT) for transcription with diarization using the same 15-minute audio file. To ensure a fair comparison, results are based on the median TAT of repeated trials for each vendor.

Speaking your language: Accurately detecting and transcribing the languages you need

But wait, there’s more! We’ve also revamped our automatic language detection feature, which enables customers to automatically detect the dominant language in an audio file and transcribe the output in the detected language. Deepgram's new language detection feature provides unparalleled accuracy in detecting and transcribing audio data in more than 15 languages and dialects, including English, Spanish, Hindi, Dutch, French, and German.
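A minimal request sketch, analogous to the diarization example above: the detect_language parameter follows Deepgram's public API, while the key, file, and response fields shown are illustrative, so consult the API Documentation for specifics.

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder; use your own key

# Ask Deepgram to detect the dominant language and transcribe in it.
with open("call.mp3", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"detect_language": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/mpeg",
        },
        data=audio,
    )
response.raise_for_status()

channel = response.json()["results"]["channels"][0]
print("Detected language:", channel.get("detected_language"))
print("Transcript:", channel["alternatives"][0]["transcript"])
```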

In comprehensive testing, the updated language detection feature delivers:

  • 43.8% relative error rate improvement over the previous version across all languages.

  • 54.7% relative error rate improvement on high-demand languages (English, Spanish, Hindi, and German).

  • Inference speed that is on par with or faster than the previous version across a range of audio durations.

Deepgram's updated language detection feature delivers greater accuracy and reliability in detecting the dominant language across a variety of audio domains, including conversational AI, podcasts, meetings, customer service calls, and sales interactions. For customers serving end users in multilingual environments, this functionality makes it possible to offer personalized customer experiences, serve diverse audiences by routing calls to agents who speak the same language as the customer, improve accessibility for podcast and media companies, and fulfill compliance and regulatory requirements that dictate business communications take place only in approved languages.

Performance: Measuring Deepgram’s Automatic Language Detection Accuracy Improvement

To compare our new language detection against the previous version, we ran a collection of 1,000 short (10-30 second) real-world audio samples per supported language and measured the resulting detection error rates to calculate the relative error rate improvement for each supported language. The results show relative improvements ranging from 15.9% to 63.8% for our new language detection feature over the previous version across 15 benchmarked languages, with notable jumps in accuracy (a 54.7% relative error rate improvement) for high-demand languages (English, Spanish, Hindi, and German). In absolute terms, error rates dropped across the board and now range from 1.8% to 6.7% (see Fig. 4). In direct comparisons with Whisper Medium, our language detection had a 6.7% lower error rate when averaged over all tested languages.
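For clarity, the relative error rate improvement figures above follow the standard definition: the fraction of the previous version's error rate that the new version eliminates. A quick sketch with hypothetical numbers:

```python
def relative_improvement(old_error: float, new_error: float) -> float:
    """Fraction of the old error rate eliminated by the new version."""
    return (old_error - new_error) / old_error

# Hypothetical example: an error rate falling from 8.0% to 3.6%
# is a (0.080 - 0.036) / 0.080 = 55% relative improvement.
print(f"{relative_improvement(0.080, 0.036):.1%}")  # -> 55.0%
```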

Figure 4: Language detection error rate by supported language.

There’s more to come

Deepgram knows that continuous improvement is the key to building a customer-obsessed company, as evidenced by our recent introduction of Nova, our next-gen speech-to-text model that beats all competitors in speed, accuracy, and cost. We said we refused to rest on our laurels, and today’s announcement is indicative of our commitment to never-ending progress.

Deepgram is dedicated to delivering practical language AI solutions for businesses, and with our new and improved diarization and language detection features, we bring new value to our customers through significant improvements in accuracy while maintaining best-in-class speed.

At Deepgram, we believe that speech is the hidden treasure within enterprise data, waiting to be discovered. Our mission is to make world-class language AI not just a possibility but a reality for every developer through a simple API call. We firmly believe that language is the key to unlocking AI's full potential, shaping a future where natural language is the backbone of human-computer interaction. As pioneers in AI-powered communication, Deepgram is committed to transforming how we connect with technology and each other.

Stay tuned for more exciting announcements soon! We’re just getting started! To learn more, please visit our API Documentation, and try out our diarization and language detection features right away in our API Playground.

If you have any feedback about this post, or anything else regarding Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions or contact us to talk to one of our product experts for more information today.


Footnotes

[1] Time-based Confusion Error Rate (CER) = confusion time / total reference speech time. Confusion time is the sum of all speech segments where the speaker is incorrectly labeled. Total speech time is the sum of all reference speech segments. Reference speech segments require accurate times, and those timings come from aligning the audio with our high-quality human-labeled transcripts.
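A minimal sketch of this metric, assuming reference and hypothesis speaker labels have already been aligned per segment (segment durations and labels below are illustrative):

```python
def confusion_error_rate(segments):
    """Time-based CER = confusion time / total reference speech time.

    `segments` is a list of (duration_seconds, ref_speaker, hyp_speaker)
    tuples over aligned reference speech; a segment contributes to
    confusion time when its hypothesis speaker label is wrong.
    """
    total = sum(d for d, _, _ in segments)
    confusion = sum(d for d, ref, hyp in segments if ref != hyp)
    return confusion / total if total else 0.0

# Illustrative example: 10 s of reference speech, 2 s mislabeled -> CER = 0.2
print(confusion_error_rate([(4.0, "A", "A"), (2.0, "B", "A"), (4.0, "B", "B")]))
```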
