Article·AI & Engineering·May 13, 2023

Why Enterprise Audio Requirements are More “Nuanced” at Real-time Speeds

By Keith Lam
Published May 13, 2023
Updated Jun 13, 2024

Nuance recently acquired Saykara, a mobile speech recognition technology provider, to expand their medical transcription business. This acquisition is one of many major investments and acquisitions in the Natural Language Processing (NLP) and customer service space that we are monitoring at Deepgram. With the Saykara acquisition in the news, we thought it would be a good time to review Nuance's speech recognition capabilities and explain why customers creating real-time experiences should consider alternatives. Enterprise audio is more nuanced than it may seem - pun intended!

Leader of previous speech tech solutions

Nuance is a great brand in the speech recognition business and has been around for over 30 years. They have gobbled up smaller speech recognition businesses, most recently Saykara, to expand their medical transcription business. If you need speech-to-text transcription, especially in a medical setting, they will always be on the list to evaluate. To be honest, our most recent survey indicates good satisfaction with Nuance.

Core architecture has remained unchanged

Nuance does a good job at speech-to-text transcription using a legacy approach that dates back to the 1970s, built on Hidden Markov Models and tri-gram language models. They have added some AI and keyword libraries to improve accuracy, but that accuracy comes at the cost of transcription speed. So for non-real-time work, like transcribing medical records, they do an admirable job. They can add hundreds of medical-specific terms, acronyms, and drug names to make their model more accurate, but doing so slows down their transcriptions.

Deepgram does not use this legacy tri-gram model. We built our speech recognition solution from scratch using a completely different architecture. Deepgram uses an end-to-end deep learning neural network, which in simple terms means we perform audio-to-text transcription in one AI-enabled step and can continually improve our accuracy with more data at the same transcription speed. Because of this architecture, customers do not have to trade off accuracy against speed, speed against cost, or cost against scalability.

In our tests, their speech recognition engine transcribed 1 hour of normal speech data (500 MB, with one CPU/GPU) in 1 hour, while Deepgram transcribed the same hour in 30 seconds. Check out this demo of Deepgram speed compared to Google, and this demo of Deepgram scale.
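
To put the two test timings above on a common scale, one option is the real-time factor: processing time divided by audio duration. The sketch below uses only the figures quoted above (1 hour of audio, transcribed in 1 hour versus 30 seconds). The function and variable names are illustrative and are not part of any Deepgram or Nuance API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration.

    A value of 1.0 means transcription takes as long as the audio itself;
    values below 1.0 mean faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


AUDIO_SECONDS = 60 * 60  # 1 hour of speech, as in the test described above

legacy_rtf = real_time_factor(60 * 60, AUDIO_SECONDS)  # transcribed in 1 hour
deepgram_rtf = real_time_factor(30, AUDIO_SECONDS)     # transcribed in 30 seconds

# How many times faster the second engine finished the same audio.
speedup = legacy_rtf / deepgram_rtf

print(f"legacy RTF:   {legacy_rtf:.4f}")
print(f"Deepgram RTF: {deepgram_rtf:.4f}")
print(f"speedup:      {speedup:.0f}x")
```

On these numbers, the legacy engine runs at exactly real time, while 30 seconds for an hour of audio works out to roughly 120 times faster.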

Enabling real-time AI is Deepgram's forte

When we talk about real-time AI for Conversational AI virtual agents, sales or support agent enablement, or real-time compliance monitoring, you need both millisecond speed and high accuracy. Customers do not want to wait for the virtual agent to transcribe what they said, send that data to the AI engine, get a response, and then turn the response from text to speech. Any lag in that process causes customer dissatisfaction. Worse still is when poor transcription accuracy makes the response incorrect, or forces the virtual agent to ask the customer to repeat what they said. For real-time streaming, the transcription lag of our AI Speech Platform is under 300 milliseconds.
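
The round trip described above can be thought of as a latency budget: each stage adds delay, and the customer experiences the sum. The sketch below is only an illustration under stated assumptions. The 300 ms transcription figure is the upper bound quoted above; the stage names, the timings for the other stages, and the one-second target are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages: list[Stage]) -> float:
    """Sum per-stage latencies, assuming the stages run strictly in sequence."""
    return sum(stage.latency_ms for stage in stages)


def slowest_stage(stages: list[Stage]) -> Stage:
    """Return the stage that contributes the most delay."""
    return max(stages, key=lambda stage: stage.latency_ms)


pipeline = [
    Stage("speech_to_text", 300),  # upper bound quoted in the article
    Stage("ai_engine", 400),       # hypothetical placeholder
    Stage("text_to_speech", 200),  # hypothetical placeholder
]
TARGET_MS = 1000  # hypothetical responsiveness target

total = total_latency_ms(pipeline)
within_budget = total <= TARGET_MS

print(f"total: {total} ms, within budget: {within_budget}")
print(f"slowest stage: {slowest_stage(pipeline).name}")
```

Swapping in measured timings for your own pipeline shows how much of the budget transcription consumes and which stage to look at first.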

Compare us

You know I'm biased, so do a comparison yourself, or we can do one for you.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
