
By Keith Lam
Deepgram Alum
A funny thing happened when Deepgram first decided to use end-to-end deep learning (E2EDL) to design our next-generation speech-to-text (STT) solution: we found that this approach was hugely flexible and easier to optimize than traditional STT. We didn't have to reconnect and re-optimize separate acoustic, pronunciation, and language models every time we wanted to make a change. With transfer learning, we could also retrain and enhance our speech models without starting from scratch, so we could build new ones faster. That flexibility has allowed us to build different base speech models for different use cases and needs, and to tailor models when a customer needs something specific that we don't currently offer. Let's take a look at the two types of models that we offer here at Deepgram and what each is good for.
1. Language-by-Use Case Models
All of our use-case-specific models are available in various English dialects. As we continue to train and optimize our speech models for specific circumstances, such as call centers or meeting transcription, we are expanding into new language-by-use-case combinations and adding more spoken languages. Our customers have found that a speech model built for their specific combination of spoken language and use case is more accurate than Big Tech's out-of-the-box, one-size-fits-none models. These targeted models have the fastest speed and are optimized for scalability: our models can transcribe one hour of pre-recorded audio in 30 seconds. They are a good fit for any application, especially those that need very high speed or cost savings for on-prem use. And you don't need to trade speed or scalability for accuracy; because we offer multiple models for different use cases (unlike Big Tech), our models tend to be more accurate as well.
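To put the "one hour of audio in 30 seconds" figure in perspective, it can be expressed as a real-time factor (processing time divided by audio duration) or as a speedup over real time. The short Python sketch below is only back-of-the-envelope arithmetic on that stated figure. It is not a Deepgram tool or API, the helper names are hypothetical, and the backlog estimate assumes a single stream processed at a constant rate.

```python
# Back-of-the-envelope throughput math for the stated benchmark:
# one hour of pre-recorded audio transcribed in 30 seconds.
# Helper names are hypothetical, for illustration only.

AUDIO_SECONDS = 60 * 60      # one hour of audio
PROCESSING_SECONDS = 30      # stated processing time for that hour


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_s / audio_s


def speedup(processing_s: float, audio_s: float) -> float:
    """How many times faster than real time the transcription runs."""
    return audio_s / processing_s


def hours_to_process(backlog_audio_hours: float, processing_s: float, audio_s: float) -> float:
    """Wall-clock hours for a backlog, assuming one stream at a constant rate."""
    return backlog_audio_hours * real_time_factor(processing_s, audio_s)


if __name__ == "__main__":
    rtf = real_time_factor(PROCESSING_SECONDS, AUDIO_SECONDS)
    print(f"Real-time factor: {rtf:.4f}")
    print(f"Speedup: {speedup(PROCESSING_SECONDS, AUDIO_SECONDS):.0f}x real time")
    backlog = hours_to_process(1000, PROCESSING_SECONDS, AUDIO_SECONDS)
    print(f"1,000 hours of audio -> about {backlog:.1f} hours of processing")
```

At that rate, transcription runs about 120 times faster than real time, so a 1,000-hour archive would take roughly 8.3 hours on one stream. Real-world throughput also depends on the audio, hardware, and concurrency, so treat these numbers as an illustration of the stated benchmark rather than a guarantee.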



