
My co-founder and I were kicking around the idea of a search engine that would let a person find phrases in a block of audio. We were looking for something that could peer into interviews, podcasts, video lectures - things like that. And if it was done right, you would be able to search through many seasons of a certain TV show and find all the crucial moments, like "You're fired!" We thought, "This has to exist, right?" Surprisingly, no. There wasn't a company out there that really provided the functionality. Certainly not in a way that was useful to us, at least. So we started hacking together a Google-based transcription pipeline to see if we could get a barebones prototype going. In a couple of days it was running: search for something, and most of the time you got it. Huge pat on the back, right?

Speech recognition is hard.

Reality hit us when we noticed a problem. Sometimes the phrase was definitely spoken (you could hear it plain as day in the audio stream), but the search missed it. It turns out this was due to the inaccuracy of automatic speech transcription software. We went on a quest to get our hands on some top-quality speech recognition bad-assery. What we were met with was another dose of reality: speech recognition is hard. More evidence emerges when you dig into the current audio research scene and notice that this is still a very active research topic. The big tech companies (Google, Microsoft, Apple, etc.) put large efforts into getting this sort of thing right. Even then, you generally only get 90% word accuracy, and that's on very clean, well-recorded speech. With conversational speech of questionable quality, say, YouTube videos, the word error rate gets pretty bad (more than half the words are wrong sometimes!).
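
To make "word accuracy" and "word error rate" a bit more concrete, here is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split-on-whitespace tokenization are assumptions made for illustration. This is not the scoring code we used, and real evaluations usually normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Illustrative sketch: tokenization is a plain lowercase whitespace split.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


spoken = "you are fired get out of my office now please"
transcribed = "you are tired get out of my office now please"

wer = word_error_rate(spoken, transcribed)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
print("phrase found:", "you are fired" in transcribed)
```

In this made-up example only one word in ten is wrong, so word accuracy is 90%, yet a search for "you are fired" still misses. Because insertions count as errors, WER can also exceed 100% on very noisy output.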
