My Co-Founder and I were kicking around the idea of a search engine that would let a person find phrases in a block of audio. We were looking for something that could peer into interviews, podcasts, video lectures - things like that. And if it was done right, you would be able to search through many seasons of a certain TV show and find all the crucial moments like, "You're fired!". We thought, 'This has to exist, right?'. Surprisingly, no. There wasn't a company out there that really provided the functionality. Certainly not in a way that was useful to us, at least. So we started hacking together a Google-based transcription to see if we can get a barebones prototype going. In a couple days it was running - search for something, and most of the time you got it. Huge pat on the back, right?
Speech recognition is hard.
Reality hit us when we noticed a problem. Sometimes the phrase was definitely spoken-you could hear it plain as day in the audio stream-but the search missed it. It turns out this is due to the inaccuracy of automatic speech transcription software. We went on a quest to get our hands on some top quality speech recognition bad-assery. What we were met with was another dose of reality; speech recognition is hard. More evidence emerges when you dig into the current audio research scene and notice that this topic is still a very active topic. The big tech companies (Google, Microsoft, Apple, etc.) put forth large efforts to get this sort of thing right. Even after that, you generally only get 90% word accuracy. That's on very clean, well recorded speech. With input sources containing conversational speech of questionable quality-say, YouTube videos-the word error rate get pretty bad (more than half is wrong sometimes!).