Breaking Brahmic: How OpenAI's Text Cleaning Hides Whisper's True Word Error Rate for Many South Asian Languages
Tamil is a language spoken by 75 million people. Many of its speakers are centered in India’s southern state of Tamil Nadu and in Sri Lanka and Singapore, and there is a Tamil diaspora spread around the world. To serve all of these speakers, Deepgram recently built out a language model to transcribe Tamil.
I helped lead Deepgram’s Tamil language modeling effort. While we were working on it, OpenAI released Whisper, an open-source speech-to-text model with support for nearly 100 languages, including Tamil. When we read through their paper, the error rate they claimed for Tamil seemed implausibly low. Taken at face value, that claim would imply that Whisper was, among other things, the best Tamil model in the world.
It’s a bold claim worthy of some further investigation. Upon closer examination we found that Whisper’s performance on Tamil is pretty good, but not nearly as good as they claimed. As is so often the case, the devil is in the details. The key detail in this context: OpenAI “cleans” transcripts before evaluating their accuracy, and a bug in their cleaning process produces apparent performance far better than what the Whisper model can actually achieve. Once compared on a more even playing field, Deepgram’s Tamil model outperforms OpenAI Whisper’s Tamil model in accuracy (as well as speed and cost of use).
The issue we found is not unique to OpenAI’s treatment of Tamil, either. It applies to other languages that use related writing systems, including some of the world’s most widely spoken languages, like Hindi and Bengali. In other words, OpenAI is not reporting its model’s accuracy correctly for a set of languages spoken by over a billion people around the world.
The chart below helps to illustrate the scope of the issue.
In this article we cover the basics of how we measure transcription accuracy, how a bug in OpenAI’s text cleaning methodology “breaks” Tamil and several other South and Southeast Asian languages, and what all this means when trying to assess the performance of Whisper versus Deepgram for the languages in question.
Word Error Rate: What you need to know
The basic measure we use to evaluate a transcription model is Word Error Rate, or WER. The idea behind WER is very simple—count the number of word substitutions, deletions, and insertions required to get from the transcription to the true text, then divide by the number of words in the true text.
There are two things to keep in mind about WER when using it to evaluate a transcription model:
1. WER doesn’t account for the perceived severity of mistakes. If our model returns “horse” when the true text was “zebra,” that’s one error (a substitution). But if our transcription model returns “book shelf” when the true text was “bookshelf,” it will accumulate two errors (one substitution and one insertion), even though a reader might not notice the error.
2. To mitigate the problem in (1), most practitioners do some type of cleaning of the transcription and true text before evaluating WER. The hope is that by standardizing the text we can remove formatting errors, so that the remaining errors really are due to transcription. But because there are many different possible cleaning procedures, “WER” is more a family of measures than a single unambiguous metric.
In practice, WER remains ubiquitous because it is easy to compute, easy to understand, and gives roughly consistent results so long as practitioners make reasonable choices about text cleaning.
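The WER definition above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used by Deepgram or OpenAI: the function name `word_error_rate` and the example sentences are assumptions made for this sketch, and words are split on whitespace with no cleaning at all.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution over three reference words.
zebra_wer = word_error_rate("the zebra ran", "the horse ran")
# One substitution plus one insertion over three reference words.
shelf_wer = word_error_rate("the bookshelf fell", "the book shelf fell")
print(zebra_wer, shelf_wer)
```

The "book shelf" transcript scores twice as badly as the "horse" transcript even though most readers would consider it the smaller mistake, which is the motivation for cleaning text before scoring.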
How OpenAI's text cleaning breaks Tamil
As we were digging into Whisper’s WER results for Tamil, we struggled to explain how they could be quite that good. We crossed off one possible explanation after another, until we were left with text cleaning as the only one remaining.
Let’s see what OpenAI has to say about their text cleaning (emphasis added):
[S]ystems that output transcripts that would be judged as correct by humans can still have a large WER due to minor formatting differences. [...] We opt to address this problem with extensive standardization of text before the WER calculation to minimize penalization of non-semantic differences. Our text normalizer was developed through iterative manual inspection to identify common patterns where naive WER penalized Whisper models for an innocuous difference. Appendix C includes full details. For several datasets, we observe WER drops of up to 50 percent usually due to a quirk such as a dataset’s reference transcripts seperating [sic] contractions from words with whitespace.
As a comparison point, we generally find that models become useful at a WER of about 30 percent, so a 50 percent drop is a big deal! Of course we flip ahead to Appendix C (page 21 of the official Whisper paper) to try to figure out what’s going on. We find that non-English languages all get the same text-cleaning, with some innocuous steps like casting text to lowercase and removing parentheses. And also:
3. Replace any markers, symbols, and punctuation characters with a space, i.e. when the Unicode category of each character in the NFKC-normalized string starts with M, S, or P.
To understand why this is so startling, we need to talk a little bit about the writing system for Tamil. It is an example of a “Brahmic script”, a family of writing systems used throughout India and across Southeast Asia. In these systems there are two ways to write each vowel: an "independent" form, which is a letter in the usual sense, and a dependent form, written as a ligature attached to a consonant. For example, here’s the independent form of “i”: இ
When paired with a consonant within a syllable, it’s written as: ி
For example, the syllable “ki” would be written as:
k + i = ki
க் + இ = கி
You can see the ligature as the curly bit on the right. At this point you might have guessed the issue: in Unicode, all of these ligatures fall under the “Mark” general category (category codes starting with M), which is what the Whisper paper calls “markers”. This means that Whisper’s text cleaning removes them, and with them most of the vowels in a typical Tamil word, and replaces them with spaces. To give a sense of how that impacts the readability of a sentence, here’s a before and after example:
Original: "அலாஸ்காவின் ஃபேர்பேங்க்ஸுக்கு தெற்கே ஆயிரக்கணக்கான பீப்பாய்கள் கச்சா எண்ணெயைக் கொட்டியதைத் தொடர்ந்து டிரான்ஸ்-அலாஸ்கா பைப்லைன் அமைப்பின் 800 மைல்கள் மூடப்பட்டன."
Transliterated: "Alāskāviṉ ḥpērpēṅksukku teṟkē āyirakkaṇakkāṉa pīppāykaḷ kaccā eṇṇeyaik koṭṭiyatait toṭarntu ṭirāṉs-alāskā paiplaiṉ amaippiṉ 800 mailkaḷ mūṭappaṭṭaṉa."
"Standardized": "அல ஸ க வ ன ஃப ர ப ங க ஸ க க த ற க ஆய ரக கணக க ன ப ப ப ய கள கச ச எண ண ய க க ட ட யத த த டர ந த ட ர ன ஸ அல ஸ க ப ப ல ன அம ப ப ன 800 ம ல கள ம டப பட டன "
Transliterated: "Ala sa ka va ṉa ḥpa ra pa ṅa ka sa ka ka ta ṟa ka āya raka kaṇaka ka ṉa pa pa pa ya kaḷa kaca ca eṇa ṇa ya ka ka ṭa ṭa yata ta ta ṭara na ta ṭa ra ṉa sa ala sa ka pa pa la ṉa ama pa pa ṉa 800 ma la kaḷa ma ṭapa paṭa ṭaṉa"
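The classification behind this example can be checked with Python's standard `unicodedata` module. The sketch below is only an inspection aid, and the helper name `describe` is our own. It lists the code point, general category, and Unicode name of each character in the syllable "ki" and of the independent vowel.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


syllable = "\u0b95\u0bbf"   # கி: consonant KA followed by the dependent vowel sign I
independent_i = "\u0b87"    # இ: the independent vowel I

for row in describe(syllable + independent_i):
    print(*row)
```

Categories beginning with L are letters, and categories beginning with M are marks. The consonant and the independent vowel are letters, while the dependent vowel sign is a mark, so it falls in the class that step 3 replaces with a space.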
Remember, text cleaning is applied before computing WER. Given that this standardization breaks most words into a series of consonants, we don’t think it’s correct to call Whisper’s quoted WER results for Tamil, Hindi, and most other languages from India and Southeast Asia “word error rates”. All of the added spaces increase the number of “words” in the sentence; incorrect vowels often disappear, and incorrect consonants get counted in isolation, rather than affecting the word they were formerly part of.
All of these effects work to reduce the “WER” quoted for these languages—in our estimation, by about 30%.
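To make these effects concrete, here is a small sketch of step 3 as it is worded in the quoted appendix, applied to Tamil text. It is our own reading of that one sentence, not OpenAI's code, and the function name `replace_marks_symbols_punct` is an assumption made for this illustration.

```python
import unicodedata


def replace_marks_symbols_punct(text: str) -> str:
    """NFKC-normalize, then replace every character whose Unicode
    general category starts with M, S, or P with a space."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in normalized
    )


# Effect 1: one word becomes several "words".
word = "அலாஸ்காவின்"  # first word of the example sentence above
pieces = replace_marks_symbols_punct(word).split()
print(len(word.split()), "word before cleaning,", len(pieces), "after")

# Effect 2: a vowel error disappears. "ki" and "ku" differ only in the vowel sign.
ki = "\u0b95\u0bbf"  # கி
ku = "\u0b95\u0bc1"  # கு
same_after_cleaning = (
    replace_marks_symbols_punct(ki).split() == replace_marks_symbols_punct(ku).split()
)
print(ki != ku, same_after_cleaning)
```

Under this reading, the single word splits into five fragments, which inflates the denominator of WER, and a transcript that gets the vowel wrong becomes indistinguishable from a correct one.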
Beyond Tamil: A billion speaker issue with OpenAI's text cleaning
As we’ve alluded to, Tamil is the language where we first spotted the issues described above, but OpenAI’s text cleaning procedure calls into question Whisper’s stated word error rates across more than a dozen languages written in Brahmic scripts, including some of the most widely spoken languages in the world.
We checked, and every one of those languages is affected by this bug in a similar way. Altogether, it’s a bug affecting languages spoken by over one billion people.
How do oversights like this happen, and how can they be prevented? We don’t know how OpenAI developed their text cleaning workflow, and so can’t say how they could have avoided this particular error. However, from Deepgram's research experience, we’ve learned that having at least one native or near-native speaker of a given language on the model development team is the best way to identify and correct issues like the one we explained above. There's huge diversity in languages and how they are written, and we have to account for that in order to build the world’s best transcription models.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.