As a company that builds speech-recognition software, we pride ourselves on delivering accurate, high-quality transcriptions. And that’s no simple feat. Building an automated speech recognition (ASR) model requires in-depth expertise in neural networks, an understanding of CTC loss, multiple servers, and the ability to code with AI frameworks.

(Note: We chose PyTorch.)

But even with all these tools in our belt, how exactly do we measure the quality of an ASR model?

Well, as it turns out, that’s a tough question to answer. There are quite a few cases to consider when measuring the quality of a transcription.

But let’s walk through these cases. And in the end, we’ll see why the (in)famous “Word Error Rate” is the metric of choice for ASR companies.

Part 1: The types of errors

Let’s say that we have an audio file where a person utters a sentence. For simplicity’s sake we’ll say that sentence is:

I really like grapes.

Yep, simple enough. We have a ground-truth transcript against which to measure our AI’s outputs. How tough could it be to quantify the differences between the truth and these outputs?

Well, let’s say that we have three different AI models, and each model outputs something different:

  • Model A’s output: I really really like grapes.

  • Model B’s output: I like grapes.

  • Model C’s output: I really like crepes.

Each of these models is wrong in its own way. But which one is closest to the truth? If you’re in the market for a transcription model, and these are the only three ASR tools available, which one should you buy?

Well, let’s go over these errors one by one and see if we can come up with a logical and intuitive mathematical formula to quantify how right or wrong these transcripts are.

Model A inserted an extra “really” into its transcription. However, an argument can be made that this transcription is the “best” one, since every single word that’s present in the ground-truth is also present in the transcription.

Model B, meanwhile, has the opposite problem from Model A. It deleted the word “really.” Maybe this model just didn’t hear that utterance. However, at least here we can say that every word in the transcription is also present in the ground-truth.

Model C made the mistake that humans are most prone to: mishearing words. After all, the “g” sound is just the “c” sound with some vocal cord backup [see footnote]. An argument can be made that Model C is the “best” transcription since it’s technically the most human-like.

So… which one really is the best? Well, in the world of ASR and linguistics, we’ve decided on the following philosophy to guide our metrics: All errors are created equal. 

That’s right. It’s a Democratic Republic of Errors (DRE). Whether the model’s output added a word, subtracted a word, or replaced a word, all are reprimanded equally. No single type of error is weighted more heavily than the others.

Part 2: The Formula™

And that’s where we arrive at the formula for word error rate:

Given a transcription and a ground-truth, we add up (1) the number of words the AI inserted, (2) the number of words the AI deleted, and (3) the number of words the AI substituted. We then take that sum and divide it by the number of words in the ground-truth transcript.

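Written out, the formula looks like this:

$$\text{WER} = \frac{\text{insertions} + \text{deletions} + \text{substitutions}}{\text{number of words in the ground-truth}}$$

Or, in more mathematically quantifiable terms:

$$\text{WER} = \frac{I + D + S}{N}$$

Here I, D, and S are the numbers of inserted, deleted, and substituted words, and N is the number of words in the ground-truth.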

And so, if we were to calculate the WER of each of the models’ sentences above, we’d arrive at the following results:

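Model A: I really really like grapes. (1 insertion, 0 deletions, 0 substitutions, 4 ground-truth words: WER = ¼)

Model B: I like grapes. (0 insertions, 1 deletion, 0 substitutions, 4 ground-truth words: WER = ¼)

Model C: I really like crepes. (0 insertions, 0 deletions, 1 substitution, 4 ground-truth words: WER = ¼)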

See? All errors are treated equally! And so the word error rate of each of the models above evaluates to ¼. Keep in mind that a lower WER is better. A perfect transcript, therefore, would have a WER of 0, since it wouldn’t have inserted, deleted, or misheard any words.
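
To make that arithmetic concrete, here is a minimal sketch that plugs hand-tallied error counts into the formula. The helper name `wer_from_counts` is our own illustrative choice, and the counts come from reading the three example sentences against the ground-truth rather than from any alignment algorithm.

```python
from fractions import Fraction


def wer_from_counts(insertions, deletions, substitutions, ref_word_count):
    """Word error rate from hand-tallied error counts (illustrative helper)."""
    if ref_word_count <= 0:
        raise ValueError("the ground-truth must contain at least one word")
    return Fraction(insertions + deletions + substitutions, ref_word_count)


# Ground truth: "I really like grapes" (4 words)
models = {
    "A": wer_from_counts(1, 0, 0, 4),  # inserted an extra "really"
    "B": wer_from_counts(0, 1, 0, 4),  # deleted "really"
    "C": wer_from_counts(0, 0, 1, 4),  # substituted "crepes" for "grapes"
}

for name, score in models.items():
    print(f"Model {name}: WER = {score} = {float(score):.2f}")
```

Each line prints a WER of 1/4, or 0.25, which matches the result above.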

And if you want to see what this formula looks like in code, check out the function below!

import string

def wer(ref, hyp):
   # Remove the punctuation from both the truth and transcription
   ref_no_punc = ref.translate(str.maketrans('', '', string.punctuation))
   hyp_no_punc = hyp.translate(str.maketrans('', '', string.punctuation))
   #Calculation starts here
   r = ref_no_punc.split()
   h = hyp_no_punc.split()
   # costs will hold the costs, like in the Levenshtein distance algorithm
   costs = [[0 for inner in range(len(h)+1)] for outer in range(len(r)+1)]
   # backtrace will hold the operations we've done.
   # so we could later backtrace, like the WER algorithm requires us to.
   backtrace = [[0 for inner in range(len(h)+1)] for outer in range(len(r)+1)]
   OP_OK = 0
   OP_SUB = 1
   OP_INS = 2
   OP_DEL = 3
   DEL_PENALTY = 1
   INS_PENALTY = 1
   SUB_PENALTY = 1
  
   # First column represents the case where we achieve zero
   # hypothesis words by deleting all reference words.
   for i in range(1, len(r)+1):
       costs[i][0] = DEL_PENALTY*i
       backtrace[i][0] = OP_DEL
  
   # First row represents the case where we achieve the hypothesis
   # by inserting all hypothesis words into a zero-length reference.
   for j in range(1, len(h) + 1):
       costs[0][j] = INS_PENALTY * j
       backtrace[0][j] = OP_INS
  
   # computation
   for i in range(1, len(r)+1):
       for j in range(1, len(h)+1):
           if r[i-1] == h[j-1]:
               costs[i][j] = costs[i-1][j-1]
               backtrace[i][j] = OP_OK
           else:
               substitutionCost = costs[i-1][j-1] + SUB_PENALTY
               insertionCost    = costs[i][j-1] + INS_PENALTY
               deletionCost     = costs[i-1][j] + DEL_PENALTY
               
               costs[i][j] = min(substitutionCost, insertionCost, deletionCost)
               if costs[i][j] == substitutionCost:
                   backtrace[i][j] = OP_SUB
               elif costs[i][j] == insertionCost:
                   backtrace[i][j] = OP_INS
               else:
                   backtrace[i][j] = OP_DEL
               
   # backtrace through the best route:
   i = len(r)
   j = len(h)
   numSub = 0
   numDel = 0
   numIns = 0
   numCor = 0
   while i > 0 or j > 0:
       if backtrace[i][j] == OP_OK:
           numCor += 1
           i-=1
           j-=1
       elif backtrace[i][j] == OP_SUB:
           numSub +=1
           i-=1
           j-=1
       elif backtrace[i][j] == OP_INS:
           numIns += 1
           j-=1
       elif backtrace[i][j] == OP_DEL:
           numDel += 1
           i-=1
   wer_result = round((numSub + numDel + numIns) / float(len(r)), 3)
   results = {'WER':wer_result, 'numCor':numCor, 'numSub':numSub, 'numIns':numIns, 'numDel':numDel, "numCount": len(r)}
   return results
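
If you only need the final rate and not the per-type counts, the backtrace can be skipped. The sketch below is a compact variant under that assumption: it keeps just two rows of the cost table and returns a single number. The name `simple_wer` is hypothetical, and raising an error on an empty ground-truth is our own choice.

```python
import string


def simple_wer(ref, hyp):
    """Word-level Levenshtein distance divided by the reference length."""
    strip = str.maketrans("", "", string.punctuation)
    r = ref.translate(strip).split()
    h = hyp.translate(strip).split()
    if not r:
        raise ValueError("the ground-truth must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(h) + 1))
    for i, ref_word in enumerate(r, start=1):
        curr = [i]  # deleting i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(h, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            insertion = curr[j - 1] + 1
            deletion = prev[j] + 1
            curr.append(min(substitution, insertion, deletion))
        prev = curr
    return prev[-1] / len(r)


truth = "I really like grapes."
outputs = [
    "I really really like grapes.",
    "I like grapes.",
    "I really like crepes.",
]
for hyp in outputs:
    print(f"{hyp!r}: WER = {simple_wer(truth, hyp)}")
```

All three example outputs come out at 0.25, the same ¼ we worked out by hand.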

Part 3: But That’s Not the Whole Story

However, as we’ve stated before, while WER is a good first-blush metric for comparing the accuracy of speech recognition APIs, it is by no means the only metric you should consider. Importantly, you should understand how the speech recognition API will deal with your data. What words will it transcribe with ease? What words will give it trouble? Where will it flourish, and where will it stumble? What words matter to you?
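
One rough way to start answering those questions is to list which words went wrong, not just how many. The sketch below uses `difflib.SequenceMatcher` from Python's standard library to align the two word lists. Note the assumptions: the helper name `list_word_errors` is hypothetical, we lowercase the text before comparing, and `SequenceMatcher` is a general-purpose diff, so on harder examples its alignment may not match the minimum-edit alignment that WER uses.

```python
import difflib
import string


def list_word_errors(ref, hyp):
    """Return (operation, reference words, hypothesis words) for each mismatch."""
    strip = str.maketrans("", "", string.punctuation)
    r = ref.lower().translate(strip).split()
    h = hyp.lower().translate(strip).split()

    matcher = difflib.SequenceMatcher(a=r, b=h, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is "replace", "delete", or "insert"
            errors.append((tag, r[i1:i2], h[j1:j2]))
    return errors


truth = "I really like grapes."
print(list_word_errors(truth, "I really like crepes."))
print(list_word_errors(truth, "I like grapes."))
```

Running this over your own recordings and tallying the reference words that show up most often gives a quick picture of the vocabulary a model struggles with.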

So next time you see an ASR company flexing how low its WER is, you’ll know exactly what they’re talking about!


Footnote: (No, seriously, you use the exact same mouth-position and vocal posture to produce a “c” and a “g” sound. The “c” just requires your vocal cords to rest while the “g” sound requires your vocal cords to be active.)
