Article·AI Engineering & Research·Aug 14, 2025
7 min read

What Developers Need to Know About WER, KER, and KRR

This article introduces three of the most widely used speech recognition metrics: Word Error Rate (WER), Keyword Error Rate (KER), and Keyword Recognition Rate (KRR).
By Zian (Andy) Wang, AI Content Fellow

Automated transcription and speech-to-text technology is one of the more mature areas of artificial intelligence and machine learning. Long before the rise of modern AI models, transcription technologies were already widespread. In 1952, Bell Laboratories developed the “Audrey” system, which recognized single digits spoken aloud. Yet more than half a century later, speech-to-text technology is still nowhere near perfect. Although many models have become remarkably good at transcribing audio, errors still occur, and we need metrics to quantify those mistakes.

Measuring speech to text accuracy is not as simple as it sounds. Sometimes, a model might get several words wrong in a casual conversation, but you can still understand exactly what the speaker meant.

For example, if someone says “I’m gonna grab some coffee with my colleague tomorrow” but the transcription reads “I’m going to grab some copy with my calling tomorrow,” the meaning is still clear despite multiple errors. However, in other contexts like medical transcription, even a single word mistake can be devastating. A doctor dictating “patient shows no signs of infection” transcribed as “patient shows signs of infection” completely reverses the medical assessment and could lead to serious consequences for patient care.

This is why speech recognition accuracy can’t be measured with just one universal metric. Different applications require different ways of thinking about errors, and understanding these distinctions is crucial for developers building speech-enabled applications.

This article introduces three of the most widely used speech recognition metrics, covering their pros and cons along with their code implementations.

Metrics vs Loss Functions

Before diving into the details of WER, KER, and KRR, we should first define the difference between metrics and loss functions. In the machine learning space, the two terms are often mixed up or understood as interchangeable. This is usually not the case.

Loss functions are used to train machine learning models. They are the mathematical components that tell the model where it went wrong and how to adjust its parameters to improve performance.

Metrics, on the other hand, simply refer to any function that gives us a performance gauge on the model’s predictions. Loss functions essentially do the same thing and can double as metrics. However, some metrics, such as WER, KER, and KRR, cannot be used as loss functions. This is either because the nature of the error they measure is unfit for model training, or because they cannot be differentiated and used in backpropagation, or both.

Word Error Rate (WER)

What is WER?

Word Error Rate is the most widely used metric for measuring speech recognition accuracy. It calculates the percentage of words that were transcribed incorrectly compared to the reference text.

The formula for WER is straightforward: add up the substitutions, insertions, and deletions, divide by the total number of words in the reference text, and multiply by 100 to get a percentage. In other words, WER = (S + I + D) / N × 100.

WER counts three types of errors:

  • Substitutions: Wrong words (“cat” instead of “bat”)

  • Insertions: Extra words that shouldn’t be there

  • Deletions: Missing words that should be there

Simple Example

Let’s break this down with a simple example. If someone says “I need to schedule a meeting with the engineering team” but the system transcribes it as “I need schedule a meeting with the engineering steam,” that’s one substitution error (“steam” instead of “team”) and one deletion error (the missing “to” between “need” and “schedule”).

With 10 total words in the reference, the WER would be 2/10 × 100 = 20%.

Pros of WER

WER treats all errors equally, which makes it useful for general transcription tasks. You get a complete picture of overall accuracy. Most speech recognition systems report WER as their primary metric because it’s easy to understand and compare across different models.

Deepgram’s models consistently achieve industry-leading WER performance across various audio conditions and domains.

Cons of WER

This equal treatment can also be a limitation. A transcription that gets a critical keyword wrong might have the same WER as one that only messes up filler words. The impact is completely different, but WER doesn’t distinguish between them.

The example from the introduction illustrates this issue perfectly. Transcribing the reference “patient shows no signs of infection” as “patient shows signs of infection” yields a WER of only 1/6 ≈ 16.7%, yet the mistake could be devastating.

WER also doesn’t account for semantic meaning. Sometimes a transcription with higher WER might actually be more useful than one with lower WER.

However, thanks to its simplicity and straightforwardness, WER remains one of the most widely reported metrics in the speech-to-text industry, and many models use it as a benchmark for improvement.

Code Implementation
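Below is a minimal sketch of a WER implementation in Python, using a word-level Levenshtein (edit distance) dynamic program. Function and variable names are illustrative; in production you might instead reach for a library such as `jiwer`.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER (%) via word-level edit distance:
    substitutions + insertions + deletions over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / len(ref) * 100

ref = "I need to schedule a meeting with the engineering team"
hyp = "I need schedule a meeting with the engineering steam"
print(word_error_rate(ref, hyp))  # 2 errors over 10 words -> 20.0
```

This sketch lowercases and splits on whitespace only; a real pipeline would also normalize punctuation and number formatting before scoring.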

Keyword Error Rate (KER)

What is KER?

Unlike WER, which looks at every single word, Keyword Error Rate focuses only on predefined important words or phrases. KER calculates the percentage of these specific keywords that were transcribed incorrectly.

The calculation is similar to WER but much more targeted: count the errors only among your chosen keywords, divide by the total number of keyword instances, and multiply by 100.

Simple Example

Consider a customer service call where you’re tracking words like “cancel,” “refund,” and “complaint.” Even if the overall transcription has several minor errors, what really matters is whether these critical business terms were captured correctly.

If the customer says “I want to cancel my subscription and get a refund” and your system captures both “cancel” and “refund” perfectly despite other transcription mistakes, your KER for this interaction would be 0%.

Pros of KER

KER is particularly valuable in business applications where certain words trigger specific actions or workflows. Call centers use it to identify escalation situations. Compliance teams track required disclosures. Voice command systems focus on action words.

In the medical example from the introduction, by utilizing KER and identifying the key terms that determine a diagnosis, we would be much more confident in deciding whether the transcription can be trusted.

You can customize KER to match specific business needs. This makes it much more relevant than WER for targeted applications.

Cons of KER

The challenge with KER is choosing the right keywords for your specific use case. Too few keywords and you might miss important information. Too many and the metric becomes less meaningful.

KER also requires you to know in advance which words are important. This isn’t always obvious, especially in exploratory applications.

Code Implementation
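Here is a minimal bag-of-words sketch of KER in Python. Names are illustrative, and this simplification counts keyword occurrences without aligning the two texts, assuming keywords are supplied in lowercase; a production implementation would align reference and hypothesis first.

```python
from collections import Counter

def keyword_error_rate(reference: str, hypothesis: str, keywords) -> float:
    """KER (%) = missed keyword occurrences / total keyword
    occurrences in the reference. Assumes lowercase keywords."""
    ref_counts = Counter(w for w in reference.lower().split() if w in keywords)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in keywords)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # no keywords present, nothing to get wrong
    # A keyword occurrence counts as captured if it also appears in the hypothesis
    matched = sum(min(ref_counts[w], hyp_counts[w]) for w in ref_counts)
    return (total - matched) / total * 100

kw = {"cancel", "refund"}
ref = "I want to cancel my subscription and get a refund"
hyp = "I want to council my subscription and get a refund"
print(keyword_error_rate(ref, hyp, kw))  # 1 of 2 keywords missed -> 50.0
```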

Keyword Recognition Rate (KRR)

What is KRR?

Keyword Recognition Rate is essentially the flip side of KER. Instead of measuring how many keywords were wrong, KRR measures how many were right. The formula is simple: KRR = 100% - KER.

While this might seem like just a mathematical conversion, the framing matters significantly in business contexts.

Simple Example

Using the same customer service example from before, if the system correctly identifies 2 out of 2 keywords present in the conversation, KRR would be 100%.

Even if the overall transcription had some minor errors that affected WER, your keyword recognition was perfect.

Pros of KRR

Stakeholders and clients often respond better to positive metrics. Saying “we achieved 95% keyword recognition” sounds much better than “we had 5% keyword errors.” They mean the same thing, but the perception is different.

KRR becomes especially useful when defining service level agreements or presenting performance dashboards to executives. It’s also valuable for competitive positioning and ROI calculations where you want to highlight success rather than focus on failures.

Deepgram’s models typically achieve high KRR scores even in challenging audio environments. This makes them suitable for enterprise applications where consistent keyword recognition is critical.

Cons of KRR

The main limitation is that it’s just a reframing of KER. It doesn’t provide any additional technical insight beyond what KER already tells you.

Some technical teams prefer error-focused metrics during development because they make problems more obvious.

Code Implementation
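Since KRR is simply 100% minus KER, a sketch needs only the complementary counting. As before, names are illustrative and this assumes lowercase keywords with no word-level alignment.

```python
from collections import Counter

def keyword_recognition_rate(reference: str, hypothesis: str, keywords) -> float:
    """KRR (%) = correctly captured keyword occurrences / total keyword
    occurrences in the reference. Equals 100 - KER under the same counting."""
    ref_counts = Counter(w for w in reference.lower().split() if w in keywords)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in keywords)
    total = sum(ref_counts.values())
    if total == 0:
        return 100.0  # no keywords to recognize; treat as fully recognized
    matched = sum(min(ref_counts[w], hyp_counts[w]) for w in ref_counts)
    return matched / total * 100

kw = {"cancel", "refund"}
ref = "I want to cancel my subscription and get a refund"
print(keyword_recognition_rate(ref, ref, kw))  # both keywords captured -> 100.0
```

Whether the zero-keyword case should report 100% or be excluded from aggregation is a design choice worth making explicit when averaging KRR across many utterances.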

Choosing the Right Metric for Your Use Case

The choice between WER, KER, and KRR depends entirely on what you’re trying to accomplish with your speech recognition system.

WER-focused scenarios work best when you need complete transcription accuracy. If you’re building a system where every word matters equally, WER is your go-to metric. This includes applications like meeting transcriptions, podcast subtitles, or legal depositions where missing any word could be problematic.

KER/KRR-focused scenarios are ideal when you only care about specific information extraction. If your application revolves around detecting particular words or phrases, these targeted metrics will give you much more relevant insights than WER. Voice assistants, compliance monitoring, and automated call routing systems fall into this category.

Hybrid approaches make sense when you need both comprehensive and targeted accuracy. Many real-world applications benefit from tracking multiple metrics simultaneously. You might use WER to ensure overall transcription quality while using KER to monitor business-critical terms.

Industry-Specific Guidance

Different industries have evolved different standards for measuring speech recognition performance based on their specific needs and risk tolerances.

Healthcare applications typically use WER for clinical notes and patient documentation where complete accuracy is crucial. But KER is important for medication names, dosages, and critical medical terms where errors could be life-threatening.

Financial services companies focus heavily on KER for compliance terms and regulatory language that must be captured correctly. They use WER for general client communications and meeting transcriptions where complete context matters.

Media and entertainment companies rely on WER for subtitles and closed captions where viewers expect complete accuracy. Even one small error in subtitles in an episode of a TV show can be an annoyance to the viewer.

Customer service organizations often prefer KRR for satisfaction metrics and executive reporting because the positive framing works better in business contexts.

Of course, there is no single right answer, and the choice of metrics depends heavily on the situation. Enterprise applications usually adopt all of these metrics in one way or another to get a comprehensive overview of a model’s performance.
