Named entity recognition is a solved problem on clean text. On voice transcripts, it breaks badly. Much enterprise unstructured data now starts as speech: contact center calls, clinical conversations, meeting recordings, and voice agent interactions.
NER models trained on well-formatted written text regularly lose 20 to 27 F1 points on raw ASR output, per a multilingual NER benchmark. That's where production pipelines fail. This article covers why that happens, which architectures help, and how transcript quality drives accurate entity extraction.
Key Takeaways
Here's what you need to know about named entity recognition on voice data:
- Pipeline ASR-then-NER remains the dominant production architecture. LLM extraction adds accuracy but not real-time speed.
- Domain-specific entities like drug names, legal citations, and account numbers degrade the most. They also benefit the most from vocabulary prompting.
- Improving STT accuracy and formatting usually has a larger impact on NER quality than switching NER models.
Why NER Fails on Voice Data
Voice NER fails because transcripts remove the surface cues text models expect. ASR errors then hit the exact spans that matter most.
How ASR Output Differs from Training Text
Standard NER models learn from datasets like CoNLL-2003, where text is properly capitalized, punctuated, and formatted. ASR output often arrives as a lowercase, unpunctuated stream. The word "apple" could be the fruit or the company. Without a capital letter, the NER model loses its strongest signal.
AAAI 2020 research found that NER performance drops by over 40 F1 points on standard datasets when casing information is removed, a measure of how much these models lean on surface-level formatting.
ASR systems also verbalize structured data rather than formatting it. A phone number spoken as "eight hundred five five five one two one two" appears as word tokens, not as "800-555-1212."
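To make the verbalization problem concrete, here is a minimal sketch of the reverse step (often called inverse text normalization) for the phone number above. It is an illustration only: the helper names `spoken_to_digits` and `format_us_phone` are hypothetical, and the parser handles just single digit words and "<digit> hundred", which is far narrower than what a production formatter covers.

```python
# Hypothetical helpers for illustration; not part of any ASR vendor's API.
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def spoken_to_digits(text: str) -> str:
    """Collapse a run of spoken digit words into a digit string.

    Supports single digit words and '<digit> hundred' only.
    Any other token raises ValueError rather than guessing.
    """
    out = []
    prev_was_digit = False
    for tok in text.lower().split():
        if tok in DIGITS:
            out.append(DIGITS[tok])
            prev_was_digit = True
        elif tok == "hundred" and prev_was_digit:
            # "eight hundred" -> "8" followed by "00"
            out.append("00")
            prev_was_digit = False
        else:
            raise ValueError(f"unsupported token: {tok!r}")
    return "".join(out)


def format_us_phone(digits: str) -> str:
    """Format a 10-digit string as NNN-NNN-NNNN; leave other lengths unchanged."""
    if len(digits) != 10:
        return digits
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


phone = format_us_phone(
    spoken_to_digits("eight hundred five five five one two one two")
)
print(phone)  # 800-555-1212
```

Until something performs this conversion, a pattern-based phone number extractor has nothing to match against.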
The Error Cascade from Transcript to Entity
ASR errors don't spread evenly across a transcript. ACL 2024 research found that substituting ASR errors only within entity spans caused large NER F1 drops. Substituting errors only outside entity spans caused just a 1.77 F1-point drop. Entity spans are where ASR errors hurt most.
Named entities tend to be rare words: proper nouns, technical terms, and brand names. These are exactly the tokens ASR models are most likely to misrecognize.
Where Accuracy Breaks Down by Entity Type
Not all entity types degrade equally. An Arabic speech NER study using Composite Error Rate (CoER) found dramatic variation. Legal citations hit 100% CoER, meaning complete failure. Organizations reached 69% CoER from compounded casing loss, out-of-vocabulary (OOV) errors, and syntactic disruption. Language names sat at 11% CoER because they form a small, high-frequency set.
ACL 2023 research found a second problem. Even with zero ASR word errors, NER models still miss 37% of entity spans in conversational speech. Spoken language uses different syntax, hesitations, and phrasing than the written text these models train on, and that mismatch persists regardless of transcription quality.
Architecture Choices for Voice NER
Your architecture choice determines the tradeoff between latency, accuracy, and flexibility. In production today, the pipeline approach still wins for real-time work.
Pipeline Approach: ASR, Then NER
The pipeline approach runs ASR first, then feeds the transcript to a separate NER model. It's a common production pattern. You can upgrade each component independently, and small NER models like GLiNER run on CPU hardware.
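The independence of the two stages can be sketched as a pair of swappable callables. This is a structural illustration under stated assumptions: `fake_transcribe` and `gazetteer_extract` are hypothetical stand-ins for a real STT call and a real NER model, and the `Entity` type is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Entity:
    text: str
    label: str
    start: int  # character offset into the transcript
    end: int    # exclusive end offset


# Each stage is just a callable, so either one can be upgraded on its own.
Transcriber = Callable[[bytes], str]
Extractor = Callable[[str], List[Entity]]


def run_pipeline(audio: bytes, transcribe: Transcriber, extract: Extractor) -> List[Entity]:
    """Run ASR first, then feed the transcript to a separate NER stage."""
    transcript = transcribe(audio)
    return extract(transcript)


def fake_transcribe(audio: bytes) -> str:
    # Stand-in for an STT request; returns a fixed transcript.
    return "I spoke with Dr. Smith about my iPhone order."


def gazetteer_extract(text: str) -> List[Entity]:
    # Stand-in for an NER model: exact lookup against a tiny lexicon.
    lexicon = {"Dr. Smith": "PERSON", "iPhone": "PRODUCT"}
    found = []
    for phrase, label in lexicon.items():
        idx = text.find(phrase)
        if idx != -1:
            found.append(Entity(phrase, label, idx, idx + len(phrase)))
    return sorted(found, key=lambda e: e.start)


entities = run_pipeline(b"", fake_transcribe, gazetteer_extract)
for e in entities:
    print(e.label, e.text, e.start, e.end)
```

Because `run_pipeline` only depends on the two callable signatures, replacing the extractor with a transformer model or the transcriber with a different provider does not touch the other stage.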
On real ASR call center transcripts, RoBERTa with model customization reached 0.82 micro-F1. DistilRoBERTa hit 0.81 with faster inference. Both outperformed Flair BiLSTM-CRF at 0.73. The key requirement is model customization on ASR-generated training data, not clean text.
The main weakness is error propagation. ASR mistakes flow directly into the NER stage, so a single misheard account number can quietly poison everything downstream. Multiple production teams address this by adding a correction layer on top of the pipeline rather than replacing the architecture.
LLM-Based Extraction from Transcripts
Replacing the NER model with a prompted LLM offers strong contextual performance. LLMs can infer entity boundaries without capitalization or punctuation cues. On CoNLL-2003 text benchmarks, LLaMA3.1-8B reaches 93.83 F1 with inline XML formatting, per a generative NER study.
The tradeoff is latency. LLM inference averages 670 milliseconds per generation, per a voice-agent latency study. Seven-billion-parameter models take 7.27 seconds per sample. That rules out real-time voice applications on current hardware.
LLMs also carry hallucination risk on noisy input. Recent error-correction research found that LLMs applied to ASR output fabricate words appearing in neither the transcript nor the reference at rates of 3 to 12%. On entity-heavy transcripts, a fabricated token is a wrong entity.
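One hedged way to contain that risk is a grounding check: keep only the entities whose words actually occur in the transcript. The sketch below assumes the LLM returns a flat list of entity strings; the function name `split_grounded` and the normalization rule (lowercase, alphanumeric tokens only) are illustrative choices, not a method from the cited research.

```python
import re
from typing import Iterable, List, Tuple


def _norm(s: str) -> str:
    """Lowercase and keep only alphanumeric tokens, so 'Dr. Smith' matches 'dr smith'."""
    return " ".join(re.findall(r"[a-z0-9]+", s.lower()))


def split_grounded(transcript: str, candidates: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split candidates into (found in transcript, not found in transcript).

    Matching is on whole normalized tokens, so 'renal' does not match 'adrenal'.
    """
    haystack = f" {_norm(transcript)} "
    grounded, ungrounded = [], []
    for cand in candidates:
        needle = _norm(cand)
        if needle and f" {needle} " in haystack:
            grounded.append(cand)
        else:
            ungrounded.append(cand)
    return grounded, ungrounded


transcript = "the patient was prescribed tretinoin by dr smith last tuesday"
llm_entities = ["Tretinoin", "Dr. Smith", "Dr. Jones"]  # last one is fabricated
kept, dropped = split_grounded(transcript, llm_entities)
print(kept)     # ['Tretinoin', 'Dr. Smith']
print(dropped)  # ['Dr. Jones']
```

A check like this only catches entities with no support in the transcript. It cannot tell whether the transcript itself misheard the audio, so it complements upstream accuracy work rather than replacing it.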
Joint Audio-to-Entity Models
Joint models reduce the ASR-to-NER handoff by producing transcript and entity annotations in one pass. They're promising, but still harder to deploy.
WhisperNER achieves 53.86 average F1 on zero-shot benchmarks. That's 1.57 points above the best pipeline baseline. On in-domain MIT Movie data, the gap widens to +6.01 F1 points.
The cost is deployment overhead. WhisperNER is a 2B-parameter model, needs roughly 8GB of memory, and has no hosted inference provider. Upgrading the base ASR model requires full retraining.
Choosing by Use Case: Real-Time vs. Batch
For real-time streaming, the pipeline approach is the viable option today. That applies to voice agents and live call analytics. NER models add tens of milliseconds on top of ASR latency. LLM and joint-model approaches can't meet real-time constraints yet.
For batch processing, LLM-based extraction offers the best accuracy on rare and domain-specific entities. Apple research on retrieval-augmented LLM correction reports 33 to 39% relative WER reduction on rare music entities. Joint models like WhisperNER fit research and proof-of-concept work where you control the GPU infrastructure.
Improving NER Accuracy at the Source
If you want better voice NER, improve the transcript first. STT accuracy, formatting recovery, and vocabulary control all reduce downstream extraction errors.
How STT Accuracy Determines NER Quality
Lower WER translates to better NER performance, but aggregate WER is a poor proxy: errors concentrated in entity spans hurt extraction far more than the overall number suggests. When evaluating STT providers for named entity recognition pipelines, track entity-level error rates, not just the headline WER.
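As a rough illustration of tracking errors by location, the sketch below aligns a reference and a hypothesis transcript with Python's `difflib` and reports the share of reference tokens that were not transcribed correctly, split by whether they fall inside an annotated entity span. It is a simplification, not a standard metric: insertions in the hypothesis are ignored, and the function name and span format are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def span_error_rates(
    ref: List[str], hyp: List[str], entity_spans: List[Tuple[int, int]]
) -> Tuple[float, float]:
    """Return (entity_error_rate, non_entity_error_rate) over reference tokens.

    entity_spans are half-open [start, end) token index ranges into ref.
    A reference token counts as correct only if the alignment marks it 'equal'.
    """
    correct = [False] * len(ref)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                correct[i] = True

    in_entity = [False] * len(ref)
    for start, end in entity_spans:
        for i in range(start, end):
            in_entity[i] = True

    def rate(flag: bool) -> float:
        idx = [i for i in range(len(ref)) if in_entity[i] == flag]
        if not idx:
            return 0.0
        return sum(not correct[i] for i in idx) / len(idx)

    return rate(True), rate(False)


ref = "please refill my tretinoin prescription today".split()
hyp = "please refill my try to win prescription today".split()
# Token 3 ("tretinoin") is the only annotated entity in the reference.
entity_rate, other_rate = span_error_rates(ref, hyp, [(3, 4)])
print(entity_rate, other_rate)  # 1.0 0.0
```

In this toy case every non-entity word is correct, yet the one token the extractor needs is gone, which is the pattern an aggregate score hides.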
Deepgram's Nova-3 model delivers a 5.26% median batch WER. For named entity recognition pipelines, that baseline accuracy reduces the number of misrecognized entity tokens before your NER model runs. Deepgram's Speech-to-Text API processes audio through deep learning models trained on real-world conditions, including the noisy and accented audio where entity recognition matters most.
Smart Formatting and Casing Recovery
Formatting recovery is one of the simplest and most effective fixes for voice NER. Restoring punctuation, capitalization, and normalized surface forms makes transcripts look more like NER training data.
Deepgram's Smart Format feature restores punctuation and capitalization, and it normalizes dates, times, currency amounts, phone numbers, and email addresses. These formatting changes can convert verbalized tokens back toward the written surface forms NER models expect.
One configuration note worth catching before it bites you: if you're consuming streaming transcripts, combining smart_format=true with no_delay=true can limit formatting in some cases, which reduces the NER benefit.
Keyterm Prompting for Domain Entities
Domain-specific terms are where ASR fails most. They're also where NER accuracy matters most to your business. Drug names, product names, and technical jargon are low-frequency tokens that ASR models frequently misrecognize.
Keyterm Prompting lets you supply up to 100 domain-specific terms per request. The model uses both the keyterm formatting and audio context to determine the final transcription. Critically, keyterms preserve case and punctuation.
Terms like "iPhone," "Dr. Smith," or "tretinoin" appear in the transcript with the exact surface form you specified. Deepgram's documentation shows illustrative examples where "try to win" becomes "tretinoin" and "building" becomes "billing" when the correct keyterms are supplied.
Reserve your 100-keyterm budget for low-frequency, high-value vocabulary. Common words don't need prompting, and burning slots on them is wasted effort. The entities your business cares about are the terms to supply.
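A small helper can enforce that budget before a request goes out. The sketch below only builds a query string. The parameter names `model` and `keyterm` (repeated once per term) are assumptions based on a reading of Deepgram's public documentation and should be verified there; `smart_format` and the 100-term limit come from this article. The function name `build_listen_query` is hypothetical.

```python
from urllib.parse import urlencode

MAX_KEYTERMS = 100  # per-request limit described in this article


def build_listen_query(keyterms, smart_format=True):
    """Build a query string with formatting enabled and one `keyterm` per term.

    Parameter names `model` and `keyterm` are assumptions; check the API docs.
    """
    # De-duplicate while preserving order, case, and punctuation.
    unique = list(dict.fromkeys(t.strip() for t in keyterms if t.strip()))
    if len(unique) > MAX_KEYTERMS:
        raise ValueError(
            f"{len(unique)} keyterms supplied; the budget is {MAX_KEYTERMS}"
        )
    params = [
        ("model", "nova-3"),
        ("smart_format", "true" if smart_format else "false"),
    ]
    params += [("keyterm", t) for t in unique]
    return urlencode(params)


query = build_listen_query(["tretinoin", "iPhone", "Dr. Smith", "tretinoin"])
print(query)
# model=nova-3&smart_format=true&keyterm=tretinoin&keyterm=iPhone&keyterm=Dr.+Smith
```

Raising on overflow, rather than silently truncating, forces an explicit decision about which terms earn a slot.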
Domain-Specific Entity Extraction Patterns
The hardest entities depend on your vertical. The transcript problems are similar, but the failure modes change by use case.
Contact Centers: Agent Names, Account IDs, and Product Mentions
Contact center transcripts carry agent names, customer account identifiers, product references, and order numbers. Account IDs spoken as digit sequences ("four seven two nine eight") arrive as word tokens without numeric formatting.
Smart formatting converts these back to digit strings. Product names and brand terms benefit from keyterm prompting, especially when products have non-obvious spellings.
Five9 doubled user authentication rates after integrating Deepgram's speech recognition into their IVR system. Accurate transcription of alphanumeric data directly improved downstream entity extraction and automated routing.
Healthcare: Clinical Terminology and PHI
Healthcare voice NER degrades sharply when transcription misses clinical terms. Drug names and other medical entities are low-frequency, high-risk tokens.
NAACL 2025 industry research measured a 17 F1-point drop when running NER on ASR output versus reference text in a medical domain. Drug names are polysyllabic, Latinate, and low-frequency in general ASR training data. Near-homophone substitutions like "renal" for "adrenal" are acoustically plausible but clinically dangerous.
For healthcare voice NER, keyterm prompting with drug names, procedure codes, and clinical terminology is a requirement, not a nice-to-have. Deepgram offers cloud, self-hosted, and private cloud deployment, and its supported models and languages cover the clinical vocabularies these pipelines depend on.
Building a Production Voice NER Pipeline
A production voice NER pipeline should improve the transcript before extraction. Sequence the components so you reduce entity errors at the source.
Sequencing Your Pipeline
Start by selecting an STT provider and measuring entity-level WER, not just aggregate WER. Turn on formatting features before your transcript reaches the NER model. That includes punctuation, casing, and numeral normalization. Supply domain keyterms to reduce entity-span transcription errors at the source.
For real-time use cases, pair a streaming STT API with a lightweight transformer NER model. RoBERTa or DistilRoBERTa models with customization are the documented best practice for production NER on transcripts. For batch processing, an LLM-based extraction pass over formatted transcripts gives you the best coverage of rare and domain-specific entities. In both cases, transcript quality is the variable with the largest impact on NER output.
Getting Started with Deepgram
Deepgram's STT API provides the transcript quality, smart formatting, and keyterm prompting that voice NER pipelines depend on. You can start testing your pipeline today.
Get started free and get $200 in credits to run your audio through Nova-3 with Smart Format and Keyterm Prompting enabled. See how your downstream NER accuracy changes when the transcript gets better.
FAQ
What Is the Difference between NER on Text and NER on Voice Transcripts?
Text NER runs on cased, punctuated input with clearer boundaries. Voice transcript NER must handle lowercase text, missing punctuation, and ASR errors concentrated in entity spans.
Can I Run NER in Real Time on Streaming Transcriptions?
Yes, with the pipeline approach. Lightweight transformer models like DistilRoBERTa add only tens of milliseconds, while LLM-based extraction isn't viable for real-time use at current inference speeds.
Which NER Libraries Work Best with ASR Output?
spaCy needs upstream casing and punctuation restoration. Zero-shot models like GLiNER reduce labeling needs, but customized BERT variants and RoBERTa remain the strongest documented options.
How Does Transcript Accuracy Affect Entity Extraction Quality?
The relationship isn't linear. Track entity-span WER, not just overall WER. Formatting fixes like casing, punctuation, and numeral normalization address a separate source of NER degradation.









