Why Medical Transcription is Hard for Humans and Machines
Medical transcription is hard for humans and machines. Human minds and fingers strain to parse and annotate ever more specialized and expanding medical terminology spoken in a kaleidoscope of accents and dialects over humming machines, crackling intercoms, droning hallway chatter, thumping emergency helicopter rotors, and many more environmental noises common in medical facilities.
Building artificial intelligence (AI) models that approach human medical transcriptionists' precision has been a long slog, but machines are gradually improving. Deepgram's recent Nova-2 medical transcription model, for example, one of the most accurate on the market with an 8.1% median Word Error Rate (WER), improved on Deepgram's first Nova medical transcription model by an impressive 11% but took over a year to create. What exactly is it that makes AI medical transcription difficult? We'll explore some specific aspects that complicate medical transcription for humans and machines (compared to transcribing everyday speech), but let's first think about what perfect medical transcription ought to look like so that we have a reference point against which to compare automated medical transcription later.
Ideal Medical Transcription
Medical transcription needs to achieve two key things: accuracy and speed. First and foremost, it must be accurate, since even tiny transcription errors in medical contexts could turn lethal.
Accuracy First
Exactly how accurate does medical transcription need to be? WER measures how far a transcription system's output deviates from what was actually said in the original audio. It might seem hyperbolic, but even a 1% WER is unacceptable because one incorrect dose, one misheard acronym, or one instance of swapping left for right (plus many other types of mistakes that we'll look into later) could lead to disastrous health outcomes. Medical transcriptions should exactly reflect what the doctor said and recorded (though if a human medical transcriptionist notices something unusual, they’re encouraged to verify with the doctor who recorded the audio).
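To make WER concrete, here's a minimal Python sketch that computes it as word-level edit distance (substitutions, deletions, and insertions) divided by the length of the reference transcript; production evaluation pipelines typically also normalize punctuation, casing, and number formatting before scoring.

```python
# Minimal WER sketch: edit distance over words divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One swapped word out of five is a 20% WER, and a potentially dangerous dose error.
print(wer("administer 0.5 mg of epinephrine", "administer 5 mg of epinephrine"))  # 0.2
```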
Speed Second
While speed takes a backseat to accuracy in medical transcription, timely transcriptions still matter a lot. At minimum, transcription processes must keep up with the rapid flow of patient visits and shift changes in health clinics. Ideally, a doctor's dictated spoken thoughts or entire doctor-patient visits should be transcribed and added to databases before the initial doctor's shift ends to ensure that the next doctors have access to updated, accurate information and can pick up where previous doctors left off. And since medical transcriptions are also used by nurses, pharmacists, and other medical specialists, transcriptions would ideally be completed, signed off on, and disseminated to all relevant caretakers before a doctor's shift ends. So the faster, the better.
Thankfully, machines can transcribe much faster than humans, so speed isn’t really an issue. Deepgram's Nova-2 Medical Model, for example, can transcribe an hour of audio in less than 30 seconds in batch mode and can transcribe live audio in near real time.
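For a sense of how simple batch transcription is to invoke, here's a hedged sketch of a pre-recorded request to Deepgram's REST API using Python's requests library; the model name, query parameters, and response layout shown are assumptions to confirm against Deepgram's current documentation.

```python
# Sketch of a batch (pre-recorded) transcription request to Deepgram's /v1/listen endpoint.
# The "nova-2-medical" model name and the response structure below are assumptions; check
# Deepgram's docs for the exact, current values.
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder

with open("dictation.wav", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2-medical", "punctuate": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

result = response.json()
# Pre-recorded responses typically nest the transcript under results -> channels -> alternatives.
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```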
Acceptable Error Rates for Automated Medical Transcription
If even a one percent error rate could inadvertently kill patients, you might wonder, what's the point of using machines to automate medical transcription? Surely software can't transcribe medical speech at greater than 99% accuracy? And even if machines somehow become more accurate than human transcriptionists, we’ll probably still want a culpable human at the helm for instances where some slight transcription error wreaks havoc. It's not obvious, for example, how one would file a malpractice suit against a machine for incorrectly transcribing dosages, organs, diseases, or allergies. If it's indeed the case that we never fully cede all control to silicon-based medical transcriptionists and instead require humans to give the final sign-off on medical transcriptions (a probable scenario), what even is the point of spending all the effort creating AI medical transcription models?
Human-in-the-Loop Machine Medical Transcription
Deepgram invests in engineering AI medical transcription models because there's an accuracy sweet spot where even imperfect machine transcripts become worth using. If, for example, AI-generated transcripts contain few enough errors (on average) that human transcriptionists can review and correct them faster than they could transcribe the audio from scratch, then humans can use machines to transcribe more audio in less time. In this scenario, human and machine medical transcriptionists might become like an editor-writer team where:
The machine medical transcription model acts as a "writer," generating a transcription "rough draft."
The human transcriptionist, or the doctors themselves, then act as an "editor," ensuring everything the machine transcribed is copacetic.
Transitioning professional medical transcriptionists from listeners-writers to listeners-editors (ideally just correcting a few mistakes along the way) could seriously boost the workload that a single human medical transcriptionist could manage. If that's the case, doctors get more time to be present with their patients and less "pajama time," time spent at home catching up on paperwork, which is a contributing factor to rising physician burnout rates.
Another scenario exists, however, where AI-generated medical transcription doesn't help at all: transcripts can contain so many errors that whatever speed advantage the machine enjoys evaporates. When this happens, it'd be quicker for human transcriptionists to toss out the machine transcript altogether and just transcribe everything themselves. This accuracy threshold is not some exact figure. Gauging whether a given AI medical transcription model increases or decreases the average human transcriptionist's output probably requires experimentation and feedback from experienced human medical transcriptionists, as the sketch below suggests.
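As a back-of-the-envelope illustration of that break-even point, the toy model below compares hypothetical review time against from-scratch transcription time as WER grows. Every number in it is an assumption pulled out of thin air, which is exactly why real measurements from working transcriptionists matter more than this sketch.

```python
# Toy break-even model: is reviewing a machine draft faster than transcribing from scratch?
# All constants are hypothetical assumptions for illustration only.
SCRATCH_MIN_PER_AUDIO_HOUR = 240       # assume ~4 hours of typing per hour of audio
REVIEW_BASE_MIN_PER_AUDIO_HOUR = 60    # assume ~1 hour to read through a clean draft
MIN_PER_ERROR = 0.25                   # assume ~15 seconds to spot and fix each error
WORDS_PER_AUDIO_HOUR = 9000            # assume ~150 spoken words per minute

def review_minutes(wer: float) -> float:
    errors = wer * WORDS_PER_AUDIO_HOUR
    return REVIEW_BASE_MIN_PER_AUDIO_HOUR + MIN_PER_ERROR * errors

for wer in (0.02, 0.05, 0.30):
    verdict = "machine draft helps" if review_minutes(wer) < SCRATCH_MIN_PER_AUDIO_HOUR else "faster from scratch"
    print(f"WER {wer:.0%}: ~{review_minutes(wer):.0f} min per audio hour -> {verdict}")
```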
Things that Complicate Medical Transcription
To become useful enough to be used in production, AI medical transcription needs to get many, many details right, handle uncertainty and different accents and dialects, learn continuously, generalize to noisy environments, and secure the data it processes. We'll go through each of these areas and consider what makes them so difficult to pull off and some examples of where they can go wonky.
Attention to Details
Let's revisit accuracy. Medical transcriptionists must get all the little things right because tiny transcription errors can create huge problems. Swapping left for right might not be a big deal for a patient with a tingling arm (though it very well could be), but it's a huge deal for someone awaiting a kidney transplant. Many seemingly "minor" errors could lead to serious medical mishaps, so medical transcription systems, human or machine, must pay close attention to details.
An Ear for Quantities
One of the most important transcription details an AI model needs to get right is numbers. Leading and trailing zeros often contribute to "10-fold" dosage errors where clinicians or pharmacists overlook a zero or misplace a decimal.
For some pharmaceutical doses, this could be lethal. Insulin, morphine, and blood thinners, for example, have minimal wiggle room for error. Similarly, robot-assisted surgery often requires physicians to specify incision sizes. We can’t mess those up. Lab numbers need to be correct too. Mistranscribing a normal 138 sodium level as 183, for example, might lead to unnecessary kidney treatments. Vitals like blood pressure and heart rate must also be correct. Ditto for tumor measurements. Most numbers must be correct in medicine because countless medical scenarios exist where mistranscribing quantities could lead to harmful or unnecessary treatments. Numbers are so vital to medicine and so easy to mess up that AI medical transcription models must especially emphasize minimizing WER for quantities, numbers, and measurements.
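One simple mitigation, sketched below with an assumed and deliberately incomplete unit list, is to surface every numeric span in a transcript so that a human reviewer explicitly checks each one instead of skimming past it.

```python
# Illustrative sketch (not a clinical tool): pull out numeric spans so a reviewer
# double-checks every quantity. The unit list is an assumption and far from exhaustive.
import re

QUANTITY_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|mL|units?|mmol/L|mEq/L|mmHg|cm|mm)?\b",
    re.IGNORECASE,
)

def flag_quantities(transcript: str) -> list[str]:
    """Return every numeric span found in the transcript."""
    return [m.group().strip() for m in QUANTITY_PATTERN.finditer(transcript)]

text = "Start insulin glargine 10 units nightly; sodium was 138 mmol/L."
print(flag_quantities(text))  # ['10 units', '138 mmol/L']
```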
Acronyms and Abbreviations
Clinicians frequently use acronyms and abbreviations to save time. Mixing up these acronyms or even getting one letter wrong is another mistake that can easily happen during transcription. And, like typing an incorrect number or decimal place, acronym mistakes can also create undesired consequences. TBI, for example, could mean "Traumatic Brain Injury" or "Tuberculosis Infection" depending on the broader context. Medicine has many such examples. Another perplexing characteristic of acronyms is that different acronyms are sometimes used to refer to the same entity. An electrocardiogram, for example, might be abbreviated as EKG or ECG. And then, complicating things even more, we pronounce some acronyms as if reading a word (e.g., COVID, SIDS, etc.) while we pronounce other acronyms by sounding out each letter (e.g., MRI, CPR, etc.). Automated medical transcription models need to handle all these types of ambiguities because they're common in medicine.
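A toy post-processing sketch below shows both cases: collapsing variant abbreviations into one canonical form and flagging ambiguous acronyms for context review. The mappings are illustrative stand-ins, not a real clinical vocabulary.

```python
# Toy acronym handling: normalize variants (EKG vs. ECG) and flag ambiguous ones (TBI).
CANONICAL = {"EKG": "ECG"}  # same entity, two accepted abbreviations
AMBIGUOUS = {"TBI": ["traumatic brain injury", "tuberculosis infection"]}

def review_acronyms(tokens: list[str]) -> list[str]:
    """Return review notes for tokens that need normalization or disambiguation."""
    notes = []
    for token in tokens:
        upper = token.upper()
        if upper in CANONICAL:
            notes.append(f"normalize {token} -> {CANONICAL[upper]}")
        elif upper in AMBIGUOUS:
            notes.append(f"flag {token}: could mean {' or '.join(AMBIGUOUS[upper])}")
    return notes

print(review_acronyms("order an EKG and evaluate for TBI".split()))
# ['normalize EKG -> ECG', 'flag TBI: could mean traumatic brain injury or tuberculosis infection']
```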
Niche Domain Terminology
All the acronyms might seem tough to juggle, but things get worse. As a layperson, if you listen in on any given medical conversation, you're apt to hear so much jargon and specialized vocabulary that it feels like listening to a foreign language.
Many medical terms contain Greek or Latin prefixes, roots, or suffixes. Some are prevalent enough in broader language that those of us outside the medical field can guess their meaning. For example, these are recognizable to many folks:
"poly" = many
"pathy" = disease
"itis" = inflammation
"a" = lack of or without
Other prefixes, roots, and suffixes are peculiar enough outside of the medical field that most laypeople without significant Greek or Latin fluency would be grasping at straws when trying to interpret their meaning. These, for example, might be tough without looking them up:
"cheil" = lip
"phthisis" = wasting or decay
"onych" = nail (e.g., fingernail or toenail)
"clysis" = irrigation or washing
To be effective, medical transcriptionists must become fluent in very specialized medical terminology. There’s no way around it.
Continuous Learning
And yet mastering all existing medical terminology, as tough as it is, isn't even enough, since medicine's lexicon constantly expands thanks to continual progress in procedures, pharmaceuticals, and medical technology. Medical transcriptionists must constantly learn new terms. Human transcriptionists need ongoing training to do this, whereas machine learning models might learn new words via updated training data and either fine-tuning existing models or training new models from scratch. Deepgram, for example, has taken both approaches. It trained an entirely new Nova-2 model to improve on the original Nova, and you can also update the Nova-2 medical model with uncommon or new words (if you find that Nova-2 commonly mistranscribes specific medical terms, just reach out and ask about Deepgram's Custom Model Training).
Regional Differences and Accents
Clinicians come from all backgrounds, which means medical transcription models need to recognize and annotate regional differences and accents correctly. While American and British doctors both use English, for example, they sometimes use different terms or spellings to refer to the same thing. Below are a few such cases:
"epinephrine" (US) = "adrenaline" (UK)
"acetaminophen" (US) = "paracetamol" (UK)
"anemia" (US) = "anaemia" (UK)
"operating room" (US) = "operating theatre" (UK)
While human transcriptionists probably don't deal with regional differences often (an American medical transcriptionist is unlikely to transcribe for British doctors and vice versa), AI medical transcription models intended to serve many locations should generalize to a variety of regional differences (another option is to train a separate AI medical transcription model for each region). Another confounding factor stems from clinicians who practice medicine in a language other than their native tongue and may therefore have strong accents. Automated medical transcription systems must train on these diverse accents to transcribe them accurately.
Generalizable to Environmental Noise
Hallway conversations, intercoms, medical machines, nearby traffic, and emergency helicopters landing and taking off can all make hospitals noisy, and though doctors' offices tend to be quieter than hospitals, they aren't immune to noise disturbances or shoddy recording quality (many doctors use their phone mics to record).
Humans can contend with environmental noise by rewinding and focusing intently on noisy segments, but AI models tend to guess as best they can (hence their hallucination problems). If AI models aren't exposed to noisy audio in their training data, their WER will climb sharply on audio segments with significant environmental noise. This means that medical transcription models either need to ingest training data with diverse environmental noise representative of real clinical settings or synthetically mix noise into otherwise clean medical recordings (or both).
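Here's a minimal sketch of that augmentation idea: mix stand-in noise into clean audio at a target signal-to-noise ratio (SNR). A production pipeline would draw from real recorded hospital noise and vary the SNR per training example.

```python
# Minimal noise-augmentation sketch: mix noise into clean speech at a chosen SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture sits at roughly `snr_db`, then add it to `speech`."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Synthetic arrays stand in for real recordings of speech and hospital background noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)   # one second of "speech" at 16 kHz
hum = rng.standard_normal(16_000)     # stand-in for intercom hum or rotor noise
noisy = mix_at_snr(clean, hum, snr_db=10.0)
```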
Handling Uncertainty
Medical transcriptionists should also “know” when they don't know something. Both human transcriptionists and machine transcription systems should flag uncertain terms or phrases for further review. For humans, experience is the guide; an old hand develops an intuition for when to double-check a term or consult the doctor who recorded the audio. On average, professional medical transcriptionists correct errors, often by asking doctors for clarification, more than six times daily (this turns medical transcriptionists into a form of quality control that wouldn't exist if doctors transcribed for themselves, and it also helps medical transcriptionists calibrate their confidence levels).
Since autoregressive speech-to-text models harness probabilities, it's possible for AI medical transcription systems built from them to assign confidence scores to different transcript sections, flagging low-confidence areas for human review. Whatever the system, human or machine, it should have a sense of when it's not confident in what it "heard" so that it can take follow-on disambiguation steps. The decision tree of what those follow-on steps should be could grow complex, but the simplest solution is probably to send the doctor a rough draft of the proposed transcript with the unclear areas annotated and let them correct it before signing off on the final version.
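Here's a small sketch of that flagging step, assuming the transcription system exposes per-word confidence scores (Deepgram's responses include a per-word confidence field, though the structure below is simplified for illustration).

```python
# Wrap low-confidence words in [[...]] so a reviewer's eye lands on them first.
def flag_low_confidence(words: list[dict], threshold: float = 0.85) -> str:
    marked = []
    for w in words:
        text = w["word"]
        marked.append(f"[[{text}]]" if w["confidence"] < threshold else text)
    return " ".join(marked)

# Simplified stand-in for word-level ASR output.
asr_output = [
    {"word": "administer", "confidence": 0.98},
    {"word": "0.5", "confidence": 0.62},      # low confidence: surface this to the doctor
    {"word": "milligrams", "confidence": 0.97},
]
print(flag_low_confidence(asr_output))  # administer [[0.5]] milligrams
```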
Privacy Preserving
Beyond generating technically accurate transcriptions, automated medical transcription software must also preserve the privacy of all the medical data it processes, because most nations regulate how their citizens' medical data must be safeguarded. Medical organizations worldwide are generally expected to secure their audio recordings and transcriptions, for example, though the exact privacy legislation spelling out how to do this varies from jurisdiction to jurisdiction. Medical data privacy regulations might specify how a medical organization can use medical transcription APIs or where medical data can be stored. For example, dozens of nations have data residency or data localization rules that require medical data to be stored and processed within the country where it originates. Deepgram's self-hosting option helps medical organizations comply with these types of regulations.
How Accurate AI Medical Transcription Models Are Built
Humans and machines both become proficient in medical transcription via extensive training. Humans' training includes formal education, often at least at an associate's degree level, to gain fluency in medical terminology, anatomy, and healthcare documentation practices. Human transcriptionists also develop accurate and fast enough typing skills to listen to recordings and type what they hear nearly simultaneously.
Medical transcription training programs for humans include exercises to develop these skills, often using real medical dictations to simulate on-the-job conditions. Machines obviously don't need to develop their typing skills since they can produce text far faster than humans, but, like humans, they too need to be exposed to a diverse range of audio data that might need to be transcribed, though much, much more of it than a human needs to train on.
The Best AI Medical Transcription Models are Trained in Phases
AI medical transcription models often undergo multiple training phases that look something like this.
They first learn to transcribe general language by training on a wide array of audio-text pairs. We can think of this as roughly analogous to humans’ broad language acquisition from toddler to high school level.
Then that broad speech-to-text model can train on specialized medical terminology by finetuning on medical-heavy corpora. Though a speech-to-text model isn't learning to think the way a doctor thinks, the model gains medical vocabulary similar to that of a doctor after studying in medical school and practicing medicine for some years.
Finally, the speech-to-text model that was fine-tuned on medical corpora must be fine-tuned again, this time specifically on pairs of audio and human medical transcriptions, to learn the narrow task of transcribing medical audio recordings.
This multi-step training process for machine transcription models typically involves exposing the AI to vast amounts of medical audio data paired with accurate transcriptions. Deepgram's Nova-2 medical model, for example, trained on around 6 million documents to build general medical fluency and then on many high-quality human transcriptions to fine-tune the model for medical transcription tasks. Machine models tend to need far more data than humans to learn the patterns and nuances of medical speech (human medical transcriptionists don't need to read or listen to thousands of transcription samples before becoming proficient), and the machine's training data must be very diverse—including many different accents, medical specialties, and types of medical reports—to ensure the model can handle a wide range of real-world medical audio scenarios.
Data Challenges
A serious challenge in developing accurate machine medical transcription models is the scarcity of available high-quality, annotated medical speech data. Unlike general speech recognition, which can use publicly plentiful data like YouTube videos or podcasts, medical transcription requires specialized, often confidential medical data. Because it's tough to scoop up medical data en masse, data scarcity slows the development of medicine-specific machine learning models: rare medical terms, acronyms, abbreviations, dialects, and accents end up underrepresented in the training data, so models are unlikely to learn them.
Another complicating factor in training machine medical transcription models is how compartmentalized medical specialties are becoming, which can require finding and updating ever more specialized datasets. All of this makes building an accurate medical speech-to-text model a challenging data engineering project in its own right (to say nothing of crafting the deep neural network architecture).
Why Building Accurate AI Medical Transcription is Worth the Toil
As medical speech-to-text models grow more accurate, they're increasing the potential for automated medical transcription to help scale healthcare in a world where many nations' expanding and aging populations are straining the time that doctors can dedicate per patient.
How?
Doctors ask their patients an exploratory series of questions; patients' answers influence the next series of questions from the doctor. This repeats in a decision-tree manner until a doctor eventually settles on some action (e.g., advice, medication, labs, operation, etc.).
Unfortunately, your doctor can't focus solely on the diagnosis conversation because proper patient care also requires meticulously documenting the conversation. The details gleaned during this process help clinicians infer causal relationships across time and maintain continuity of care across shift changes or long gaps between patient visits (healthcare data is often a sparse, irregular time series).
If your doctor doesn't precisely annotate what they learn in conversations, the details evaporate into the ether of unreliable human memory. One remedy is to type out everything you two discuss. But staring at the back of your doc's head while they dutifully peck away at their keyboard, recording every minute detail you discuss, is about as appealing to you as it is to your doc (especially after they spent 10+ years refining their medical expertise). Doctors have limited time to accurately diagnose ailments and build rapport with their patients, and they’d rather devote that limited time to you, the patient, than to annotating records.
To reduce this mechanistic notetaking and make medical diagnoses feel less sterile, many medical organizations, counterintuitively, turn to machines. Many clinicians ask for patients' permission to record doctor-patient conversations. Doctors might also record their spoken thoughts after visiting with patients. Later, a trained transcriptionist listens to and types up those audio recordings. This process generally works well but has these bottlenecks:
Since they must master medical terminology (often requiring a 2-year degree) and become fast, accurate typists, professional medical transcriptionists are in limited supply.
Transcribing takes time, creating a lag between when doctor-patient conversations take place and when records are entered into databases so that other medical specialists can access them.
Though it's difficult to train reliable, accurate, and fast medical transcription models (for all the reasons we discussed), it can be done. Deepgram's Nova-2 medical model, for example, is accurate enough to help human medical transcriptionists increase their output by acting more like editors than writers. That, in turn, helps doctors maximize the time they can spend per patient or see more patients, which creates better health outcomes and eases overburdened medical systems worldwide.