Evaluating Healthcare AI Agents: Building Trustworthy Machines


Healthcare AI agents are worthless unless they measurably cut costs, save time, and streamline operations. Since AI agent systems depend on LLMs or vision models, local GPU or API bills can stack up fast. LLMs and voice models are getting better, and we can use cheaper models for agents that do simpler tasks, but we can't just assume automating tasks with AI agents will be worth it without figuring out how much money, time, or work they'll save. But how do we even quantify AI agents’ performance?
Well, the 2025 State of Voice Report found that 84% of organizations are expanding their budgets to integrate AI voice agents into their systems—both within and outside of medicine.
Thus, we can only conclude that this vast majority of companies must have somehow calculated that AI agents are worth the money. Let’s see exactly which numbers add up:
Metric 1: Sparse Benchmarks
Measuring any AI agent's effectiveness before going live is tricky because AI agents, by design, are supposed to handle niche, somewhat open-ended tasks. For vertical-specific AI agents, like those in healthcare, these tasks are so specialized that they’re unlikely to have existing benchmarks. Worse, the AI agent evaluations that do exist often overlook cost, making it easy to form an inflated belief about AI agents’ practical value. If model A, for example, scores 85% on an AI agent benchmark but burns through $1 million in LLM API calls to do so, is it really better than model B, which earned 75% on the same benchmark while spending only $1k in LLM API calls?
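To make that trade-off concrete, here’s a minimal sketch (in Python, with made-up accuracy and cost figures, not real benchmark results) of scoring agents on what each correct answer actually costs rather than on accuracy alone:

```python
# Minimal sketch: weighing benchmark accuracy against what it costs to achieve it.
# All figures below are hypothetical placeholders, not real benchmark results.

def cost_per_correct_task(accuracy: float, total_api_cost_usd: float, tasks: int) -> float:
    """Dollars spent per correctly completed benchmark task."""
    return total_api_cost_usd / (accuracy * tasks)

agents = {
    "model_a": {"accuracy": 0.85, "total_api_cost_usd": 1_000_000, "tasks": 10_000},
    "model_b": {"accuracy": 0.75, "total_api_cost_usd": 1_000, "tasks": 10_000},
}

for name, stats in agents.items():
    print(f"{name}: accuracy={stats['accuracy']:.0%}, "
          f"cost per correct task=${cost_per_correct_task(**stats):,.2f}")
```

On raw accuracy, model A wins; per dollar of correct work, model B wins by roughly three orders of magnitude. Which framing matters more depends on how costly each individual error is for the task at hand.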
Metric 2: Accuracy (and how accurate is accurate enough?)
Let’s assume that we eventually craft benchmarks for medical AI agents' many unique tasks. Are we out of the woods? Not quite. We would be able to gauge how well one set of agents performed compared to another set of agents, which would be helpful, but we’d still need to figure out the accuracy threshold we want AI agents to meet before we’d be comfortable deploying them to production.
For most tasks, we probably want at least human-level accuracy. But for many medical and administrative tasks, human-level performance isn’t well established, so we’d first have to measure it, and only then could we devise benchmarks to test whether AI agents are approaching it.
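As a rough illustration of what “approaching human performance” could mean in practice, here’s a hedged sketch that compares an agent’s measured accuracy against an assumed human baseline. The numbers, the baseline, and the simple 95% normal-approximation interval are illustrative assumptions, not a validated evaluation protocol:

```python
# Sketch: does an agent's measured accuracy clear an assumed human baseline?
# The figures and the 95% normal-approximation interval are illustrative only.
import math

def accuracy_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for the agent's accuracy on a task set."""
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin

human_baseline = 0.92  # assumed accuracy of clinicians on the same task set
agent_low, agent_high = accuracy_interval(correct=450, total=500)

if agent_low >= human_baseline:
    print("Agent meets or exceeds the human baseline with high confidence.")
elif agent_high < human_baseline:
    print("Agent is clearly below the human baseline.")
else:
    print("Results are inconclusive; collect more evaluations before deploying.")
```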
And all the while, we’d need to keep updating benchmarks and questioning their validity for a few reasons. LLMs and other AI models often suffer from data leakage—where they accidentally learn something from a benchmark's test data that they shouldn't have—making them appear more performant than they really are. Also, conflicting financial incentives in healthcare could gunk up AI agent deployments. Insurance companies, for instance, are incentivized to deny claims to minimize their payouts, while healthcare providers might use AI agents to aggressively challenge denials, find optimal coding strategies, document every possible billable service, or even upcode (illegally billing more than they should for procedures or diagnoses). We might have already seen something similar play out: this AI system allegedly underestimated patients’ post-acute care durations (down to the day) to reduce costs, yet ~90% of its denied claims were reportedly reversed on appeal. A lawsuit claims UnitedHealth intentionally leveraged these inaccurate predictions to avoid payouts.
This clinic-insurance tug of war could trigger an arms race where insurance companies and clinicians both employ clashing armies of AI agents to maximize their bottom lines. This dynamic might tempt actors on either side to design benchmarks that favor their AI agents. Patients might get squeezed in the middle unless we create clear, transparent, third-party, and regularly updated benchmarks that put health outcomes first—not profits. But it’s hard to see the healthcare industry regulating itself here, pointing to the need for outside oversight.
Metric 3: Regulatory, Legal, and Privacy Challenges
Even if we iron out AI agents' many technical kinks and develop benchmarks for them, their legal status is a swamp. Current laws certainly aren't ready for AI agents playing doctor, and they’re probably not even ready for AI agents to complete digital paperwork. Courts have plenty of experience handling traditional software mishaps, but AI agents are an entirely different beast. Their ability to learn, adapt, and act autonomously creates legal puzzles that existing precedent offers little help with. This legal uncertainty doesn't just complicate deployment—it stops many clinics from even considering AI agent innovation, because what clinic wants to be the test case that defines a new liability law?
HIPAA, Security, and FDA Oversight
HIPAA compliance creates unique headaches for clinics that employ AI agents. Auditing AI agents' data access is particularly tricky, given their ability to process thousands of records per minute. While logging every AI agent activity is feasible, humans auditing these logs at scale might struggle to make sense of them. Another problem: unexpected links drawn by AI agents between seemingly unrelated patient data raise questions about whether those AI agents are "accessing" Protected Health Information (PHI). What if AI agents learn patients’ health details by correlating otherwise non-sensitive data? Or what if an AI agent uses patient data to learn and improve; does that constitute a "disclosure" under HIPAA, which typically involves sharing PHI with an outside entity? The Office for Civil Rights (OCR) is still grappling with these types of questions.
There are many gray areas here, but we know this: AI agents should be treated like system users with specific access rights; every action must be logged and auditable; patient consent must explicitly cover AI agent interactions; and data used to train AI agents requires special handling beyond standard HIPAA protocols.
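As a loose illustration of the first two points, here’s a minimal sketch of treating an agent as a system user with scoped permissions and an append-only audit log. The agent names, scopes, log fields, and in-memory storage are assumptions for demonstration only; a real deployment would plug into the EHR’s own identity and audit infrastructure:

```python
# Illustrative sketch: an AI agent as a scoped system user whose every action is logged.
# Agent names, scopes, and fields are assumptions, not a compliance recipe.
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Scopes each agent "user" is allowed, analogous to role-based access control.
AGENT_SCOPES = {
    "scheduling_agent": {"appointments:read", "appointments:write"},
    "billing_agent": {"claims:read", "claims:write"},
}

@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, agent_id: str, action: str, resource: str, allowed: bool) -> None:
        # Append-only entry so every agent action remains reviewable later.
        self.entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,
            "action": action,
            "resource": resource,
            "allowed": allowed,
        })

def authorize(agent_id: str, action: str, resource: str, log: AuditLog) -> bool:
    """Check the agent's scope and log the attempt, whether allowed or denied."""
    allowed = f"{resource}:{action}" in AGENT_SCOPES.get(agent_id, set())
    log.record(agent_id, action, resource, allowed)
    return allowed

log = AuditLog()
authorize("scheduling_agent", "write", "appointments", log)  # permitted and logged
authorize("scheduling_agent", "read", "claims", log)         # denied and logged
```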
The FDA is closely monitoring healthcare AI. Their 2019 guidelines for "Software as a Medical Device" (SaMD)—and their more recent updates culminating in a 2025 draft guidance on "Artificial Intelligence-Enabled Device Software Functions"—affect AI agents, especially those that help doctors make clinical decisions. Such AI agents might require FDA approval. But this is tricky territory because how do you regulate something that's designed to evolve? Since AI agents are often designed to learn and adapt over time, what threshold of “behavior” change would trigger FDA reapproval?
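One way to operationalize “behavior change”, sketched below under assumed thresholds (and not any FDA-endorsed method), is to periodically replay a frozen evaluation set and flag human or regulatory review when the deployed agent’s outputs diverge too far from a reference snapshot:

```python
# Hedged sketch: flagging when an adaptive agent drifts from a frozen reference snapshot.
# The 5% disagreement threshold is an arbitrary assumption, not a regulatory standard.
def disagreement_rate(reference_outputs: list, current_outputs: list) -> float:
    """Fraction of evaluation cases where the current agent disagrees with the snapshot."""
    mismatches = sum(r != c for r, c in zip(reference_outputs, current_outputs))
    return mismatches / len(reference_outputs)

REVIEW_THRESHOLD = 0.05  # assumed: more than 5% disagreement triggers re-review

reference = ["refer", "no_action", "refer", "schedule_followup"]
current   = ["refer", "refer",     "refer", "schedule_followup"]

rate = disagreement_rate(reference, current)
if rate > REVIEW_THRESHOLD:
    print(f"Behavior drift of {rate:.0%} exceeds threshold; escalate for re-review.")
else:
    print(f"Behavior drift of {rate:.0%} is within tolerance.")
```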
Messing up medical data privacy is easier than you’d think. Children's Minnesota, for example, accidentally exposed 37,942 patients' appointment times after simply misconfiguring its electronic calendar system. Healthcare data is routinely breached over what are, in hindsight, easily fixable issues. Proactively identifying and addressing the many edge cases that can cause privacy and security problems is already challenging enough; it will only grow more complicated if we deploy teams of interactive, semiautonomous medical AI agents, each with different data access requirements.
Multilingual Accessibility
Not all patients speak English, so a tool that meaningfully improves the experience of patients with limited English proficiency would be genuinely valuable. And machine translation is making great strides in real time: real-time multilingual speech recognition is already available in ten languages: English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch.
Expand that coverage to more languages with comparable accuracy, and multilingual accessibility improves dramatically.
A Liability Headache
The next problem is the blame game. Healthcare providers using AI agents might be on the hook if (or, more likely, when) their agents flop—much as they're usually accountable for mistakes that stem from faulty EHR systems. Courts don’t tend to give doctors a free pass just because they trusted a buggy system's outputs. For example, decimal-point errors in EHRs have altered pharmaceutical orders (a software bug), yet the clinicians who administered the incorrect doses faced the ensuing malpractice cases because they were expected to double-check the orders. This precedent raises questions about AI agents: What if an AI agent "hallucinates" a plausible yet harmful recommendation? Do courts blame the physician who acted on it or the vendors who engineered the AI agent? What if we eventually grant AI agents the autonomy to prescribe medications or plan care, perhaps in underserved areas, only to see those agents err? How do we divvy up the blame?
Courts have held software firms accountable for software defects that harm patients, even when those companies attempted to dodge liability via slippery terms of service. In the Lowe v. Cerner case, for example, a patient sued an EHR developer after a post-op oxygen-monitoring order that clinicians entered into the EHR defaulted to the wrong time, leading to that patient’s severe brain damage. In this case, the court treated the EHR as a 'product,' suggesting that pure software flaws and inadequate warnings can invite liability, just like any other medical device.
In another case, Ambrose v. St. Joseph's Hospital of Atlanta, a patient sued the hospital (rather than the software maker) for failing to update a surgical microscope’s software, an oversight that allegedly caused UV burns. Although the hospital framed this as professional malpractice (blaming the doctor), eventually an appellate court deemed it ordinary negligence (blaming the hospital).
Courts are inconsistent about all this: some toss out design defect claims because algorithms aren’t viewed as “products,” while others use vendor contracts to place responsibility on clinicians. But, overall, these rulings indicate that both developers and healthcare providers might face legal scrutiny over AI agent errors—whether from design flaws or poor oversight.
Similar issues will likely arise with medical AI agents, and more frequently, since they're "black boxes"—learning and adapting in ways that providers and vendors can't fully control or anticipate. For now, healthcare AI agent vendors are navigating murky legal waters, as most medical software precedents apply to traditional, predictable systems—not the stochastic nature of AI agents.
Unresolved Questions
Looks like a legal and regulatory nightmare, right? It’ll likely worsen as AI agents weave deeper into medical practices. We’ll face many thorny questions that our current laws, legal precedents, and medical regulations aren’t built to handle. Here’s a taste:
How do we assign liability when errors involve multiple AI agent systems from different vendors?
What transparency rules ought we to apply to AI agent-generated medical decisions?
How does liability shift as AI agents evolve over time or are fine-tuned by different organizations?
Should medical malpractice insurance expand to cover AI agent-related incidents?
How do we handle responsibility for AI agent-based care across different legal jurisdictions?
How can we safeguard patient autonomy when AI agents increasingly act on their own?
Should medical boards supervise or certify AI agents—similar to how they oversee human practitioners?
When things go wrong, blockchain or similar technology might help us track AI agents’ decisions and figure out why mistakes happened. But knowing why AI agents failed won’t address the question: Who is responsible when AI agents go off the rails? As AI agents grow more autonomous, we'll need updated laws that protect patients without squashing innovation.
Human Factors
Healthcare’s human element throws more obstacles at AI agents. One such barrier is training medical staff to use AI agents. Clinicians are already swamped, juggling packed schedules and vast administrative demands. Schooling everyone up on AI agent systems requires a hefty training investment in time and cash. When do providers squeeze in the time for this, and who foots the bill for it?
Additionally, doctors, rightly, want ironclad proof that agents will boost, not bog down, their workflows. This touches on the lack-of-existing-benchmarks issue we discussed earlier, but there’s another problem: providers often expect AI to be flawless while shrugging off human slip-ups. One AI glitch could tank its reputation, even if it’s otherwise a game-changer (humans tend to hold this double standard for AI in other areas too, self-driving vehicles being a good example).
Trust won’t come easy, nor will it be a one-shot ordeal—it’ll demand constant calibration. Too much trust risks blind faith in AI agents’ recommendations; too little might cause us to dismiss golden insights or reject their helping hands. And any trust we do build? It’s not a one-size-fits-all deal—it’ll depend on the specific AI agents, their tasks, and the practitioners using them. The same doctor might, for example, vibe with a set of diagnostic AI agents but roll their eyes at a different set of clumsy scheduling agents.
And the most important human element is the patients themselves. As clinics start offloading more tasks to AI agents, they should keep their patients front and center. That means being transparent about exactly what roles these AI agents play and treating the agents as teammates in the care process—not as replacements for human caregivers. Patients deserve to know when an AI agent helps shape their care decisions, insurance denials, or any other healthcare-related task. And patients ought to always have the option to easily, quickly switch from an AI agent to a human provider—no questions asked.
AI Agents: Cure or Caution?
If we can untangle their many knots—technical glitches, regulatory mazes, and trust gaps—AI agents might foster healthcare that’s more proactive than reactive, more personalized than standardized, and far less cumbersome than today’s mess. They’re already proving their worth as powerful tools, streamlining some admin tasks for doctors, nurses, pharmacists, insurers, and hospitals without supplanting them (at least for now). Atlas’ seamless care journey shows what might become possible: a system where agents handle the grunt work, freeing humans to focus on healing.
But don’t hold your breath for cheaper bills. Even if AI agents trim costs, history suggests that providers and insurers will hoard the savings. A 2018 National Bureau of Economic Research study by Cooper et al., for example, found that monopoly hospitals (those without a competitor within a 15-mile radius) charged 12.5% more than hospitals in competitive markets, even as technological advances in imaging slashed their unit costs: usage spiked, prices held, and patients saw little price relief while those hospitals’ profits swelled. Without strong competition or gutsy policy moves, AI agent efficiency gains are apt to just pad clinics’ and insurance companies’ bottom lines—not your pocket.
Still, there’s a win beyond cost. In places lacking specific medical specialists, like some rural areas, or communities facing language barriers, AI agents could fill care gaps—not as good as a human doc, of course, but they might be better than nothing when you can’t afford a doctor or the nearest specialist that you’d need to see is too far away.
Should we view AI agents as a cure or with caution? Probably both. AI agents might make healthcare better, fairer, and more convenient and accessible—but only if we revise the rusty system they’re stepping into instead of just slapping them on top and expecting miracles.