Article·AI Engineering & Research·Sep 3, 2025
8 min read

AI Agents for Healthcare: What Works and What Obstacles May Come

AI Agents for Healthcare Series: Part 5
By Brad Nikkel, AI Content Fellow

Yes, there's a ton of hype about healthcare AI. Yes, impressive medical transcription models are already embedded in medical companies' systems right now. And yes, AI agents are already in place across various domains, from medicine to retail.

However, room for improvement remains. After all, we can always improve accuracy, efficiency, and cost, in technical systems and human-run systems alike. So let's take a look at what's currently working and what obstacles we may face in the future:

What's Already Working?

Atlas’ story showcased AI agents already in use in some clinics. Notice that the types of tasks they handled often involved:

  1. Clear Goals, Fuzzy Execution: AI agents excel at well-defined tasks that are too sprawling or variable for simple automation.

  2. 24/7 Attention: AI agents don’t need to sleep or take breaks, making them suited for continuous monitoring.

  3. Multiple Parties: AI agents can hand off data across various stakeholders.

  4. Repetitive, Variable Work: AI agents shine where each case requires slight variations, rather than cookie-cutter tasks.

The sweet spot is tasks too complex for traditional automation but too time-consuming for continuous human attention. That's why we’re already seeing administrative AI agents leading the way in healthcare automation—tackling problems with too many alternatives and exceptions to feasibly hardcode.

The State of Voice Report indicates that 2025 is the year of the AI voice agent for this very reason. Companies are either adopting AI agents or expanding their budgets to integrate them. As a result, we know that AI agents work well enough to improve efficiency across various domains, from medicine to retail to law and beyond.

Obstacles to Deploying Healthcare AI Agents

Despite a growing number of success stories—like the handful highlighted in Atlas’ journey—scaling AI agents across healthcare still faces non-trivial engineering, regulatory, and ethical barriers. Healthcare-related tasks vary so wildly, and best practices for AI agents are still so nascent, that we’re bound to bumble through many thickets of trial and error before we even begin to grasp the hiccups awaiting. But we can foresee some of the hurdles.

Design and Implementation Challenges

First off is designing and building medical AI agents. Even without hands-on experience, you can envision some of the difficulties involved.

How Many Agents? And How Should They Interact?

A single, narrowly scoped agent can probably handle relatively straightforward tasks, like appealing a denied insurance claim. But more complex tasks, like diagnosing patients, will likely require entire teams of domain-specific agents. This mirrors healthcare itself, where multiple specialists collaborating on a complex case tend to produce better outcomes than a lone generalist; likewise, groups of specialized AI agents tend to outperform lone, more general models. This raises the question: when we throw multiple AI agents at a healthcare task, how do we optimally design their interactions?

Some tasks lend themselves to hierarchical decision-making; some tasks clearly need a human-in-the-loop; some tasks benefit from multi-agent debate (where numerous agents challenge each other's reasoning until they converge on a consensus, which improves accuracy and reduces hallucinations). And that's just scratching the surface—many tasks will probably require even more sophisticated arrangements of AI agents.
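To make the debate pattern concrete, here's a minimal sketch with stubbed "agents" standing in for real LLM calls (the stub answers and the diagnosis question are invented for illustration). Each agent proposes an answer, sees its peers' answers, and may revise; the loop stops once the group converges.

```python
# Minimal multi-agent debate loop, assuming stubbed agents in place
# of real LLM calls. Each agent proposes an answer, sees the others'
# answers, and may revise; the loop stops on consensus.
from collections import Counter

def debate(agents, question, max_rounds=3):
    """Run agents until they agree or max_rounds is reached."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(max_rounds):
        if len(set(answers)) == 1:  # consensus reached
            break
        # Each agent revises after seeing peers' current answers.
        answers = [agent(question, answers) for agent in agents]
    # Fall back to a majority vote if no full consensus emerges.
    return Counter(answers).most_common(1)[0][0]

# Stub agents: two are confident, one defers to the majority view.
agent_a = lambda q, peers: "pneumonia"
agent_b = lambda q, peers: "pneumonia"
agent_c = lambda q, peers: (Counter(peers).most_common(1)[0][0]
                            if peers else "bronchitis")

print(debate([agent_a, agent_b, agent_c], "Diagnosis for case X?"))
```

In a real system, each stub would be a prompted LLM call, and the consensus check would compare semantic agreement rather than exact strings.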

Possible AI agent configurations are perhaps as diverse as human team configurations—meaning no single blueprint will ever suit every healthcare scenario. Each task will warrant tweaking and testing to determine its ideal multi-agent setup. 

This alone is a substantial engineering challenge. And yet no matter how well we orchestrate inter-agent interactions, their Achilles' heel remains the underlying LLMs, which are prone to confusion by medicine's specialized, ever-evolving language.

Handling Rare Data

Medicine brims with obscure jargon, specialized acronyms, and the dense, so-specific-it's-confusing fine print of medical insurance policies. Such niche terms are underrepresented in general LLM training data, leaving models flailing when they encounter this long-tail vocabulary. Retrieval-augmented generation (RAG) can help here, but it's no panacea.
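The RAG idea reduces to a simple pattern: look up definitions for rare terms, then inject them into the prompt before the model sees the question. This sketch uses a toy glossary and naive keyword matching; a real pipeline would use embedding search over a curated, regularly updated terminology base.

```python
# Hedged sketch of a RAG step for rare medical jargon: retrieve
# matching glossary entries and prepend them to the prompt.
# The glossary contents and matching logic are illustrative only.
glossary = {
    "HEENT": "head, eyes, ears, nose, and throat exam",
    "NSTEMI": "non-ST-elevation myocardial infarction",
    "q.i.d.": "four times a day (dosing frequency)",
}

def retrieve(query, k=2):
    """Return up to k glossary entries whose terms appear in the query."""
    hits = [(term, defn) for term, defn in glossary.items()
            if term.lower() in query.lower()]
    return hits[:k]

def build_prompt(query):
    """Prepend retrieved definitions as context for the LLM."""
    context = "\n".join(f"{t}: {d}" for t, d in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Patient with NSTEMI prescribed aspirin q.i.d."))
```

The weak link is exactly the problem described above: if a new term isn't in the knowledge base yet, retrieval returns nothing and the model is on its own.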

How can we ensure that our datasets capture enough rare, out-of-distribution terms to prevent specialized AI agents from faltering during critical situations—especially when new medical terminology emerges as rapidly as medicine innovates? Assuming we figure that out, another battle remains: AI agents must operate in a fragmented healthcare software landscape, plagued by interoperability barriers.

Legacy Systems and Data Challenges

Many difficulties stem from clinics’ existing software. Should AI agents, for example, be designed for single-platform solutions—like Epic-only systems—or should they aim for cross-platform interoperability? Medical SaaS vendors have clear incentives to remain proprietary, and hospitals seldom overhaul their EHRs, making narrower, system-specific agents the more likely path. This platform exclusivity not only confines AI agents to narrow ecosystems but also exacerbates a longstanding challenge: healthcare’s tangled web of fractured legacy software comes with chronic interoperability issues.

Remember MYCIN, that 1970s medical "expert system" we mentioned earlier? Surprisingly, one of its biggest shortcomings didn't lie in its diagnostic or therapeutic logic; it performed reasonably well there. The real struggle was getting MYCIN to mesh with existing hospital systems. More than fifty years later, healthcare organizations still wrestle with similar integration challenges, only now with more advanced tech.

Deploying healthcare AI agents to production is messy for several reasons. They must communicate with a kaleidoscope of systems—EHRs, billing platforms, insurance portals, pharmaceutical databases, imaging systems, lab information systems, and more. Even different departments within the same facility sometimes use different and incompatible software. 

In theory, Fast Healthcare Interoperability Resources (FHIR) was intended to standardize healthcare data exchange and free the flow of information. In practice, its adoption remains uneven. One problem is that healthcare providers still operate legacy systems that predate modern interoperability efforts—some haven't been updated since the Clinton era. Another issue is that even clinics that adopt FHIR often customize it via a process called "profiling," where standard FHIR data structures are adjusted or extended to accommodate local workflows and legacy systems. While this flexibility, a deliberate design feature of FHIR, helps clinics tailor their systems to their unique needs, it also inadvertently recreates the fragmentation FHIR was meant to overcome, undermining its promise as a "universal" standard.

And since each institution enforces its own data models, access rules, and security protocols, AI agents will struggle to process multiple FHIR resources—like patient records, observations, and medications—in near real time. Every API call adds latency, and on older platforms, like the MUMPS-based VistA still used in VA hospitals, integrating AI agents will require middleware, adding additional points of failure and performance bottlenecks.
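To see what "processing multiple FHIR resources" looks like in practice, here's a minimal sketch that joins a trimmed, FHIR R4-style Patient and Observation and flattens the key fields. The JSON values are invented, and real resources carry many more (often profiled) fields than shown here.

```python
# Sketch of extracting fields from minimal FHIR R4-style resources.
# These dicts are trimmed, illustrative stand-ins for real payloads
# an agent would fetch from an EHR's FHIR API.
patient = {
    "resourceType": "Patient",
    "id": "pat-001",
    "name": [{"family": "Rivera", "given": ["Ana"]}],
    "birthDate": "1984-07-12",
}
observation = {
    "resourceType": "Observation",
    "id": "obs-bp-01",
    "subject": {"reference": "Patient/pat-001"},
    "code": {"text": "Systolic blood pressure"},
    "valueQuantity": {"value": 128, "unit": "mmHg"},
}

def summarize(pat, obs):
    """Join an Observation to its Patient and flatten key fields."""
    assert obs["subject"]["reference"].endswith(pat["id"])
    name = pat["name"][0]
    return {
        "patient": f'{name["given"][0]} {name["family"]}',
        "measure": obs["code"]["text"],
        "value": f'{obs["valueQuantity"]["value"]} '
                 f'{obs["valueQuantity"]["unit"]}',
    }

print(summarize(patient, observation))
```

Even this toy join assumes both resources use the same structure; under local profiling, an agent would need per-institution logic for each variant it encounters.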

LLMs’ function calling abilities are steadily advancing, offering real potential to streamline AI agent-API interactions and possibly ease some of healthcare’s stubborn interoperability challenges. Yet, there’s a hitch: function calling often struggles with legacy or proprietary medical APIs, which are like strangers to LLMs—rarely seen in their training data and, in many cases, entirely absent. This gap raises concerns about the overall effectiveness of AI agents that rely heavily on function calling.
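The function-calling pattern itself is straightforward; here's a minimal sketch under stated assumptions: the tool name, schema, and the stubbed model response are all hypothetical, standing in for a real LLM's tool-call output and a real billing-system API.

```python
# Hedged sketch of the function-calling pattern: the app publishes a
# schema for a tool, the model (stubbed here) emits a call as JSON,
# and a dispatcher routes it to local code. All names are invented.
import json

TOOLS = {
    "lookup_claim_status": {
        "description": "Check the status of an insurance claim.",
        "parameters": {"claim_id": "string"},
    }
}

def lookup_claim_status(claim_id):
    # Stand-in for a real billing-system API call.
    return {"claim_id": claim_id, "status": "denied"}

def dispatch(model_output):
    """Route a model-emitted function call to the matching local function."""
    call = json.loads(model_output)
    fn = {"lookup_claim_status": lookup_claim_status}[call["name"]]
    return fn(**call["arguments"])

# Stubbed model response (what an LLM's tool-call message might hold).
model_output = (
    '{"name": "lookup_claim_status", '
    '"arguments": {"claim_id": "C-123"}}'
)
print(dispatch(model_output))
```

The hard part, as the paragraph above notes, isn't the dispatch plumbing; it's that legacy and proprietary medical APIs rarely appear in training data, so the model may emit calls that don't match the schema at all.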

Ultimately, successfully deploying healthcare AI agents demands more than just technical know-how; it also requires deep domain knowledge of healthcare’s complex workflows, dated infrastructure, and strict compliance requirements. Even the most advanced AI agents will stumble unless the engineers who implement them are fluent in healthcare’s quirks.
