AI in medicine isn’t new; it’s just being reimagined. Long before ChatGPT or Gemini entered the scene, AI was already at work in healthcare.
In fact, for over 40 years, clinicians have used rule-based expert systems like DXplain to generate differential diagnoses, assist in medical training, and support decision-making in complex cases. These systems relied on structured clinical knowledge, not internet data or language prediction.
Now, though, generative AI models designed to process and generate natural language are making headlines for their surprising ability to reason through medical cases.
ChatGPT-4 and Gemini 1.5, for instance, have performed impressively in tasks that mimic clinical reasoning, even generating potential diagnoses from free-text case descriptions.
But these models weren’t built for diagnosis. They were built to understand and mimic language.
So, where does that leave the older, more specialised AI systems that were ACTUALLY designed for medical diagnostics?
That’s exactly what a new study published in JAMA Network Open set out to answer. And it found the traditional system outperforming its modern rivals!
Here’s everything about the study:
What was the study about?
The research compared the performance of three tools:
- DXplain: A tried-and-tested diagnostic decision support system (DDSS) developed decades ago and still widely used in medical education.
- ChatGPT-4: A state-of-the-art generative AI developed by OpenAI.
- Gemini 1.5: Google DeepMind’s latest large language model (LLM).
These tools were tested on 36 complex, unpublished medical cases: the kind that aren’t available online and that require nuanced interpretation. Each received the same clinical input, and a panel of expert clinicians graded the quality and relevance of the diagnoses generated.
“Amid all the interest in large language models, it’s easy to forget that the first AI systems used successfully in medicine were expert systems like DXplain.”
-noted study co-author Dr. Edward Hoffer of Massachusetts General Hospital.
This study served as a much-needed reality check. Can today’s shiny new tools outperform the old but reliable systems that have earned clinicians’ trust?
What did the GenAI vs medical expert system study find?
The results were eye-opening. It wasn’t a simple win or loss for either side:
- DXplain had the highest overall diagnostic accuracy.
- DXplain consistently ranked the correct diagnosis in the top 1, 2, or 3 positions more frequently than either LLM.
- ChatGPT-4 performed better than Gemini, but both struggled with less common or more complex cases.
- LLMs showed strong linguistic fluency but lacked the domain-specific structure of medical knowledge that DXplain was built upon.
DXplain’s structured, rule-based clinical logic set it apart. The LLMs, on the other hand, excelled in language fluency and narrative reasoning but missed the mark on medical specificity.

But what was the real takeaway from this study?
While DXplain showed its strength in structure and precision, the LLMs impressed with their narrative reasoning and flexibility. That contrast suggested that the future isn’t about choosing one tool over the other; it’s about integrating their strengths.
“Combining the powerful explanatory capabilities of existing diagnostic systems with the linguistic capabilities of large language models will enable better automated diagnostic decision support and patient outcomes.”
-said corresponding author Mitchell Feldman, MD, also of MGH’s Laboratory of Computer Science (LCS).
Why does this study matter now?
Healthcare systems today are strained. Clinicians face burnout, mounting pressure, diagnostic errors, and the rising complexity of care. There’s a need for smarter tools that can lighten the load.
However, clinicians have been cautious about trusting generative AI in diagnosis.
This study didn’t just pit two generations of AI against each other. It bridged a trust gap by:
- Testing AI in real-world clinical contexts, not synthetic exam-style prompts.
- Comparing old vs. new under the same conditions.
- Including expert review, not just metrics like “top-3 accuracy.”
More importantly, it highlighted how both systems can complement each other, not compete.
So, should doctors be worried?
Not at all. The study isn’t about replacing physicians. It’s about giving them better tools.
AI, whether it’s DXplain or ChatGPT, is like a second opinion that never sleeps. It can catch rare conditions, prompt doctors to think of overlooked possibilities, and support decision-making in high-stress situations.
But human expertise, empathy, and contextual judgment are still irreplaceable.
Final thoughts
The future of AI in healthcare isn’t about picking sides.
It’s about building hybrid tools that combine the trustworthiness of expert systems with the fluid intelligence of LLMs.
This study is a reminder that while the new may be exciting, the old isn’t obsolete. Sometimes, innovation comes from integration.
-By Rinkle Dudhani and the AHT Team