Last month, OpenAI unveiled HealthBench, a new dataset to evaluate how AI models handle healthcare-related questions.
Here, we take a critical look at what this tool actually is, and whether it can really become the gold standard for trustworthy medical AI.
The big question is: Can we count on AI to support real-world clinical decisions?
The generative healthcare AI market is expected to explode from $1.95 billion today to nearly $40 billion within a decade.
With that growth, the need for safe, transparent, and clinically validated AI tools is more urgent than ever.
And HealthBench could mark a turning point. Developed in collaboration with hundreds of physicians worldwide, it aims to hold AI to the same standards that doctors themselves are judged by.
In this article, we break down everything you need to know about HealthBench:
- What makes it different?
- How does it actually work?
- What do the early results tell us?
- What are experts praising—and where are they raising red flags?
- And most importantly, will this tool help us build truly trustworthy medical AI?
Let’s go!
Why we need better AI benchmarks in healthcare
There’s no doubt about AI’s potential. It is already making waves in medicine, helping clinicians diagnose faster, expanding access to information, and even assisting in surgeries.
As of August 7, 2024, the FDA had cleared over 950 AI-powered medical devices, a sign of how quickly these tools are being integrated into standard care.
But with great potential comes great responsibility. For AI to truly deliver on this promise, it must be both highly effective and unquestionably safe.
However, current methods for evaluating AI in healthcare fall short. Many benchmarks skip real-world complexity, lack expert validation, or fail to push advanced models to their full potential.
That is where HealthBench comes in: OpenAI’s new clinically grounded benchmark, designed to raise the bar for AI in medicine.
What exactly is HealthBench?
HealthBench is OpenAI’s first independent healthcare AI benchmark, and it’s built from the ground up with clinical safety in mind.
- Created with input from 262 physicians across six continents
- Features 5,000 realistic medical conversations ranging from simple patient questions to complex diagnostic support
- Comes with over 57,000 physician-authored assessment parameters
- Prioritises how AI communicates: not just whether it’s right, but how clearly, empathetically, and completely it responds
- Defines what an ideal AI response should (and shouldn’t) include, with weighted scores reflecting real-world clinical priorities
“Our mission is to ensure AGI benefits all of humanity—and that means not just building powerful models, but ensuring they are safe and reliable in critical domains like healthcare.”
–Karan Singhal, Head of Health AI, OpenAI
In short: HealthBench isn’t just a test. It’s a new clinical standard for evaluating AI in medicine.
How HealthBench is built
HealthBench is not built just to test AI. It is designed to meaningfully advance how we evaluate models in healthcare. Here’s what laid the ground rules for building HealthBench:
Real-world relevance
This goes beyond textbook Q&As into real clinical scenarios, mirroring how patients and doctors interact with AI. The benchmark prioritises practical, high-impact use cases over artificial quizzes.
Doctor-approved trust
Every evaluation metric is rooted in physician judgment, ensuring scores reflect what truly matters in medical practice. If an AI performs well here, it means doctors would trust it in real life.
Room to grow
The bar is high but not fixed. HealthBench leaves ample space for AI models to improve, encouraging developers to keep striving for safer, smarter, and more reliable healthcare AI.
How HealthBench works

Realistic testing
HealthBench’s 5,000 multi-turn, multilingual conversations aren’t generic. They are intentionally stress-tested using adversarial techniques and generated scenarios that reflect actual patient-doctor interactions across specialities.
Scoring system
Each AI response is automatically evaluated by GPT-4.1, which checks it against the physician-created rubrics and assigns points based on how well the response meets clinical standards. The final score is a percentage of the maximum possible points, giving a clear and quantifiable measure of performance.
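To make the mechanism concrete, here is a minimal sketch of rubric-based scoring. The criteria, point values, and helper names below are invented for illustration, not taken from OpenAI’s actual rubrics; the key idea is that positive criteria award points, harmful behaviours subtract them, and the final score is earned points over the maximum achievable, clipped to the 0–100% range.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str
    points: int   # positive for desired behaviours, negative for harmful ones
    met: bool     # whether the grader judged the response to satisfy this criterion

def score_response(rubric: list[RubricItem]) -> float:
    """Return the score as a fraction of maximum possible points, clipped to [0, 1]."""
    # Only positive-valued criteria count toward the achievable maximum.
    max_points = sum(item.points for item in rubric if item.points > 0)
    if max_points == 0:
        return 0.0
    # Met criteria contribute their points, including negative penalties.
    earned = sum(item.points for item in rubric if item.met)
    return max(0.0, min(1.0, earned / max_points))

# Hypothetical rubric for a single conversation:
rubric = [
    RubricItem("Advises seeking emergency care for red-flag symptoms", 10, True),
    RubricItem("Uses clear, empathetic language", 5, True),
    RubricItem("Recommends a specific prescription dose without context", -6, False),
]
print(f"{score_response(rubric):.0%}")  # → 100%
```

In this sketch, a response that also triggered the penalty item would earn 10 + 5 − 6 = 9 out of a possible 15 points, or 60%, which mirrors how a weighted rubric lets a single unsafe statement drag down an otherwise strong answer.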
What are the initial findings?
OpenAI tested its own models and pitted them against models from competitors like Google, Meta, Anthropic, and xAI. Here’s what emerged:

- OpenAI’s o3 model led the pack, especially in communication quality (clear, empathetic responses).
- But weak spots appeared across the board, like context awareness (missing subtle details) and completeness (skipping critical info).
- The benchmark also includes 1,000 “hard examples”, conversations where models consistently stumbled, creating a targeted challenge for future improvements.
While we’re getting closer, the models still need to figure out how to listen like a doctor and speak like one too.
What experts are saying about HealthBench
The response from the healthcare AI community has been overwhelmingly positive, but not without concern.
The Praise
HealthBench has earned widespread praise from healthcare AI researchers. They agree it’s a massive leap forward in aligning AI performance with clinical expectations.
Raj Ratwani (MedStar Health) called HealthBench a “credible and scalable” framework, predicting its wide adoption across organisations.
The Critics
The self-grading dilemma
Some experts, like Hao, question OpenAI’s approach to using its models to evaluate performance, calling the lack of independent verification “unacceptable for healthcare applications.”
Risk of amplified errors
Girish Nadkarni (Icahn School of Medicine at Mount Sinai) warns that relying on AI to grade AI could mask systemic flaws. If both the model and grader make the same mistake, it might go undetected.
The need for deeper validation
Researchers stress that subgroup analysis (by country, demographic, and speciality) and more human oversight are essential to prove these models are truly safe and effective for all patients.
What’s next: Building trust, not just scores
HealthBench is a promising step, but not the destination. For AI to be genuinely trusted in hospitals, clinics, and at home, it must go beyond great test scores. It must be:
- Transparently evaluated
- Human-verified
- Equitable across demographics
- Clinically validated—not just statistically
Ultimately, trust in medical AI won’t come from OpenAI, Google, or any benchmark. It will come when patients and doctors see real-world results that help—not harm—people.
Final thoughts: Raising the bar for Health AI
HealthBench doesn’t just measure performance, it sends a message.
In the future of AI-driven healthcare, the gold standard isn’t being good enough. It’s being trustworthy enough for the people whose lives are on the line.
By putting physicians in the driver’s seat, OpenAI is pushing the field forward. But it’s also reminding the world that safe, responsible, and human-centred AI should be the foundation of every innovation.
The next decade of healthcare AI will be defined by who earns trust, and HealthBench may be the start of that accountability revolution.
-By Alkama Sohail and the AHT Team