In real clinical settings, decisions are rarely clean or binary.
A patient walks in with vague symptoms. Their history is incomplete. Time is limited. Anxiety fills the room—sometimes unspoken, sometimes overwhelming.
In moments like these, harm doesn’t always come from the wrong decision. It often comes from the missing one.
A test not ordered. A referral not made. A warning left unsaid.
As generative AI tools enter this fragile decision space, most conversations fixate on performance: how well models answer exam-style questions, how fluent they sound, how confidently they speak the language of medicine.
But there’s a far more uncomfortable question we haven’t asked enough:
When AI is wrong in healthcare, how dangerous is it?
That’s the gap First Do NOHARM aims to address.
Rather than asking whether an AI model sounds clinically competent, NOHARM evaluates whether its recommendations, if acted upon, could cause real harm to patients. It represents a shift in how healthcare AI is evaluated.
What is NOHARM?
Numerous Options Harm Assessment for Risk in Medicine (NOHARM) is a framework for evaluating how AI systems perform in real-world medical decision-making scenarios.
It shifts the evaluation of medical AI away from apparent intelligence and toward clinical consequences, explicitly measuring clinical safety.
The framework evaluates AI systems using:
- 100 real-world clinical cases across primary and specialty care
- 4,000+ possible medical actions (tests, treatments, referrals)
- Each action annotated by expert clinicians for appropriateness and potential harm
Rather than grading answers as simply right or wrong, NOHARM measures risk. It asks: Would this decision meaningfully endanger a patient?
That shift from correctness to consequence is what makes it different.
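To make that distinction concrete, here is a minimal sketch of what consequence-based scoring could look like. Everything in it (the `ClinicalAction` schema, the harm scale, and the scoring rule) is a hypothetical illustration of the idea, not NOHARM's actual data model or metric.

```python
from dataclasses import dataclass
from enum import IntEnum

class Harm(IntEnum):
    """Hypothetical harm scale; NOHARM's annotation scheme may differ."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3

@dataclass
class ClinicalAction:
    """One candidate action (test, treatment, referral) for a case."""
    name: str
    appropriate: bool      # expert judgment: is this action indicated?
    harm_if_taken: Harm    # harm from performing it when not indicated
    harm_if_omitted: Harm  # harm from skipping it when indicated

def harm_score(actions: list[ClinicalAction], recommended: set[str]) -> int:
    """Total harm a model's recommendations could cause if acted upon.

    Binary accuracy counts every mismatch equally; a consequence-based
    score weights each error by how badly it could hurt the patient.
    """
    total = 0
    for action in actions:
        if action.name in recommended and not action.appropriate:
            total += action.harm_if_taken     # commission error
        elif action.name not in recommended and action.appropriate:
            total += action.harm_if_omitted   # omission error
    return total
```

Under a rule like this, skipping an indicated referral can cost as much as ordering a dangerous treatment, which is exactly the asymmetry that right-or-wrong grading hides.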
Who built it?
NOHARM was developed by a large, multidisciplinary team of clinicians and researchers from leading institutions, including Stanford University School of Medicine and Harvard Medical School.
The work was led under the ARiSE Healthcare Network, with contributions from clinicians and engineers such as Dr. Vishnu Ravi, Dr. Ethan Goh, Dr. Jonathan H. Chen, and others who helped build both the benchmark system and its interactive leaderboard.
The goal was to create an evaluation process rooted in authentic clinical practice rather than artificial vignettes or exam-style questions.
What was the purpose?
The primary motivation behind NOHARM was that existing AI evaluation methods were insufficient for assessing clinical safety.
Conventional benchmarks often assess AI based on knowledge recall or performance against standardised clinical questions.
But real-world medicine doesn’t work like a multiple-choice exam. It’s messy.
Patients show up with incomplete information. Comorbidities blur the picture. Uncertainty is the norm. And the cost of a wrong or missing decision can be life-threatening.
As Dr. Ravi explains, with nearly two-thirds of U.S. clinicians using AI tools daily, directly affecting millions of patients, the question isn’t whether AI will influence care; it’s whether it will do so safely.
NOHARM was created to measure that risk directly, using real consultations rather than synthetic test prompts.
What do the results reveal?
Early results from NOHARM are revealing and concerning.
When 31 state-of-the-art LLMs were evaluated using NOHARM, severe harmful errors appeared in up to ~22% of cases.
What’s even more concerning is that most of these weren’t dramatic hallucinations. They were omissions: failing to suggest a critical test, missing a referral, or overlooking a red flag.
Most importantly, NOHARM results showed that traditional benchmarks don’t predict safety very well. An AI model can ace medical exams and still generate unsafe recommendations in real clinical settings.
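A toy example, with made-up numbers purely for illustration, shows how the two metrics can diverge: a model can agree with experts on most candidate actions yet accumulate severe potential harm from a couple of omissions.

```python
# Toy numbers, purely illustrative; this is not NOHARM data.
all_actions = {"cbc", "ecg", "troponin", "cardiology_referral",
               "aspirin", "ct_head", "mri_spine", "opioids",
               "discharge_home", "urinalysis"}
expert_plan = {"cbc", "ecg", "troponin", "cardiology_referral", "aspirin"}
model_plan = {"cbc", "ecg", "aspirin"}  # omits troponin and the referral

# Binary "accuracy": fraction of actions where model and experts agree.
agree = sum((a in model_plan) == (a in expert_plan) for a in all_actions)
print(f"accuracy: {agree / len(all_actions):.0%}")  # accuracy: 80%

# The harm view: both disagreements are omissions, and if those are the
# steps that would catch a myocardial infarction, an "80% accurate"
# model is also a potentially lethal one.
print("missed:", expert_plan - model_plan)
```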
How reliable is this framework?
NOHARM’s reliability derives from its grounding in expert clinician annotation.
Each of the 4,249 possible management options in the benchmark is evaluated by a panel of board-certified clinicians, providing granular judgments about harm severity and clinical appropriateness.
This depth and diversity of expert input give NOHARM its credibility as a safety assessment tool, going far beyond the simplistic “right/wrong” scoring used by many medical AI benchmarks.
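As a rough illustration of how panel judgments might be combined (the aggregation rule below is an assumption for illustration, not NOHARM's published methodology), one common choice is a conservative consensus such as the median or the maximum severity:

```python
import statistics

# Hypothetical ratings for one management option on a 0-3 harm scale,
# from a panel of five clinicians. The values are invented.
panel_ratings = [2, 3, 2, 2, 3]

consensus = statistics.median(panel_ratings)  # robust central judgment
worst_case = max(panel_ratings)               # most conservative reading
print(consensus, worst_case)  # 2 3
```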
Who supervises NOHARM?
While NOHARM is stewarded by the academic research team and its affiliated institutions, it is not a regulatory standard overseen by healthcare authorities. At least not yet.
As a benchmark and evaluation framework, its supervision currently lies with the research collaborators. It is part of their broader effort to develop the Medical AI Superintelligence Test (MAST), a suite of clinical benchmarks that helps AI developers assess real-world safety performance.
Why do we need NOHARM in healthcare right now?
Healthcare AI adoption is moving faster than its safety guardrails.
From documentation assistants to decision support, clinicians are increasingly relying on AI outputs in high-stakes scenarios.
However, literature and experience show that models optimised for fluency can still produce clinically harmful guidance unless their safety profiles are explicitly assessed.
NOHARM offers a practical safety lens. One that evaluates AI not by how smart it sounds, but by how safely it behaves when the stakes are high.
For health systems, developers, and regulators, it invites a shift from generative AI benchmarks toward harm-based evaluation, a metric far more relevant to patient care.
What do experts say about it?
Leading voices in healthcare AI echo the significance of NOHARM.
Clinicians and data scientists emphasise that performance in knowledge recall is not the same as clinical safety, and that benchmarks like NOHARM, which assess error severity and omissions, are essential for responsible deployment.
Experts also note that multi-agent AI strategies, where different systems review each other’s outputs, may reduce harm and improve safety scores.
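The pattern those experts describe is straightforward to sketch. The loop below is a minimal, hypothetical generator-plus-reviewer setup; the `ask` helper, the model names, and the `NO_ISSUES` convention are all assumptions for illustration, not any published system's API.

```python
# Minimal sketch of multi-agent safety review, under assumed interfaces.

def ask(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion API call (assumption)."""
    raise NotImplementedError

def safe_plan(case: str, max_rounds: int = 3) -> str:
    """Have one model draft a plan and a second model audit it for harm."""
    plan = ask("generator-model", f"Propose a management plan:\n{case}")
    for _ in range(max_rounds):
        # The reviewer checks for harmful recommendations and, crucially,
        # for harmful omissions such as missing tests or referrals.
        critique = ask(
            "reviewer-model",
            "List any recommendations that could harm this patient and any "
            "critical tests or referrals that are missing. Reply NO_ISSUES "
            f"if the plan is safe.\n{plan}",
        )
        if "NO_ISSUES" in critique:
            return plan
        plan = ask(
            "generator-model",
            f"Revise the plan to address this critique:\n{critique}\n"
            f"Original case:\n{case}",
        )
    return plan  # best effort after max_rounds; flag for human review
```

A real deployment would parse the critique more robustly than a substring check, but the structure is the point: independent drafting, independent auditing, and bounded revision.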
Wrapping up
NOHARM doesn’t argue against AI in healthcare. It argues for better accountability. In doing so, it shifts the focus away from surface-level intelligence and toward clinical consequences.
It shows that harm often isn’t loud or obvious. It’s quiet. It hides in omissions, in missed follow-ups, in decisions that feel reasonable but aren’t safe.
By making those risks visible, NOHARM moves the conversation from performance to protection, from intelligence to impact.
And that shift may be one of the most important steps toward truly responsible healthcare AI.
-By Rohini Kundu and the AHT Team