When 97% accuracy hides 33% uncertainty: What claim-level verification reveals about AI diagnostics
New research from Diadia Health evaluating 3,035 biomedical claims across four frontier LLMs

We ran a claim-level transparency analysis of AI-generated diagnostic reports. Four frontier LLMs — Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and Gemini 3.1 Pro — each generated diagnostic reports for 36 patients with real biomarker data. Our transparency engine extracted 3,035 individual biomedical claims and verified each one against the scientific literature, step by step.
While outright hallucinations were rare (2.7% of claims), a far larger share of clinical reasoning (30.2%) fell into a "plausible grey zone" where the biology sounds right but the full evidence chain can't be verified. That means roughly one in three claims generated by today's best models lacks complete scientific backing, even though nothing about those claims signals uncertainty to the reader.
The trust problem
When a clinician or patient receives an AI-generated diagnostic report, every claim arrives with equal confidence. The model doesn't distinguish between a well-established finding ("elevated HbA1c is associated with insulin resistance") and a mechanistic leap that sounds reasonable but isn't fully supported in the literature.
In our evaluation, 97.3% of claims would pass a simple hallucination check. That number sounds reassuring. But it obscures a much larger issue: when we decomposed each claim into its underlying mechanistic steps and verified each step independently, we found that 32.9% of all claims — nearly one in three — could not be traced to a complete, verified evidence chain.
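For readers who want the arithmetic spelled out, here is the relationship between the two headline figures as a small Python snippet. The percentages are the aggregate results reported above; the variable names and the derived 67.1% fully supported share are ours.

```python
# Claim-level breakdown reported in the study (share of all 3,035 claims).
unsupported = 0.027   # outright hallucinations
plausible   = 0.302   # grey zone: biologically reasonable but not fully verified
supported   = 1 - unsupported - plausible        # ~0.671 fully verified

hallucination_pass_rate = supported + plausible  # 0.973 -> the reassuring number
incomplete_evidence = plausible + unsupported    # 0.329 -> the number it hides

print(f"{hallucination_pass_rate:.1%} of claims pass a binary hallucination check")
print(f"{incomplete_evidence:.1%} of claims lack a complete, verified evidence chain")
```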
This isn't a problem you can see from the outside. Without step-by-step mechanistic verification, a plausible claim looks identical to a proven one.
The grey zone of plausibility
This is the finding we think matters most.
Of 3,035 claims evaluated, 915 were classified as plausible: biologically reasonable assertions where most of the mechanistic pathway checks out, but at least one critical step lacks direct evidence. These aren't fabrications. A clinician reading them would likely nod along. They follow established physiological logic. But when you trace the reasoning to its source, a gap appears.
Consider a claim like "low vitamin D impairs thyroid hormone conversion through reduced deiodinase activity." Each piece of that chain has some basis in biology. But whether the full pathway holds, from a specific vitamin D level, through a specific enzymatic mechanism, to a specific clinical outcome, may not be well established for the context in which the model asserts it.
The distribution varied by model. Gemini 3.1 Pro placed 39.6% of its claims in this grey zone: nearly two in five. Even GPT-5.2, the strongest performer, had 24.4% of its claims in plausible territory. The grey zone is where a large share of AI clinical reasoning actually lives.
This is what standard evaluation misses. A hallucination rate of 2.7% creates a sense of safety. Adding the 30.2% plausible layer back in tells a very different story about how much of AI-generated clinical reasoning is actually proven versus merely reasonable.
Model disagreement
A second finding that surprised us: cross-model agreement was effectively zero.
Fleiss' kappa across the four models was −0.045, meaning they agreed on clinical interpretations less often than chance alone would predict. Only 3 of 36 patients (8.3%) showed consensus across models. The average divergence in support-level distributions was 31 percentage points per patient, and over 85% of claims were unique to a single model.
In practical terms, this means a patient would receive a fundamentally different clinical narrative depending on which AI model produced their report: not just different wording, but different reasoning, different root causes, and different recommendations.
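Fleiss' kappa itself is simple to compute from a patients-by-interpretations count matrix. The sketch below implements the standard formula (mean per-subject pairwise agreement corrected for chance agreement); the toy matrix is our own illustration, not the study data, and negative values mean the raters agree less often than chance alone would.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a subjects-by-categories count matrix.

    counts[i, j] = number of raters (here: models) assigning subject i
    (here: patient) to category j. Each row must sum to the same number
    of raters.
    """
    n_subjects, _ = counts.shape
    n_raters = counts[0].sum()

    # Per-subject agreement: fraction of rater pairs that pick the same category.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall category proportions.
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.sum(p_j ** 2)

    return float((p_bar - p_e) / (1 - p_e))

# Toy example: 4 models each labelling 3 patients with one of 3 interpretations.
toy = np.array([
    [2, 1, 1],
    [1, 2, 1],
    [1, 1, 2],
])
print(fleiss_kappa(toy))  # -0.25: the models agree less often than chance
```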
Why standard AI evaluation doesn't solve this
Most AI safety evaluation in healthcare operates at the report level or the benchmark level. Did the model get the diagnosis right? Did it match expert consensus on a set of clinical vignettes?
These approaches miss the claim-level problem for three reasons:
Aggregate metrics mask per-claim risk. A report can be broadly correct while containing one or two unsupported mechanistic assertions. In our data, even the best-performing model averaged 0.3 unsupported claims per report — enough to accumulate meaningful risk across a patient population.
Binary evaluation (right/wrong) can't capture the plausible middle. Standard hallucination detection asks whether a claim is true or false. But the clinically important question is whether the entire reasoning chain is verified. A claim can be "not wrong" and still rest on incomplete evidence. Without a third category, this uncertainty is invisible.
Single-model evaluation doesn't reveal disagreement. If you test one model on one task, you see its output. You don't see that another model would have told the same patient something quite different. The near-zero cross-model agreement we found suggests that model selection alone introduces substantial variability into clinical AI outputs.
How Diadia approaches this differently
The transparency engine at the center of this study is the same system that powers every root cause analysis on Diadia's clinical platform. It works by decomposing each biomedical claim into a directed graph of mechanistic steps — biomarkers, physiological processes, conditions, and interventions connected by specific causal or associative relationships. Each edge in that graph is then independently verified against the scientific literature.
This is what makes the three-tier classification possible. A claim is supported when every mechanistic step has direct evidence. It's plausible when most steps check out but one or more rest on biological reasoning rather than direct verification. It's unsupported when a critical step is contradicted or unverifiable.
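To make the rollup concrete, here is a minimal sketch of how edge-level verification could produce those three tiers. The class names, fields, and statuses are our own simplification for illustration, not Diadia's actual schema; the point is that a single inferred or contradicted step changes the label of the whole claim.

```python
from dataclasses import dataclass
from enum import Enum

class EdgeStatus(Enum):
    VERIFIED = "verified"          # direct literature evidence for this step
    INFERRED = "inferred"          # biologically reasonable, no direct evidence
    CONTRADICTED = "contradicted"  # contradicted by evidence, or unverifiable

@dataclass
class MechanisticEdge:
    source: str        # e.g. a biomarker or physiological process
    target: str        # e.g. a downstream process, condition, or intervention
    relation: str      # e.g. "reduces", "is associated with"
    status: EdgeStatus

def classify_claim(edges: list[MechanisticEdge]) -> str:
    """Three-tier label for a claim, derived from its mechanistic steps."""
    statuses = {edge.status for edge in edges}
    if EdgeStatus.CONTRADICTED in statuses:
        return "unsupported"   # a critical step is contradicted or unverifiable
    if EdgeStatus.INFERRED in statuses:
        return "plausible"     # the chain is reasonable but not fully verified
    return "supported"         # every step has direct evidence

# The vitamin D example from above: one verified step, one inferred step.
claim = [
    MechanisticEdge("low vitamin D", "deiodinase activity", "reduces", EdgeStatus.INFERRED),
    MechanisticEdge("deiodinase activity", "T4-to-T3 conversion", "drives", EdgeStatus.VERIFIED),
]
print(classify_claim(claim))  # "plausible"
```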
The result is full traceability: from any conclusion in a diagnostic report, back through each reasoning step, to the specific evidence that supports or challenges it. A clinician reviewing a Diadia report can see not just what the AI concluded, but exactly which parts of that conclusion are proven, which are reasonable inferences, and which are gaps.
This is what we believe clinical AI transparency actually requires — not just knowing how often the model is wrong, but knowing exactly where and why any given claim is or isn't fully supported.
Read the full paper
This blog post covers the highlights. The full paper includes per-model breakdowns, patient-level heatmaps, evidence quality analysis, hallucination taxonomy by mechanism type, and the complete prompt template used across all four models.
Download: Claim-Level Transparency Analysis of LLM-Generated Diagnostic Reports (PDF) →
© 2026 Diadia Health. All rights reserved.

