When 97% accuracy hides 33% uncertainty: What claim-level verification reveals about AI diagnostics
New research from Diadia Health evaluating 3,035 biomedical claims across four frontier LLMs
Diadia evaluated 3,035 biomedical claims across four frontier AI models. The hallucinations were rare. The inference presented as proof was the larger problem.

A patient comes in with a complex chronic presentation. You ran the full panel — labs, metabolics, genetics. You fed the data into your AI. It came back with a clear narrative: root drivers identified, mechanisms named, protocol outlined. You signed off on it.
Here's the question nobody is asking: how many of those mechanistic steps did the AI actually prove?
Diadia ran that analysis. Across 3,035 individual biomedical claims generated by four frontier AI models on 36 real patient cases, the hallucination rate came back at 2.7%. By every standard benchmark, that's a pass. The models look safe. But hallucination is the wrong thing to measure.
Once we decomposed each claim into its constituent mechanistic steps and verified every step independently against the literature, a different number emerged: 32.9% of all claims — nearly one in three — could not be traced to a complete, verified evidence chain with RCT data supporting each step. The reasoning was biologically sound. The evidence pointed in the right direction. But at least one step in the chain rested on inference, not a direct clinical trial confirming that specific connection.
Those claims arrived in the model output looking identical to the ones with full randomized controlled trial support. Same formatting. Same confidence. No flag.
We called this tier Plausible — and it is the finding the field isn't talking about.
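One way to picture how a single inferential step changes the status of a whole claim is a weakest-link rule. The sketch below is an illustration, not the paper's implementation: the `Tier` enum, the `claim_tier` function, and the aggregation rule itself are assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):
    SUPPORTED = "Supported"
    PLAUSIBLE = "Plausible"
    UNSUPPORTED = "Unsupported"


def claim_tier(step_tiers):
    """Aggregate per-step tiers into one claim-level tier.

    Assumed weakest-link rule (hypothetical, for illustration only):
    any Unsupported step makes the claim Unsupported; otherwise any
    Plausible step makes the claim Plausible; only a chain in which
    every step is Supported is labeled Supported.
    """
    if not step_tiers:
        raise ValueError("a claim needs at least one mechanistic step")
    if any(t is Tier.UNSUPPORTED for t in step_tiers):
        return Tier.UNSUPPORTED
    if any(t is Tier.PLAUSIBLE for t in step_tiers):
        return Tier.PLAUSIBLE
    return Tier.SUPPORTED


# Hypothetical three-step chain: two verified steps, one inferred step.
chain = [Tier.SUPPORTED, Tier.PLAUSIBLE, Tier.SUPPORTED]
print(claim_tier(chain).value)
```

Under this rule a claim with two fully verified steps and one inferred step is labeled Plausible, which matches the description above: sound reasoning, with at least one link resting on inference.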
Precision and functional medicine runs on biologically grounded inference. Genotype-intervention relationships, emerging supplement applications, novel mechanistic connections in complex chronic cases: much of the most clinically meaningful reasoning in this specialty lives in the Plausible tier. That is not a problem. Treating it as proven is the wrong response, and so is rejecting it because no RCT closes the chain end-to-end. Naming it precisely is the only response that holds up. Current AI systems don't name it. They present it as Supported and move on.
Cross-model agreement across the four models — GPT-5.2, Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.6 — came back at a Fleiss' kappa of −0.045. Statistically indistinguishable from random chance. Only 3 of 36 patients received consistent clinical narratives across all four models. The model a clinician uses is not a neutral choice. It determines which root causes get surfaced, which mechanisms get implicated, and which protocol gets written. A clinician signing off on AI-generated analysis isn't just choosing a tool. They're choosing a reasoning set — and right now, they can't see it.
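For readers who want to see how a figure like this is computed, here is a minimal sketch of the standard Fleiss' kappa formula in plain Python. The function name, the input layout, and the example labels are assumptions made for illustration; the source does not specify which categories the four models were rated on.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings: one list of labels per subject (here, one label per model).
    categories: every label that may appear.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # counts[i][j] = how many raters assigned category j to subject i
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean share of agreeing rater pairs per subject
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall category proportions
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        return float("nan")  # only one category ever used: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical labels: four models each name a primary driver for three cases.
example = [
    ["A", "A", "B", "B"],
    ["A", "B", "A", "B"],
    ["B", "B", "A", "A"],
]
print(fleiss_kappa(example, ["A", "B"]))
```

A value of 1 means the raters always agree, and a value near 0 means they agree about as often as chance would predict. Negative values, as in the example above where every case splits two against two, mean agreement falls below the chance level.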
Diadia's transparency engine sits above the LLM layer. Every claim is decomposed into a directed graph of mechanistic steps. Every edge in that graph is independently verified and labeled — Supported by Science, Plausible, or Unsupported — by deterministic rule, not by the model deciding how confident to sound. The same inputs produce the same output every time. A clinician can trace any conclusion back through each reasoning step to the specific evidence behind it. That traceability is what separates a recommendation you can defend from one you have to trust.
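A rough sketch of what edge-level labeling and tracing could look like follows. It does not describe Diadia's engine: the `Edge` fields, the `RULES` table, the evidence-type names, and the placeholder node names and references are all hypothetical, chosen only to show a fixed lookup producing the same label for the same input and a backward walk from a conclusion to its evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str          # upstream node in the reasoning graph
    target: str          # downstream node
    evidence_type: str   # hypothetical values: "rct", "mechanistic", "none"
    citation: str        # placeholder reference id, not a real citation


# Assumed deterministic rule: a fixed lookup, no model-generated confidence.
RULES = {
    "rct": "Supported by Science",
    "mechanistic": "Plausible",
    "none": "Unsupported",
}


def label(edge):
    return RULES[edge.evidence_type]


def trace(edges, conclusion):
    """Walk backward from a conclusion and list every edge that feeds it."""
    incoming = {}
    for e in edges:
        incoming.setdefault(e.target, []).append(e)

    seen, stack, steps = set(), [conclusion], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for e in incoming.get(node, []):
            steps.append((e.source, e.target, label(e), e.citation))
            stack.append(e.source)
    return steps


# Hypothetical two-step chain with placeholder nodes and references.
graph = [
    Edge("lab finding", "intermediate mechanism", "rct", "placeholder-ref-1"),
    Edge("intermediate mechanism", "proposed root driver", "mechanistic", "placeholder-ref-2"),
]
for step in trace(graph, "proposed root driver"):
    print(step)
```

Because the label comes from a lookup rather than from generated text, running the trace twice on the same graph yields the same steps and the same labels.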
The full paper covers per-model breakdowns, patient-level heatmaps showing where evidence gaps concentrate, hallucination taxonomy by mechanism type, and the complete methodology. It's the audit-grade reference we use internally — not a vendor benchmark.
© 2026 Diadia. All rights reserved.