
Your AI Gave You a Confident Answer. One in Three Claims Behind It Isn't Fully Proven.

Diadia evaluated 3,035 biomedical claims across four frontier AI models. The hallucinations were rare. The inference presented as proof was the larger problem.


A patient comes in with a complex chronic presentation. You run the full panel: labs, metabolics, genetics. You feed the data into your AI. It comes back with a clear narrative: root drivers identified, mechanisms named, protocol outlined. You sign off on it.

Here's the question nobody is asking: how many of those mechanistic steps did the AI actually prove?

Diadia ran that analysis. Across 3,035 individual biomedical claims generated by four frontier AI models on 36 real patient cases, the hallucination rate came back at 2.7%. By every standard benchmark, that's a pass. The models look safe. But hallucination is the wrong thing to measure.

The problem isn't what the AI made up. It's what it presented as settled.

Once we decomposed each claim into its constituent mechanistic steps and verified every step independently against the literature, a different number emerged: 32.9% of all claims, nearly one in three, could not be traced to a complete, verified evidence chain with randomized controlled trial (RCT) data supporting each step. The reasoning was biologically sound. The evidence pointed in the right direction. But at least one step in the chain rested on inference rather than on a clinical trial directly confirming that specific connection.

Those claims arrived in the model output looking identical to the ones with full randomized controlled trial support. Same formatting. Same confidence. No flag.

We called this tier Plausible — and it is the finding the field isn't talking about.

Plausible is not a flaw. Unlabeled Plausible is.

Precision and functional medicine runs on biologically grounded inference. Genotype-intervention relationships, emerging supplement applications, novel mechanistic connections in complex chronic cases: much of the most clinically meaningful reasoning in this specialty lives in the Plausible tier. That is not a problem. Treating it as proven is the wrong response, and so is rejecting it because no RCT closes the chain end-to-end. Naming it precisely is the only response that holds up. Current AI systems don't name it. They present it as Supported and move on.

The cross-model finding made this harder to ignore.

Cross-model agreement across the four models — GPT-5.2, Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Opus 4.6 — came back at a Fleiss' kappa of −0.045. Statistically indistinguishable from random chance. Only 3 of 36 patients received consistent clinical narratives across all four models. The model a clinician uses is not a neutral choice. It determines which root causes get surfaced, which mechanisms get implicated, and which protocol gets written. A clinician signing off on AI-generated analysis isn't just choosing a tool. They're choosing a reasoning set — and right now, they can't see it.
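For readers who want to see why a kappa near zero means chance-level agreement, here is a minimal sketch of the standard Fleiss' kappa formula in plain Python. The function name and the toy rating tables are illustrative assumptions, not the study's data or pipeline, and this post does not specify which unit (claim, root cause, narrative) was scored in the analysis.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters.

    counts[i][j] is how many raters assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    # Share of all ratings that landed in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected if raters picked categories at random
    # with the overall category frequencies.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 3 items, 4 raters (models), 2 categories.
toy_split = [[2, 2], [2, 2], [2, 2]]      # raters split evenly on every item
toy_unanimous = [[4, 0], [0, 4], [4, 0]]  # raters always agree

print(fleiss_kappa(toy_split))
print(fleiss_kappa(toy_unanimous))
```

In this toy setup the evenly split table yields a negative kappa (observed agreement falls below what chance alone would produce), while the unanimous table yields 1. A value close to zero sits between those: agreement no better than chance.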

What defensible AI looks like at the claim level

Diadia's transparency engine sits above the LLM layer. Every claim is decomposed into a directed graph of mechanistic steps. Every edge in that graph is independently verified and labeled — Supported by Science, Plausible, or Unsupported — by deterministic rule, not by the model deciding how confident to sound. The same inputs produce the same output every time. A clinician can trace any conclusion back through each reasoning step to the specific evidence behind it. That traceability is what separates a recommendation you can defend from one you have to trust.
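The idea of labeling edges by rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Diadia's implementation: the evidence categories ("rct", "mechanistic", "none"), the rule table, the example claim, and the weakest-link aggregation from edge labels to a claim label are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Sequence


class Label(IntEnum):
    # Ordered so that min() picks the weakest label.
    UNSUPPORTED = 0
    PLAUSIBLE = 1
    SUPPORTED = 2


@dataclass(frozen=True)
class Edge:
    """One mechanistic step in a claim: source -> target."""
    source: str
    target: str
    evidence: str  # hypothetical categories: "rct", "mechanistic", "none"


# Hypothetical rule table: the label depends only on the evidence type,
# never on how confident the generating model sounded.
RULES = {
    "rct": Label.SUPPORTED,
    "mechanistic": Label.PLAUSIBLE,
    "none": Label.UNSUPPORTED,
}


def label_edge(edge: Edge) -> Label:
    # Unknown evidence types fall back to UNSUPPORTED.
    return RULES.get(edge.evidence, Label.UNSUPPORTED)


def label_claim(edges: Sequence[Edge]) -> Label:
    """Assumed aggregation: a claim is only as strong as its weakest step."""
    if not edges:
        raise ValueError("a claim needs at least one mechanistic step")
    return min(label_edge(e) for e in edges)


# Hypothetical claim decomposed into three steps.
claim = [
    Edge("gene variant", "reduced enzyme activity", "rct"),
    Edge("reduced enzyme activity", "elevated marker", "rct"),
    Edge("elevated marker", "symptom", "mechanistic"),
]

for edge in claim:
    print(f"{edge.source} -> {edge.target}: {label_edge(edge).name}")
print("claim:", label_claim(claim).name)
```

Because the labels come from a lookup table and not from sampling, running this twice on the same edges gives the same result, and a single inferred step is enough to move the whole claim out of the top tier in this sketch.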

The full paper covers per-model breakdowns, patient-level heatmaps showing where evidence gaps concentrate, hallucination taxonomy by mechanism type, and the complete methodology. It's the audit-grade reference we use internally — not a vendor benchmark.

Download the full paper →