Your AI Gave You a Confident Answer. One in Three Steps Behind It Aren't Proven.
Diadia evaluated 3,035 biomedical claims across four frontier AI models. The hallucinations were rare. The inference presented as proof was the larger problem.
AI scribes returned the documentation hour. The synthesis hour they were never built to touch — reading multi-omic data before a complex case — runs on a different layer of AI entirely.

A patient with a complex chronic presentation is on your schedule tomorrow, and the work that decides her visit happens tonight. You will spend 60 minutes reading her panel, integrating the DNA report, mapping the metabolic and gut data, and drafting the protocol before she ever sits down in your office. Your AI scribe will not touch a minute of it. The scribe runs during the visit, after the reasoning is already done.
You signed off on the scribe last year, and it earned its place. Documentation dropped from an hour to a few minutes. But the hour it gave back is not the hour holding your practice back, and the question worth asking is which hour your AI is actually working on.
AI scribes solved a documentation crisis physicians had absorbed for years as unpaid overtime. A 2017 retrospective cohort study of 142 family physicians in Annals of Family Medicine found that primary care physicians spend 355 minutes of an 11.4-hour workday inside the EHR, 86 of those minutes after clinic hours. That after-hours block is what the AMA's Christine Sinsky named pajama time: the charting, the inbox, the orders, the prior-auth replies.
The consequence shows up in the burnout data. Tebra's 2025 Physician Burnout Survey ranked documentation among the top drivers: 16% of providers named it the number one contributor, tying it with difficult patients, and 26% of primary care physicians did the same. Ambient scribes have moved that number. A 2025 Phyx Primary Care report on 116 providers found that 60% fewer of them reported burnout after adopting an ambient scribe. The category earned its adoption. It simply solved the wrong hour for a practice trying to scale.

A 30-minute primary care visit produces roughly 10 minutes of post-visit charting, and a scribe collapses that to under a minute. Fixed input, fixed output, time saved on every visit. A complex case is structurally different. The clinician is reading a comprehensive lab panel, a DNA report, a gut microbiome assay, a metabolic workup, often a four-day food log. That pre-visit synthesis routinely runs 60 to 90 minutes per patient, and the scribe arrives too late to touch any of it.
The scribe solved the wrong hour. A clinic spending two hours per patient across ordering, synthesis, the visit, and protocol writing can scale only three ways: add headcount, work longer hours, or cap the panel. None of those fits a clinician whose whole model is scaling the judgment behind the system. Documentation was the cheap hour. Synthesis is the expensive one, and the scribe does not return it.
Synthesis is not transcription. It is mechanistic reasoning across interdependent datasets: pattern recognition across panels, mechanism inference, hypothesis ranking, protocol scaffolding. A single SNP changes how a metabolite reads, which changes how a lab value is interpreted, which changes the protocol that follows.

Hallucination at this layer carries a different category of risk. A 2025 review in npj Digital Medicine notes that ambient scribes built on large language models have reported hallucination rates of roughly 1% to 3%, with added risk from omissions and contextual misinterpretations. A misspelled medication in a SOAP note is recoverable. A fabricated mechanism inside a clinical recommendation is not. A clinic running on the physician's name and judgment cannot absorb that exposure.
A documentation-layer AI only has to be accurate. A synthesis-layer AI has to clear a higher bar, on three fronts. Deterministic: the same inputs produce the same output every time, so two runs of one case don't return two different protocols. Auditable: the reasoning chain stays visible enough that a clinician can trace it, agree with it, or overrule it at any step. Built around the practice: multi-omic data is read as one biological system, in line with how the clinician already works, rather than handed back as separate datasets to integrate alone.
That is a different class of system, not a feature bolted onto a scribe. Diadia's causal AI sits above the LLM layer to do exactly this. Each claim is decomposed into a directed graph of mechanistic steps. Each edge is independently verified against the literature and labeled, by deterministic rule, as Supported by Science, Plausible, or Unsupported. The clinician can trace any conclusion back through each step to the specific evidence behind it.
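For readers who think in code, here is a minimal sketch of what "a directed graph of mechanistic steps, labeled by deterministic rule" could look like. It is not Diadia's implementation. The three label names come from the description above; everything else is an assumption made for illustration: the `Edge` fields, the rule inside `label_edge` (any direct reference means Supported, only indirect references means Plausible, none means Unsupported), and the placeholder node and reference names.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    SUPPORTED = "Supported by Science"
    PLAUSIBLE = "Plausible"
    UNSUPPORTED = "Unsupported"


@dataclass(frozen=True)
class Edge:
    """One mechanistic step (cause -> effect) plus the evidence found for it."""
    cause: str
    effect: str
    direct_refs: tuple[str, ...] = ()    # hypothetical: sources that test this step directly
    indirect_refs: tuple[str, ...] = ()  # hypothetical: sources that only suggest it


def label_edge(edge: Edge) -> Label:
    """Assumed rule, for illustration only: no model call, no randomness."""
    if edge.direct_refs:
        return Label.SUPPORTED
    if edge.indirect_refs:
        return Label.PLAUSIBLE
    return Label.UNSUPPORTED


def trace(edges: list[Edge], start: str, end: str) -> list[tuple[Edge, Label]]:
    """Walk a simple chain from start to end, returning every step with its label."""
    by_cause = {e.cause: e for e in edges}  # assumes one outgoing edge per node
    path, node, seen = [], start, set()
    while node != end:
        if node in seen or node not in by_cause:
            raise ValueError(f"no traceable path from {start!r} to {end!r}")
        seen.add(node)
        edge = by_cause[node]
        path.append((edge, label_edge(edge)))
        node = edge.effect
    return path


# Placeholder claim: finding A leads to recommendation D through two intermediate steps.
chain = [
    Edge("A", "B", direct_refs=("ref-1",)),
    Edge("B", "C", indirect_refs=("ref-2",)),
    Edge("C", "D"),
]

for edge, label in trace(chain, "A", "D"):
    refs = edge.direct_refs + edge.indirect_refs
    print(f"{edge.cause} -> {edge.effect}: {label.value} {list(refs)}")
```

Two properties from the previous paragraph show up even in a toy version. The labels depend only on the stored evidence, so running the same case twice gives the same result. And every label sits on a specific step with its references attached, so a clinician can see which link in the chain is inference rather than proof.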

Practice economics is where the layer choice ultimately lands, but the dollars are a longer conversation than this piece can hold. What matters here is simpler: a documentation layer gives back minutes, and a synthesis layer gives back the expensive hour a scribe was never built to touch. What layer is your AI actually working at?
Once you can name the layer, the next question is what it's worth. For the practice economics — the hours a synthesis layer recovers and the capacity it frees — read the companion piece: Off the clock, off the P&L: Your best clinician works 14 hours a week for free.
Can an AI scribe help with complex, multi-omic cases?
Not the part that takes the time. An ambient scribe documents the visit while it happens. The 60 to 90 minutes of pre-visit synthesis — reading the panel, DNA, gut, and metabolic data and building the protocol — is finished before the scribe ever turns on.
What is the difference between an AI scribe and a clinical reasoning AI?
A scribe works at the documentation layer, transcribing and structuring what is said in the room. A clinical reasoning, or synthesis, layer interprets interdependent datasets to surface mechanisms and draft protocols. One records the visit, the other does the analysis that precedes it.
Are AI scribes accurate enough for clinical recommendations?
Reported hallucination rates for scribes run around 1% to 3%, which a clinician can catch when proofreading a note but cannot afford in an unaudited recommendation. A synthesis layer has to be deterministic and traceable to evidence at every step, a higher standard than transcription accuracy.
© 2026 Diadia. All rights reserved.