Ivan Kairatov brings a wealth of experience from the front lines of biopharma research and development, where the intersection of technology and biology is fundamentally reshaping patient care. His deep understanding of how innovation transitions from the laboratory to the clinical bedside makes him a vital voice in the debate over the role of artificial intelligence in modern medicine. As healthcare systems weigh the immense promise of generative AI against the rigorous, uncompromising standards of diagnostic imaging, Kairatov provides a grounded perspective on why linguistic fluency should never be mistaken for clinical expertise. In the following discussion, he explores the critical boundaries between general-purpose language models and the specialized algorithms currently flagging life-threatening pathologies in hospitals worldwide.
Specialized AI tools currently assist with retinal scans and lung cancer detection. How do these purpose-built algorithms differ from general-use large language models regarding training data and diagnostic reliability, and what are the specific risks of using conversational tools for clinical imaging?
The architectural difference between a task-specific algorithm and a large language model is essentially the difference between a master radiologist and a charismatic storyteller. Specialized AI tools are forged in highly controlled environments where they digest millions of precisely categorized medical images, allowing them to flag subtle irregularities like diabetic eye disease or early-stage lung cancers with remarkable precision. In contrast, large language models are fundamentally optimized for linguistic fluency and conversation rather than clinical accuracy, which creates a dangerous “authority gap.” The recent research revealing a 20 percent rate of fundamental diagnostic errors among these models highlights a terrifying reality: while an LLM might correctly identify an image as a CT scan, it lacks the deep, validated training necessary to interpret the nuances of human pathology. When a conversational tool generates an explanation that sounds authoritative but is factually incorrect, it risks leading a clinical team down a path of false confidence that the data simply does not support.
An automated system might misclassify an ischemic stroke as a hemorrhage on the opposite side of the brain. What are the immediate clinical consequences of such a fundamental diagnostic error, and what safeguards must be in place to prevent “authoritative” but incorrect interpretations from reaching a patient?
When an automated system misidentifies an ischemic stroke—caused by a blockage—as a hemorrhage, which involves active bleeding, the clinical stakes could not be higher because the treatments for these conditions are diametrically opposed. Withholding clot-dissolving therapy from an ischemic stroke patient because an AI reported a bleed, or, in the reverse error, administering those same drugs to a patient who is actually hemorrhaging, would be catastrophic, potentially turning a manageable emergency into a fatal event. In the study, we saw one model even hallucinate the location of the pathology on the completely opposite side of the brain, a mistake that feels almost surreal to a trained professional but is a known risk of generative logic. To prevent these authoritative but false narratives from reaching the bedside, hospitals must implement ironclad validation protocols where medical experts act as a non-negotiable filter for every single diagnostic output. We cannot allow the speed of AI to bypass the clinical judgment of a human physician who can weigh a patient’s history against the physical reality of the scan.
When multiple AI models evaluate the same scan, they often disagree on calcification, timing, and alternative diagnoses. How does this lack of consensus impact a physician’s workflow, and what challenges arise when these models are asked to “grade” one another’s performance?
The lack of consensus among AI models introduces a heavy cognitive load on physicians, who must then reconcile conflicting reports on critical details like calcification or the exact timing of a vascular event. In recent testing, even the four models that correctly identified the stroke couldn’t agree on the surrounding brain regions affected or the alternative diagnoses, which forces the doctor to spend more time debunking the AI’s “opinions” than focusing on the patient. The most striking finding was the failure of cross-evaluation, where models were asked to grade each other’s explanations. One model wrongly insisted the findings showed chronic brain abnormalities rather than an acute stroke and consequently penalized the other models for being correct. This “hallucination loop” proves that we cannot yet rely on AI to provide its own quality assurance, as the models’ grading criteria remain as inconsistent and subjective as their initial assessments.
Language models seem better suited for administrative tasks like summarizing reports or clinical documentation than for primary diagnosis. How should hospitals divide labor between specialized diagnostic AI and general LLMs, and what step-by-step protocols ensure that a medical expert remains the final decision-maker?
The most efficient hospital of the near future will likely treat large language models as high-tier administrative assistants while leaving the heavy lifting of image analysis to specialized diagnostic systems. General-use models are surprisingly adept at summarizing long-form reports, handling clinical documentation, and translating complex medical jargon for patient communication, which frees up the medical team to focus on direct care. However, the protocol must be rigid: any diagnostic interpretation generated by an AI must be flagged for manual review by a licensed radiologist before it ever enters the electronic health record. This ensures that while the technology speeds up the administrative workflow, the final “clinical sign-off” remains a human responsibility. By keeping the LLM in the role of a scribe rather than a specialist, we protect the patient from the roughly 20 percent error rate that currently plagues these systems.
What is your forecast for the integration of multimodal AI in clinical radiology?
My forecast for clinical radiology is a move toward a “hybrid ecosystem” where multimodal AI becomes a standard layer in the diagnostic stack, but with much narrower guardrails than the open-ended systems we see today. We will likely see specialized algorithms handling the raw data interpretation while language models act as the interface, synthesizing that data into readable reports for physicians and patients. However, the next several years will be defined by a “reality check” phase where healthcare providers realize that linguistic fluency does not equal clinical expertise, leading to much stricter regulatory oversight. Ultimately, the successful integration of these tools will depend not on how well they can chat, but on how seamlessly they can be subordinate to the life-saving intuition and oversight of a human doctor.
