A new study published in the New England Journal of Medicine shows that while large language models can outperform medical students on multiple-choice exams, they continue to struggle with real-time clinical reasoning. Researchers from the University of Alberta, Harvard, and MIT found that popular AI systems often fail to update their diagnoses when new information emerges and have difficulty recognizing irrelevant details.
Using a validated “script concordance” testing framework — designed to evaluate how clinicians weigh uncertain and evolving data — the AI models performed at the level of junior trainees; none matched senior resident or attending physician expertise. The study also found a consistent pattern of overconfidence, with the models justifying incorrect conclusions and assigning clinical meaning to unrelated information.

