AI Model Outperforms ER Doctors in Diagnosis Study

By Artūras Malašauskas, May 01, 2026

A new study in Science shows OpenAI's reasoning model matched or exceeded physician diagnostic accuracy in emergency room scenarios, though experts warn it won't replace human clinicians.

A new study published in the journal Science suggests artificial intelligence may be ready for the emergency room — at least for the diagnostic portion of patient care. Researchers found that OpenAI's o1-preview reasoning model could diagnose cases as well as, or better than, actual physicians during simulated emergency department interactions.

The research, led by Dr. Adam Rodman at Beth Israel Deaconess Medical Center in Boston, tested the AI at three critical points in patient care: initial triage, doctor examination, and admission decisions. The model analyzed electronic health records and symptom data without any real-time doctor-patient interaction. In some instances, it identified the exact diagnosis, or one very close to it, more often than the physicians who participated in the trial.

According to reporting from CBC News, the study used both real patient cases and synthetic cases with "unstructured" data from emergency department records. The goal was to mirror the high-stakes decisions doctors and nurses make in real ER environments.

What makes this different from previous AI attempts is the use of a reasoning model. Unlike standard large language models, which generate responses based on pattern matching, reasoning models are trained to think out loud and solve problems step by step, much as a human doctor would approach a diagnostic puzzle. Rodman told CBC that this approach improves diagnostic accuracy significantly.

Independent coverage from NPR gives a consistent account, noting that the AI outperformed two experienced physicians using only electronic health records and the limited information available at each stage. The model also outperformed the earlier GPT-4 model on these clinical tasks.

Here's where it gets interesting. The AI worked with text alone. In real clinical medicine, doctors don't just read charts. They listen to heartbeats with a stethoscope, check skin temperature with their hands, watch how a patient moves, and notice subtle cues that never make it into electronic records. (This is why you can't just send your symptoms to a chatbot and expect a cure.)

Dr. Nour Khatib, an Ontario physician working at Oak Valley Health's Markham Stouffville and Uxbridge Hospitals, emphasized this distinction. She described a recent patient where triage information suggested one condition, but her physical examination with a stethoscope revealed something entirely different. AI isn't going to intubate a patient or put a cast on an injured limb.

Dr. Amol Verma, an internal medicine physician at Toronto's St. Michael's Hospital, called the comparison between AI and doctors "false." He noted that no physician makes decisions based purely on text information. The physical examination — how someone looks, sounds, and feels — forms the foundation of diagnosis.

The study authors are careful not to overstate their findings. Rodman acknowledged the limitations and emphasized that more robust clinical trials are needed to ensure real-world efficacy and safety. The research relied only on data analysis, with no effect on actual diagnoses or treatments.

Privacy concerns also loom large. Verma pointed out that OpenAI is an American company, and the study relied on a model trained on U.S. data within a largely privatized healthcare system. He questioned whether the findings would apply to the Canadian context, where patient information privacy standards differ significantly.

Dr. David Reich, chief clinical officer for Mount Sinai Health System in New York (who was not involved in the work), called the paper a "beautiful summary" of how much the technology has improved. He noted that prior versions of large language models faltered when dealing with uncertainty and generating differential diagnoses. Now, he says, you have something "possibly ready for prime time."

The real question isn't whether the AI works. In this study it did, for narrow diagnostic tasks. The question is how you introduce it into clinical workflows in ways that actually improve care without adding friction or liability. (Nobody wants to explain to a patient's family that the algorithm missed something the doctor would have caught.)

Some hospitals are already experimenting with AI tools. Khatib has been working with AI scribes that transcribe doctor-patient exchanges and create detailed medical notes. It's a pilot project done with prior patient consent. Hospitals are also exploring self-scheduling using AI and chatbots that help patients understand specific illnesses.

But Khatib insists all exploration of AI in hospital settings must be done responsibly. "We are dealing with AI by putting guardrails first," she said. "We're not chasing AI headlines first."

The emergency department represents only a small portion of a patient's total medical care. Rodman acknowledged the AI would likely not have performed as impressively with records from someone who'd spent a month in the hospital. The complexity and accumulated data would change the equation entirely.

None of the researchers involved believe the findings support replacing doctors with AI. Raj Manrai, an assistant professor of biomedical informatics at Harvard Medical School who took part in the study, said the results make the case that AI models need rigorous testing through forward-looking trials. "It's a very challenging process to design these trials," he noted.

Whether hospitals will actually pay for this technology, and whether patients will trust it enough to accept AI-assisted diagnoses, remain open questions. The technology works in controlled studies. The messy reality of healthcare is a different matter entirely.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
