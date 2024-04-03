Business
Chatbot outperformed doctors in clinical reasoning in comparative study
Artificial intelligence was also “just wrong” much more often
BOSTON – ChatGPT-4, an artificial intelligence program designed to understand and generate human-like text, outperformed internal medicine residents and attending physicians at two academic medical centers in processing medical data and demonstrating clinical reasoning. In a research letter published in JAMA Internal Medicine, physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) compared the reasoning abilities of a large language model (LLM) directly to human performance using standards developed to evaluate doctors.
“It became clear early on that LLMs can make diagnoses, but anyone who practices medicine knows that medicine is much more than that,” said Adam Rodman, MD, an internal medicine physician and researcher in the Department of Medicine at BIDMC. “A diagnosis involves several steps, so we wanted to assess whether LLMs are as effective as doctors at performing this type of clinical reasoning. It is a surprising discovery that these things are capable of showing reasoning equivalent to or better than that of humans throughout the evolution of the clinical case.”
Rodman and colleagues used a previously validated tool developed to assess physicians' clinical reasoning, called the revised-IDEA (r-IDEA) score. The investigators recruited 21 attending physicians and 18 residents who each worked on one of 20 selected clinical cases comprising four sequential stages of diagnostic reasoning. The authors asked physicians to write and justify their differential diagnoses at each stage. The GPT-4 chatbot received a prompt with identical instructions and performed all 20 clinical cases. The responses from the physicians and the chatbot were then scored for clinical reasoning (r-IDEA score) and several other measures of reasoning.
“The first step is triage data, when the patient tells you what's bothering them and you get their vital signs,” said lead author Stephanie Cabral, MD, a third-year resident in internal medicine at BIDMC. “The second step is system review, when you get additional information from the patient. The third step is the physical examination and the fourth is diagnostic tests and imaging.”
Rodman, Cabral, and colleagues found that the chatbot achieved the highest r-IDEA scores, with a median score of 10 out of 10 for the LLM, 9 for attending physicians, and 8 for residents. It was more of a toss-up between the humans and the chatbot when it came to diagnostic accuracy (where the correct diagnosis ranked in the list of diagnoses they provided) and correct clinical reasoning. But the chatbot was also “just plain wrong” – it had more instances of incorrect reasoning in its responses – much more often than the residents, the researchers found. This finding underscores the idea that AI will likely be most useful as a tool to augment, not replace, the human reasoning process.
“More studies are needed to determine how LLMs can best be integrated into clinical practice, but even now they could be useful as a checkpoint, helping us ensure we aren't missing anything,” Cabral said. “My ultimate hope is that AI will improve the patient-doctor interaction by reducing some of the inefficiencies we currently experience and allow us to focus more on the conversation we have with our patients.”
“Early studies suggested that AI could make diagnoses, if all the information was fed to it,” Rodman said. “What our study shows is that AI demonstrates real reasoning – perhaps better reasoning than humans across several stages of the process. We have a unique opportunity to improve the quality and experience of healthcare for patients.”
Co-authors included Zahir Kanjee, MD, Philip Wilson, MD, and Byron Crowe, MD, of BIDMC; Daniel Restrepo, MD, of Massachusetts General Hospital; and Raja-Elie Abdulnour, MD, of Brigham and Women's Hospital.
This work was carried out with support from Harvard Catalyst | The Harvard Clinical and Translational Science Center (National Center for Advancing Translational Sciences, National Institutes of Health) (award UM1TR004408) and financial contributions from Harvard University and its affiliated academic health centers.
Potential conflicts of interest: Rodman reports a grant from the Gordon and Betty Moore Foundation. Crowe reports employment by and equity in Solera Health. Kanjee reports receiving royalties for edited books and membership on a paid advisory board for non-AI medical education products from Wolters Kluwer, as well as honoraria for continuing medical education provided by Oakstone Publishing. Abdulnour reports employment by the Massachusetts Medical Society (MMS), a nonprofit organization that owns NEJM Healer. Abdulnour does not receive royalties from sales of NEJM Healer and does not hold an ownership interest in NEJM Healer. No funding was provided by MMS for this study. Abdulnour reports funding from the Gordon and Betty Moore Foundation through the National Academy of Medicine Scholars in Diagnostic Excellence.
