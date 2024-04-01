



ChatGPT-4, an artificial intelligence program designed to understand and generate human-like text, outperformed internal medicine residents and attending physicians at two academic medical centers at processing medical data and demonstrating clinical reasoning. In a research letter published in JAMA Internal Medicine, physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) directly compared the reasoning ability of a large language model (LLM) with human performance.

“It became clear very early on that LLMs could make diagnoses, but anyone who works in medicine knows there's more to medicine than that,” said Dr. Adam Rodman, an internal medicine physician at BIDMC. “There are multiple steps behind a diagnosis, so we wanted to assess whether LLMs are as good as physicians at that kind of clinical reasoning. It is a surprising finding that they can show better reasoning than physicians in clinical cases.”

Rodman and colleagues used a previously validated tool developed to assess physicians' clinical reasoning, the revised-IDEA (r-IDEA) score. The researchers recruited 21 attending physicians and 18 residents, each of whom worked through one of 20 selected clinical cases made up of four sequential stages of diagnostic reasoning. The authors asked the physicians to write down and justify their differential diagnoses at each stage. The chatbot GPT-4 was given a prompt with the same instructions and ran all 20 clinical cases. The responses were then scored for clinical reasoning (r-IDEA score) and on several other measures of reasoning.

“The first stage is the triage data, when the patient tells you what's bothering them and you obtain vital signs,” said lead author Stephanie Cabral, M.D., a third-year internal medicine resident at BIDMC. “The second stage is the review of systems, when you obtain additional information from the patient. The third stage is the physical exam, and the fourth is diagnostic testing and imaging.”

Rodman, Cabral, and colleagues found that the chatbot earned the highest r-IDEA scores, with a median score of 10 out of 10 for the LLM, 9 for attending physicians, and 8 for residents. Humans and the bot were close to a tie on diagnostic accuracy (how high the correct diagnosis ranked in the list of diagnoses provided) and on correct clinical reasoning. However, the researchers found that the bot was sometimes “obviously wrong,” and its answers contained instances of incorrect reasoning far more often than the residents' answers did. This finding supports the idea that AI is most likely to be useful as a tool that augments, rather than replaces, the human reasoning process.

“More research is needed to determine how best to incorporate LLMs into clinical practice, but even now they could serve as a checkpoint, helping to ensure that nothing is missed,” said Cabral. “My ultimate hope is that AI will reduce current inefficiencies, improve patient-physician interactions, and allow us to focus more on the conversation with the patient.”

“Early research suggested that AI could make a diagnosis if it was given all the information,” Rodman said. “What our research shows is that AI demonstrates real reasoning, perhaps better than humans, through multiple steps of the process. We have a unique opportunity to improve the quality and experience of healthcare.”

Co-authors include Dr. Zahir Kanjee, Dr. Philip Wilson, and Dr. Byron Crowe of BIDMC; Dr. Daniel Restrepo of Massachusetts General Hospital; and Dr. Raja-Elie Abdulnour of Brigham and Women's Hospital.

This research was supported by Harvard Catalyst | The Harvard Clinical and Translational Science Center (National Center for Advancing Translational Sciences, National Institutes of Health, award UM1TR004408) and by financial contributions from Harvard University and its affiliated academic healthcare centers.

Potential conflicts of interest: Rodman reports grants from the Gordon and Betty Moore Foundation. Crowe reports employment by and equity in Solera Health. Kanjee reports receiving royalties for edited books from Wolters Kluwer, paid advisory board membership for medical education products unrelated to AI, and honoraria for continuing medical education delivered through Oakstone Publishing. Abdulnour reports employment by the Massachusetts Medical Society (MMS), the nonprofit organization that owns NEJM Healer. Abdulnour receives no royalties from sales of NEJM Healer and holds no equity in NEJM Healer. This study received no funding from MMS. Abdulnour reports funding from the Gordon and Betty Moore Foundation through the National Academy of Medicine Diagnostic Excellence Fellowship.

