Study uses dementia screening test to reveal cognitive limitations of chatbots

Almost all leading large language models, or “chatbots”, show signs of mild cognitive impairment in a test widely used to spot early signs of dementia, finds a study in the Christmas issue of The BMJ.
The results also show that, much like older patients, “older” versions of the chatbots tend to perform worse on the test. The authors say these findings “challenge the assumption that artificial intelligence will soon replace human doctors.”
Major advances in the field of artificial intelligence have sparked both excited and fearful speculation about whether chatbots will be able to surpass human doctors.
Although several studies have shown that large language models (LLMs) are remarkably adept at a range of medical diagnostic tasks, their susceptibility to human impairments such as cognitive decline has not yet been investigated.
To fill this knowledge gap, researchers assessed the cognitive abilities of the leading publicly available LLMs, ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet), using the Montreal Cognitive Assessment (MoCA) test.
The MoCA test is widely used to detect early signs of cognitive impairment and dementia, usually in older adults. It assesses abilities such as attention, memory, language, visuospatial skills, and executive function through a number of short tasks and questions. The maximum score is 30 points, and a score of 26 or higher is generally considered normal.
The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by practicing neurologists.
ChatGPT 4o achieved the highest score (26 out of 30) in the MoCA test, followed by ChatGPT 4 and Claude (25 out of 30), while Gemini 1.0 had the lowest score (16 out of 30).
All chatbots performed poorly on visuospatial skills and executive tasks, such as the trail-making task (connecting circled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). The Gemini models failed the delayed recall task (recalling a five-word sequence).
Most other tasks, such as naming, attention, language, and abstraction, were performed well by all chatbots.
However, in further visuospatial testing, the chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
These are observational findings, and the authors acknowledge fundamental differences between the human brain and large language models.
However, they point out that the uniform failure of all large language models on tasks requiring visual abstraction and executive function highlights a significant weakness that may hinder their use in clinical settings.
The researchers therefore conclude: “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings also suggest that they may soon find themselves treating new, virtual patients: artificial intelligence models presenting with cognitive impairment.”
