Cognitive Shortcomings of AI Chatbots in Dementia Screening Tests

December 19, 2024

Recent research has examined the cognitive limitations of large language models (LLMs), or “chatbots,” when subjected to dementia screening tests traditionally used on humans. The study, published in the Christmas issue of The BMJ, found that almost all leading LLMs exhibit signs of mild cognitive impairment. This finding challenges the assumption that artificial intelligence (AI) will soon replace human doctors in clinical settings, particularly in diagnosing early signs of dementia.

The Study and Its Objectives

Examining AI’s Diagnostic Capabilities

Significant advances in artificial intelligence have sparked debate about whether AI can outperform human physicians in medical tasks. Previous research has shown that LLMs are proficient in various diagnostic functions, yet their susceptibility to human-like cognitive impairments had not been examined until now. To bridge this gap, researchers assessed the cognitive abilities of prominent, publicly available LLMs using the Montreal Cognitive Assessment (MoCA) test. The study sought to provide a more comprehensive understanding of how these AI systems perform when faced with diagnostic challenges typically managed by human clinicians.

The MoCA test is a widely recognized tool for detecting cognitive impairment and early signs of dementia, predominantly in older adults. It involves a series of short tasks and questions designed to evaluate different cognitive abilities, including attention, memory, language, visuospatial skills, and executive functions. The highest achievable score on the MoCA test is 30 points, with a score of 26 or above generally considered indicative of normal cognitive function. Using this assessment, researchers aimed to identify the potential cognitive limitations of AI chatbots and to understand whether these models could effectively contribute to medical diagnostic tasks.

Performance of Leading LLMs

ChatGPT and Claude’s Scores

The study evaluated several LLMs, including ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet). Among these, ChatGPT 4o achieved the highest score, 26 out of 30 points, just reaching the threshold generally considered normal; ChatGPT 4 and Claude followed closely, each scoring 25 out of 30, just below it. In other words, only the best-performing model fell within the normal cognitive range as defined by the MoCA test, while the others showed scores consistent with mild cognitive impairment.
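To make the scoring rule concrete, the short sketch below applies the MoCA convention described above (a score of 26 or more out of 30 is generally considered normal) to the scores reported in the study. This is an illustrative sketch only; the score table and the classify helper are constructed for this article and are not code from the BMJ paper.

```python
# Illustrative only: applying the conventional MoCA cutoff to the
# scores reported in the BMJ study. The score table and the helper
# below are constructed for this article, not taken from the paper.

MOCA_MAX = 30
NORMAL_CUTOFF = 26  # a score of 26 or above is generally considered normal

reported_scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

def classify(score: int) -> str:
    """Interpret a MoCA score using the conventional 26-point cutoff."""
    return "normal range" if score >= NORMAL_CUTOFF else "below normal cutoff"

for model, score in reported_scores.items():
    print(f"{model}: {score}/{MOCA_MAX} -> {classify(score)}")
```

Run as written, this prints that only ChatGPT 4o reaches the normal range, matching the pattern described above.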

Nonetheless, the chatbots’ scores varied considerably across specific cognitive domains. While they excelled in areas such as naming, attention, language, and abstraction, they struggled with tasks requiring more complex visuospatial skills and executive functions. These findings underscore that, although AI has made significant strides in certain respects, there remain critical areas where human cognitive abilities outshine those of chatbots.

Gemini’s Lower Performance

In contrast, Gemini 1.0 scored significantly lower, with only 16 out of 30 points. This marked difference between Gemini and the other models highlights a considerable gap in cognitive performance among LLMs. Notably, older versions of the chatbots tended to perform worse on the tests, showing a pattern of decline similar to the cognitive deterioration often seen in older human patients. These results suggest that while newer models of AI show some improvement, there remains a significant disparity between their cognitive capabilities and those of humans.

This raises important questions about the reliability and consistency of AI in medical diagnostics. The substantial variation in performance across models and versions indicates that the technology is still maturing and will need further refinement before it can reliably support clinical applications. Moreover, the lower performance on tasks demanding advanced cognitive functions points to an inherent limitation of the current generation of LLMs.

Specific Cognitive Challenges

Visuospatial Skills and Executive Functions

All chatbots performed poorly on tasks requiring visuospatial skills and executive functions. These included the trail-making task, which involves connecting encircled numbers and letters in ascending order, and the clock drawing test, where participants must draw a clock face showing a specific time. Both tasks are central to diagnosing cognitive impairment, and the chatbots’ failures at the planning, organization, and spatial reasoning they demand point to a pronounced weakness in current AI capabilities.

These failures are particularly concerning given the importance of visuospatial skills and executive functions in diagnosing and managing cognitive disorders, and in everyday tasks and decision-making more broadly. AI systems that lack proficiency in these areas may struggle to provide the comprehensive, nuanced support needed in clinical practice, suggesting that further research and development are required before LLMs can assist in medical diagnostics without compromising quality or accuracy.

Delayed Recall and Complex Tasks

The Gemini models, in particular, failed the delayed recall task, which requires remembering a sequence of five words. Despite these challenges, most chatbots performed well in other cognitive areas such as naming, attention, language, and abstraction. However, they struggled with more complex visuospatial tasks that demanded interpreting a visual scene, and with responses calling for empathy. Both are essential to holistic patient care, underscoring the gap between AI capabilities and the cognitive processes of human clinicians.

The inability to perform tasks requiring delayed recall and advanced visuospatial skills suggests that while AI may support certain diagnostic areas, it is not yet equipped to handle the full spectrum of cognitive assessments needed in clinical settings, and that continued human involvement and oversight remain essential. Moreover, the difficulty AI shows with tasks requiring empathy and emotional intelligence points to an inherent limit on its ability to replicate the human side of clinical care, underscoring the need for a balanced approach that leverages AI’s strengths while acknowledging its current limitations.

Implications for Clinical Use

Limitations in Visual Abstraction and Executive Function

The study acknowledges the fundamental differences between the human brain and large language models. While LLMs can excel in specific diagnostic tasks, their uniform failure on tasks requiring visual abstraction and executive function points to a significant limitation, one that is particularly relevant to their potential use in clinical settings, where such cognitive skills are crucial. The research highlights that while AI technology continues to advance rapidly, its application in areas requiring complex cognitive processing remains limited.

Given these findings, it is clear that AI’s role in clinical settings must be carefully considered. While LLMs can support medical professionals in various ways, they are not yet capable of replacing the nuanced cognitive skills required in diagnosing and managing complex medical conditions. This reinforces the need for a collaborative approach, where AI supports rather than replaces human clinicians, ensuring the highest possible standards of care.

Future of AI in Medicine

The authors of the study conclude that neurologists are unlikely to be replaced by large language models in the foreseeable future. Instead, they foresee a scenario in which neurologists encounter a new kind of “virtual patient” – large language models presenting with cognitive impairments similar to those seen in humans. This underscores the need for caution in integrating AI into medical diagnostics and the importance of continued human oversight. The study’s findings point to a future where AI and human expertise coexist, each complementing the other’s strengths.

Overall, the study provides a nuanced understanding of the capabilities and limitations of current large language models in a clinical context. The results emphasize the importance of recognizing these limitations to ensure the responsible and effective application of AI in healthcare. By identifying specific areas where LLMs fall short, the research contributes to a more informed dialogue on the future roles of AI in medicine and underscores the irreplaceable value of human cognitive skills in diagnosing and managing complex medical conditions.

Conclusion

As this study, featured in the Christmas issue of The BMJ, demonstrates, nearly all prominent LLMs show signs of mild cognitive impairment when given dementia screening tests usually administered to humans. These findings call into question the assumption that artificial intelligence will soon be able to replace human physicians in clinical environments, particularly for the early detection of dementia.

These findings underscore the ongoing challenge AI faces in achieving human-like diagnostic proficiency. While AI holds promise for many applications in healthcare, its current cognitive shortcomings mean further advances are needed before it can fully take on roles requiring intricate medical evaluations. The research emphasizes the importance of maintaining a critical perspective on AI’s capabilities in medicine and reinforces the necessity of human expertise in diagnosing complex conditions like dementia.
