Can GPT-4 Accurately Analyze Multilingual Medical Notes?

January 6, 2025

The ability of Generative Pre-trained Transformer 4 (GPT-4) to analyze medical notes in multiple languages has drawn significant interest from the medical and technology communities. A comprehensive evaluation study recently published in The Lancet Digital Health by a consortium of researchers assessed GPT-4’s performance on medical notes written in English, Spanish, and Italian. The study is particularly relevant for healthcare settings where multilingual medical documentation is common, and it highlights the challenges posed by the unstructured nature of clinical narratives and the variability in documentation styles among healthcare providers. Understanding how GPT-4 processes and interprets these diverse notes could pave the way for advances in automated patient-care documentation.

The Importance of Medical Notes in Patient Care

Medical notes serve as an essential tool for documenting patient care, capturing invaluable clinical insights that include medical histories, treatment plans, and patient progress. These notes play a crucial role in the medical decision-making process by providing healthcare providers with the necessary information to deliver accurate and timely care. However, their unstructured format, typically full of free text and clinician-specific jargon, poses significant challenges for automated data extraction and analysis.

Large language models (LLMs) like GPT-4 have demonstrated potential in extracting explicit details from such notes, for instance medications and dosages, yet they often struggle to interpret the implicit contextual information necessary for more nuanced medical decision-making. Most research to date has focused on LLMs’ effectiveness with English-language medical notes, making the need to evaluate performance across different languages and in diverse healthcare settings increasingly apparent. This gap prompted the consortium to undertake a wider study to understand and improve the applicability of LLMs like GPT-4 in global medical practice.

Study Methodology and Scope

To achieve a comprehensive understanding of GPT-4’s capabilities, a retrospective model-evaluation study was organized, involving eight university hospitals located in the United States, Colombia, Singapore, and Italy. These institutions are part of the 4CE Consortium and include Boston Children’s Hospital, University of Michigan, University of Wisconsin, National University of Singapore, University of Kansas Medical Center, University of Pittsburgh Medical Center, Universidad de Antioquia, and Istituti Clinici Scientifici Maugeri. The study was coordinated by the Department of Biomedical Informatics at Harvard University.

Each of the participating sites contributed seven de-identified medical notes written between February 1, 2020, and June 1, 2023, resulting in a total of 56 medical notes. The selection included admission, progress, and consultation notes but excluded discharge summaries. Patients were typically aged 18-65 years with diagnoses of obesity and COVID-19 at admission, although adhering to these criteria was optional. The scope and diversity of the notes provided a robust basis for evaluating GPT-4’s performance across different types of medical documentation and languages.

Analysis Process and Evaluation

The analysis used GPT-4’s API in Python through a predefined question-answer framework, with parameters such as temperature, top-p, and frequency penalty tuned for optimal performance. After GPT-4 generated responses for each medical note, physicians evaluated the answers by indicating their agreement or disagreement. The physicians were masked to each other’s evaluations, though necessarily not to GPT-4’s responses, which supported independent assessment of the model’s output.
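
To make the setup concrete, here is a minimal sketch of what such a question-answer loop might look like with the OpenAI Python client. The question wording, model identifier, and parameter values below are placeholders for illustration, not the ones used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Placeholder questions; the study's actual question set is not reproduced here.
QUESTIONS = [
    "Is the patient between 18 and 65 years old?",
    "Does the note document a diagnosis of obesity?",
    "Does the note document a current COVID-19 infection?",
    "Is this an admission note?",
]

def ask_gpt4(note_text: str, question: str) -> str:
    """Pose a single question about one de-identified note and return GPT-4's answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer questions about the clinical note provided by the user."},
            {"role": "user", "content": f"Note:\n{note_text}\n\nQuestion: {question}"},
        ],
        temperature=0,        # illustrative settings; the study tuned these parameters
        top_p=1,
        frequency_penalty=0,
    )
    return response.choices[0].message.content

# answers = [ask_gpt4(note_text, q) for q in QUESTIONS]
```

Physician reviewers would then mark each returned answer as agree or disagree, as described above.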

Statistical analyses were subsequently performed to gauge the agreement between physician evaluations and GPT-4’s responses. Errors were categorized as extraction, inference, or hallucination issues. Out of the 56 medical notes analyzed, 42 were in English, and seven each in Italian and Spanish. This resulted in a total of 784 responses generated by GPT-4. Physicians agreed with these responses in 622 out of 784 cases, a 79% agreement rate. In 82 responses (11%), one physician agreed, while both disagreed in 80 responses (10%).
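
These headline figures follow directly from the raw counts; the short sketch below simply recomputes the proportions from the numbers reported above.

```python
# Raw counts reported by the study: 784 GPT-4 responses across 56 notes.
counts = {
    "both physicians agreed": 622,
    "only one physician agreed": 82,
    "both physicians disagreed": 80,
}
total = sum(counts.values())  # 784

for outcome, n in counts.items():
    print(f"{outcome}: {n}/{total} = {n / total:.1%}")
# both physicians agreed: 622/784 = 79.3%
# only one physician agreed: 82/784 = 10.5%
# both physicians disagreed: 80/784 = 10.2%
```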

Performance Across Different Languages

A noteworthy finding was that physicians agreed with GPT-4 more often for Spanish (88%) and Italian (84%) notes than for English notes (77%). This suggests that GPT-4 may perform better on non-English notes, potentially because U.S. medical notes tend to be more complex, with varied structure and extensive use of acronyms and specialized terminology. However, neither the complexity nor the length of the notes appeared to influence the model’s performance significantly.

Analysis of areas where physicians disagreed with GPT-4 revealed several critical issues. Inference issues, where different interpretations of implicit information arose, were common. For example, discrepancies often appeared when inferring a patient’s current COVID-19 status. Extraction errors involved GPT-4 missing or misidentifying explicit details, like documented medical history. Hallucination issues, where GPT-4 fabricated information not present in the notes, also occurred, such as incorrectly asserting a patient had COVID-19 when it had not been mentioned.

Challenges in Implicit Information and Contextual Understanding

Inference issues were the most common cause of partial disagreement: in cases where only one physician agreed with GPT-4’s response, they accounted for 72% of the disagreements, while extraction errors made up 10% and differences in the physicians’ level of agreement the remaining 18%. Where both physicians disagreed with GPT-4, 59% of the issues were due to inference problems, 29% to extraction errors, and 13% to hallucinations.

The researchers also evaluated GPT-4’s ability to select patients for hypothetical study enrollment based on criteria such as age, obesity, COVID-19 status, and note type, with mixed results. GPT-4 showed high sensitivity in identifying obesity (97%), COVID-19 status (96%), and age (94%), but poor specificity for admission notes (22%). When the note-type criterion was excluded, GPT-4 accurately identified the remaining three criteria in 90% of cases. This highlights the difficulty GPT-4 has with implicit structural cues, handling of which is crucial for effective medical-documentation analysis.
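
For context, sensitivity and specificity carry their usual meanings here: the share of truly eligible notes the model flags, and the share of ineligible notes it correctly rules out. The sketch below defines both on made-up counts purely for illustration; it does not use the study’s underlying data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of notes that truly meet a criterion which the model flags as meeting it."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Share of notes that truly do not meet a criterion which the model correctly rules out."""
    return true_neg / (true_neg + false_pos)

# Made-up counts for a single enrollment criterion, for illustration only.
print(f"sensitivity: {sensitivity(true_pos=45, false_neg=5):.0%}")   # 90%
print(f"specificity: {specificity(true_neg=30, false_pos=20):.0%}")  # 60%
```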

Implications for Future Research and Application

Taken together, the consortium’s findings suggest that GPT-4 can already extract explicit information from medical notes with high accuracy across English, Spanish, and Italian, but that inference, extraction, and hallucination errors, particularly around implicit contextual details such as a patient’s current COVID-19 status or the type of note, remain obstacles to unsupervised use. Further evaluation across additional languages, healthcare settings, and note types, along with strategies for detecting and mitigating these error modes, will be needed before models like GPT-4 can reliably support automated patient-care documentation and tasks such as screening patients for study enrollment.
