gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

Leveraging Large Language Models to Decode Medical Concepts and Logical Connections in Clinical Guidelines

Meeting Abstract

  • Juliette Wegner - Universitätsmedizin Greifswald, Klinik für Anästhesie, Intensiv-, Notfall- und Schmerzmedizin, Greifswald, Germany
  • Lea Trautmann - Universitätsmedizin Greifswald, Klinik für Anästhesie, Intensiv-, Notfall- und Schmerzmedizin, Greifswald, Germany
  • Gregor Lichtner - Universitätsmedizin Greifswald, Klinik für Anästhesie, Intensiv-, Notfall- und Schmerzmedizin, Greifswald, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 172

doi: 10.3205/24gmds177, urn:nbn:de:0183-24gmds1776

Published: September 6, 2024

© 2024 Wegner et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Clinical practice guidelines are systematically developed statements designed to optimize patient care. Given the broad range of topics these guidelines cover, it can be challenging for healthcare practitioners to recognize every scenario in which a guideline recommendation applies. Computer-based guidance systems can offer decision support, but they require guidelines in a computer-interpretable form [1], [2]. Such guidelines are rarely available, however, mainly because translating narrative guidelines into computer-interpretable representations is time-consuming [3]. We therefore investigated whether large language models (LLMs) can help speed up this process. Specifically, we tested how effectively LLMs identify medical concepts and logical connections within clinical guidelines.

Methods: We used the INCEpTION platform [4] to annotate German medical guidelines by the German Society for Anaesthesiology and Intensive Care Medicine (DGAI). An annotation guide was developed based on the PICO (Population, Intervention, Comparison, Outcome) framework, focusing on the PI components, which are the most relevant for guidelines. We performed prompt engineering for GPT-4 to ensure parsable outputs and improve the micro-F1 score. Semantic equivalence between manual and GPT-4 annotations was assessed using cosine similarity of GottBERT embeddings, with manual review for accuracy. Medical expertise was not directly involved, as the process focused on matching concepts from the same underlying text, minimizing interpretation.
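The semantic-equivalence check described above can be sketched as a greedy nearest-neighbour matching over embedding vectors. The embedding step itself (GottBERT) is omitted, and the similarity threshold of 0.8 as well as the function names are illustrative assumptions, not values from the abstract:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_annotations(manual_vecs, llm_vecs, threshold=0.8):
    """Pair each LLM annotation with its most similar manual annotation,
    accepting the pair only if the similarity clears the threshold."""
    matches = []
    for i, vec in enumerate(llm_vecs):
        sims = [cosine_similarity(vec, m) for m in manual_vecs]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            matches.append((i, j, sims[j]))
    return matches
```

Pairs falling below such a threshold would then be resolved by the manual review step mentioned above.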

Results: We developed a corpus of annotated German medical guidelines, covering four different guidelines and a total of 266 recommendations. A major challenge in the manual annotation was the ambiguity of classifying a concept as either ‘population’ or ‘intervention’, which we also addressed in our annotation guide. Analysing the quality of GPT-4's annotations, we observed an average precision of 49.9% for population concepts and 49.0% for intervention concepts. When the PI labels were ignored, precision reached 57.4%. Recall showed a larger discrepancy between population and intervention: 48.7% and 74.6%, respectively. When the PI labels were ignored, recall reached 79.3%. Both precision and recall varied substantially depending on the recommendation that was annotated.
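The effect of ignoring the PI labels can be illustrated with a minimal micro precision/recall computation. This is a simplified sketch with invented toy data; it does not reproduce the study's embedding-based matching:

```python
def precision_recall(gold, predicted, use_labels=True):
    """Micro precision/recall over annotated concepts.

    gold, predicted: iterables of (concept, label) pairs, e.g.
    ("elective surgery", "intervention").  With use_labels=False a
    prediction counts as correct if the concept matches, regardless
    of its population/intervention label.
    """
    if use_labels:
        gold_set, pred_set = set(gold), set(predicted)
    else:
        gold_set = {concept for concept, _ in gold}
        pred_set = {concept for concept, _ in predicted}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall
```

A concept assigned the wrong PI label counts as both a false positive and a false negative under labelled scoring, but as a true positive once labels are ignored, which is why both metrics rise in the label-free evaluation.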

Discussion: We noticed that GPT-4 appeared to share some of the challenges encountered during manual annotation, particularly in deciding whether a concept belonged to the population or the intervention. This was reflected in the discrepancy between the precision and recall metrics. Another key reason for the observed precision and recall values is the difficulty of distinguishing core medical concepts from descriptive attributes in recommendations. This difficulty, which human annotators face as well, is compounded by the matching strategy, which may treat overlapping concepts as non-matches, leading to discrepancies in annotation accuracy.
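The overlap problem noted here can be made concrete: under an exact-match criterion, a predicted concept that covers only part of a gold annotation counts as a miss, whereas a relaxed criterion based on character-span overlap would give partial credit. A hypothetical sketch of such a relaxed criterion (not part of the study's evaluation):

```python
def relaxed_matches(gold_spans, pred_spans):
    """Count predicted (start, end) character spans that overlap at
    least one gold span, instead of requiring an exact match."""
    def overlap(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return sum(any(overlap(p, g) for g in gold_spans) for p in pred_spans)
```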

Conclusion: The main issue with the automatically annotated guidelines was the inaccurate categorization of concepts into population or intervention, despite satisfactory precision and recall when categories were ignored. While it seems impractical to rely solely on GPT-4 to annotate medical guidelines without human supervision, GPT-4 can assist human annotators by preprocessing guidelines and suggesting annotations, speeding up and simplifying the process. It may also be worth looking into LLMs that are specifically tailored to the processing of medical texts, with the aim of improving the quality of the automated annotation.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Lichtner G, Spies C, Jurth C, Bienert T, Mueller A, Kumpf O, et al. Automated Monitoring of Adherence to Evidenced-Based Clinical Guideline Recommendations: Design and Implementation Study. Journal of Medical Internet Research. 2023 May 4;25(1):e41177.
2. Lichtner G, Alper BS, Jurth C, Spies C, Boeker M, Meerpohl JJ, et al. Representation of evidence-based clinical practice guideline recommendations on FHIR. Journal of Biomedical Informatics. 2023 Mar 1;139:104305.
3. Michaels M. Adapting Clinical Guidelines for the Digital Age: Summary of a Holistic and Multidisciplinary Approach. American Journal of Medical Quality. 2023 Oct;38(5S):S3.
4. Klie JC, Bugert M, Boullosa B, Eckart de Castilho R, Gurevych I. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. In: Zhao D, editor. Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations [Internet]. Santa Fe, New Mexico: Association for Computational Linguistics; 2018 [cited 2024 Apr 16]. p. 5–9. Available from: https://aclanthology.org/C18-2002