Large language models surpass medical students in the German medical licensing examination
Published: 30 July 2024
Background and study aim: Large Language Models (LLMs) have gained significant attention and are seeing a growing range of applications across disciplines. Recent models such as ChatGPT and Gemini are increasingly used in the medical field, even though they are not specifically built for medical purposes.
This study therefore examines the capabilities of three publicly accessible LLMs, GPT-3.5, GPT-4.0, and Gemini-Pro, on the first (M1, preclinical) and second (M2, clinical) part of the German medical licensing examination, which consists of single-choice and case-based questions.
Material and methods: The original German questions of the 2023 autumn M1 and M2 exams were extracted and prompted to GPT-3.5, GPT-4.0, and Gemini-Pro. Image-based questions were excluded because of the models' limited input modalities. Model outputs were analyzed for accuracy, as well as for performance variations of the respective LLMs across case-based questions and the respective subjects. We also compared the models' performance to students' results.
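The scoring step described above (comparing model answers against the official answer key, overall and per subject) can be sketched as follows. This is a hypothetical illustration, not the study's actual pipeline; the question IDs, subjects, and answers are invented for demonstration.

```python
# Hypothetical sketch of the accuracy analysis: compare model answers
# against an answer key and compute overall and per-subject accuracy.
# All data below is invented for illustration.
from collections import defaultdict

def score_answers(questions, model_answers):
    """questions: list of dicts with 'id', 'subject', 'correct';
    model_answers: dict mapping question id -> chosen option letter."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for q in questions:
        correct, total = per_subject[q["subject"]]
        hit = model_answers.get(q["id"]) == q["correct"]
        per_subject[q["subject"]] = [correct + int(hit), total + 1]
    total_correct = sum(c for c, _ in per_subject.values())
    total_seen = sum(t for _, t in per_subject.values())
    overall = total_correct / max(total_seen, 1)
    return overall, {s: c / t for s, (c, t) in per_subject.items()}

# Example with invented exam data:
questions = [
    {"id": 1, "subject": "Anatomy", "correct": "B"},
    {"id": 2, "subject": "Anatomy", "correct": "D"},
    {"id": 3, "subject": "Physiology", "correct": "A"},
]
overall, by_subject = score_answers(questions, {1: "B", 2: "C", 3: "A"})
# overall is 2/3; Anatomy accuracy 0.5, Physiology accuracy 1.0
```

A per-difficulty breakdown, as reported in the results, would work the same way with a 'difficulty' field in place of 'subject'.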
Results: All models achieved passing scores for both the M1 and the M2. GPT-4.0 achieved near-perfect results, correctly answering 93.1% of questions in the M1 and 94.0% in the M2, well above the student averages of 73.4% (M1) and 74.9% (M2). GPT-3.5 and Gemini-Pro also passed the exams, but were less accurate in both the M1 (GPT-3.5: 77.10%, Gemini-Pro: 74.81%) and the M2 (GPT-3.5: 76.40%, Gemini-Pro: 65.54%). There were no significant differences in the models' ability to answer single-choice versus case-based questions. GPT-4.0 also performed better on difficult questions (M1: 78.60%; M2: 82.90%) than GPT-3.5 (M1: 42.90%; M2: 42.90%) and Gemini-Pro (M1: 35.70%; M2: 37.10%).
Conclusion: LLMs show improved capabilities in answering medical questions, which is of great importance for their implementation into clinical practice. Especially in areas such as decision support and quality assurance, LLMs have the potential to support physicians. Their capabilities also hold promise for medical students' education: in particular, the ability not only to answer questions but to explain the reasoning behind answer choices and respond to follow-up questions opens up possibilities for personalized education.
Take home messages: Large Language Models show increasing accuracy in answering medical exam questions, pass the German medical licensing examination, and in some cases outperform medical students.
LLMs open new possibilities for personalized medical education.
