Large language models surpass medical students in the German medical licensing examination
Published: 30 July 2024
Background and study aim: Large Language Models (LLMs) have gained significant attention and are seeing a growing range of applications across disciplines. Recent models such as ChatGPT and Gemini are increasingly used in the medical field, even though they are not specifically built for medical purposes.
This study therefore examines the capabilities of three publicly accessible LLMs, GPT-3.5, GPT-4.0, and Gemini-Pro, on the first (M1, preclinical) and second (M2, clinical) part of the German medical licensing examination, which consists of single-choice and case-based questions.
Material and methods: The original German questions of the 2023 autumn M1 and M2 exams were extracted and prompted to GPT-3.5, GPT-4.0, and Gemini-Pro. Image-based questions were excluded because of the models' limited input modalities. Model outputs were analyzed for accuracy, as well as for performance variations of the respective LLMs across case-based questions and the respective subjects. We also compared the models' performance to students' results.
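The scoring step described above (comparing model answers against the official answer key, overall and per subject) can be sketched as follows. This is a hypothetical illustration, not the study's actual pipeline; the question IDs, subjects, and answers are invented for demonstration.

```python
# Hypothetical sketch of the accuracy analysis: compare model answers
# against an answer key and compute overall and per-subject accuracy.
# All data below is invented for illustration.
from collections import defaultdict

def score_answers(questions, model_answers):
    """questions: list of dicts with 'id', 'subject', 'correct';
    model_answers: dict mapping question id -> chosen option letter."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for q in questions:
        correct, total = per_subject[q["subject"]]
        hit = model_answers.get(q["id"]) == q["correct"]
        per_subject[q["subject"]] = [correct + int(hit), total + 1]
    total_correct = sum(c for c, _ in per_subject.values())
    total_seen = sum(t for _, t in per_subject.values())
    overall = total_correct / max(total_seen, 1)
    return overall, {s: c / t for s, (c, t) in per_subject.items()}

# Example with invented exam data:
questions = [
    {"id": 1, "subject": "Anatomy", "correct": "B"},
    {"id": 2, "subject": "Anatomy", "correct": "D"},
    {"id": 3, "subject": "Physiology", "correct": "A"},
]
overall, by_subject = score_answers(questions, {1: "B", 2: "C", 3: "A"})
# overall is 2/3; Anatomy accuracy 0.5, Physiology accuracy 1.0
```

A per-difficulty breakdown, as reported in the results, would work the same way with a 'difficulty' field in place of 'subject'.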
Results: All models achieved passing scores for both the M1 and the M2. GPT-4.0 achieved near-perfect results, correctly answering 93.1% of questions in the M1 and 94.0% in the M2, well above the student averages of 73.4% (M1) and 74.9% (M2). GPT-3.5 and Gemini-Pro also passed the exams, but were less accurate in both the M1 (GPT-3.5: 77.10%, Gemini-Pro: 74.81%) and the M2 (GPT-3.5: 76.40%, Gemini-Pro: 65.54%). There were no significant differences in the models' ability to answer single-choice versus case-based questions. GPT-4.0 also performed better on difficult questions (M1: 78.60%; M2: 82.90%) than GPT-3.5 (M1: 42.90%; M2: 42.90%) and Gemini-Pro (M1: 35.70%; M2: 37.10%).
Conclusion: LLMs show improved capabilities in answering medical questions, which is of great importance for their implementation into clinical practice. Especially in areas such as decision support and quality assurance, LLMs have the potential to support physicians. Their capabilities also hold promise for medical students' education: in particular, the ability not only to answer questions but to explain the reasoning behind answer choices and respond to follow-up questions opens up possibilities for personalized education.
Take home messages: Large Language Models show increasing accuracy in answering medical exam questions, pass the German medical licensing examination, and in some cases outperform medical students.
LLMs open new possibilities for personalized medical education.
