Analyzing large language models’ responses to common lumbar spine fusion surgery questions: A comparison between ChatGPT and Bard
Published: October 21, 2024
Objectives: In this digital era, patients often seek information about lumbar spine fusion surgery across numerous websites of variable quality. Large language models (LLMs) such as ChatGPT may offer transformative approaches to patient education, but they should be studied carefully before widespread use. The goal of our study was to evaluate the ability of OpenAI's ChatGPT 3.5 and Google's Bard to respond to common patient questions regarding lumbar spine fusion surgery.
Methods: A Google search yielded 158 frequently asked patient questions, which were narrowed down to the 10 most critical; these were given as prompts to both chatbots (Table 1). Responses were rated for clarity, accuracy, and comprehensiveness by 5 blinded spine surgeons on a 4-point scale ranging from ‘unsatisfactory’ to ‘excellent’. Additionally, raters evaluated the clarity and professionalism of both answer sets on a 5-point Likert scale.
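The abstract does not state how the prompts were submitted; the sketch below assumes programmatic access via the OpenAI Python client purely for illustration (Bard had no comparable public API at the time, and the study may well have used the web interfaces). The listed questions are hypothetical stand-ins, not the study's Table 1 items.

```python
# Illustrative sketch: submitting patient questions to ChatGPT 3.5 via
# the OpenAI Python client. Assumes OPENAI_API_KEY is set in the
# environment; the questions are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

questions = [
    "What is lumbar spine fusion surgery?",
    "How long is the recovery after lumbar fusion?",
    # ... the study's remaining prompts ...
]

responses = {}
for q in questions:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": q}],
    )
    responses[q] = completion.choices[0].message.content
```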
Results: Across both language models and all 10 questions, 97% of responses were rated satisfactory or better. The majority of LLM responses were either excellent (64%) or required only minimal clarification (28%), while 5% required moderate clarification and 3% were rated substantial clarification/unsatisfactory. ChatGPT 3.5 produced 62% excellent, 32% minimal clarification, 2% moderate clarification, and 4% substantial clarification/unsatisfactory responses. Bard produced 66% excellent, 24% minimal clarification, 8% moderate clarification, and 2% substantial clarification/unsatisfactory responses. There was no significant difference in the overall rating distribution between the ChatGPT and Bard models (Figure 1).
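The abstract does not name the test behind the "no significant difference" finding; a chi-square test of independence on the two rating distributions is one plausible choice. The counts below are reconstructed from the reported percentages under the assumption of 50 ratings per model (5 raters × 10 questions); with cells this small, Fisher's exact test on a collapsed table would be a common alternative.

```python
# Hedged sketch: chi-square comparison of the two rating distributions.
# Counts reconstructed from the reported percentages, assuming 50
# ratings per model (5 raters x 10 questions) -- an assumption, not a
# figure stated in the abstract.
from scipy.stats import chi2_contingency

#           excellent, minimal, moderate, substantial/unsatisfactory
chatgpt = [31, 16, 1, 2]   # 62%, 32%, 2%, 4% of 50
bard    = [33, 12, 4, 1]   # 66%, 24%, 8%, 2% of 50

chi2, p, dof, expected = chi2_contingency([chatgpt, bard])
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # expect a non-significant p
```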
For three specific questions (Q3, Q4, and Q5), both models struggled to provide detailed, personalized responses. Q3 involved complex surgical risks, Q4 focused on variable success rates, and Q5 dealt with the intricacies of selecting the optimal surgical approach. Inter-rater reliability was poor (ChatGPT: κ=0.041, p=0.622; Bard: κ=-0.040, p=0.601). Both models scored high on the Likert scale for understanding and empathy. However, Bard was rated slightly lower in empathy and professionalism by some raters, although this difference was not statistically significant.
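With five raters and more than two rating categories, the reported coefficients are consistent with Fleiss' kappa, though the abstract does not name the statistic. A minimal sketch, assuming Fleiss' kappa and using an illustrative rating matrix rather than the study's data:

```python
# Minimal sketch of a Fleiss' kappa computation for 5 raters rating
# 10 questions on a 4-point scale. The ratings below are randomly
# generated placeholders, not the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = 10 questions, columns = 5 raters; values = category 0-3
ratings = np.random.default_rng(0).integers(0, 4, size=(10, 5))

table, _ = aggregate_raters(ratings)   # per-question category counts
kappa = fleiss_kappa(table)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Kappa values near zero, as reported here, indicate agreement no better than chance, which is worth keeping in mind when interpreting the aggregate rating percentages.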
Conclusion: In the majority of cases, ChatGPT 3.5 and Bard provided accurate, comprehensive, and accessible information in response to frequently asked questions about lumbar spine fusion. The role of LLMs in medical patient education requires further investigation and appropriate training to ensure high quality and reliability in healthcare communications.
