gms | German Medical Science

Deutscher Kongress für Orthopädie und Unfallchirurgie (DKOU 2024)

October 22–25, 2024, Berlin

Analyzing large language models’ responses to common lumbar spine fusion surgery questions: A comparison between ChatGPT and Bard

Meeting Abstract

  • presenting/speaker Siegmund Lang - Stanford University School of Medicine, Department of Neurosurgery, Stanford, United States
  • Yoseph Ezra Tilahun - Stanford University School of Medicine, Department of Neurosurgery, Stanford, United States
  • Aneysis D. Gonzalez Suarez - Stanford University School of Medicine, Department of Neurosurgery, Stanford, United States
  • Robert Kim - Stanford University School of Medicine, Department of Neurosurgery, Stanford, United States
  • Parastou Fatemi - Cleveland Clinic, Department of Neurosurgery, Cleveland, United States
  • Katherine Wagner - Ventura Neurosurgery, Department of Neurosurgery, Ventura, United States
  • Nicolai Maldaner - University Hospital Zurich & Clinical Neuroscience Center, Department of Neurosurgery, Zürich, Switzerland
  • Martin N. Stienen - Cantonal Hospital St. Gallen & Medical School of St.Gallen, Department of Neurosurgery, Spine Center of Eastern Switzerland, St. Gallen, Switzerland
  • Corinna Zygourakis - Stanford University School of Medicine, Department of Neurosurgery, Stanford, United States

Deutscher Kongress für Orthopädie und Unfallchirurgie (DKOU 2024). Berlin, 22.-25.10.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAB35-3034

doi: 10.3205/24dkou140, urn:nbn:de:0183-24dkou1404

Published: October 21, 2024

© 2024 Lang et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license details, see http://creativecommons.org/licenses/by/4.0/.


Outline

Text

Objectives: In the digital era, patients often seek information about lumbar spine fusion surgery from numerous websites of variable quality. Large language models (LLMs) like ChatGPT may offer a transformative approach to patient education, but they should be studied carefully before widespread use. The goal of our study was to evaluate the ability of OpenAI's ChatGPT 3.5 and Google's Bard to respond to common patient questions regarding lumbar spine fusion surgery.

Methods: A Google search yielded 158 frequent patient questions, which were narrowed to 10 critical ones given as prompts to both chatbots (Table 1 [Tab. 1]). Responses were rated for clarity, accuracy, and comprehensiveness by 5 blinded spine surgeons using a 4-point scale ranging from ‘unsatisfactory’ to ‘excellent’. Additionally, raters evaluated the clarity and professionalism of both answer sets using a 5-point Likert scale.

Results: Across both language models and all 10 questions, 97% of responses were excellent or satisfactory. The majority of LLM responses were either excellent (64%) or required minimal clarification (28%), while only 5% required moderate clarification and 3% substantial clarification/unsatisfactory. ChatGPT 3.5 had 62% excellent, 32% minimal clarification, 2% moderate clarification, and 4% substantial clarification/unsatisfactory responses. Bard produced 66% excellent, 24% minimal clarification, 8% moderate clarification, and 2% substantial clarification/unsatisfactory responses. There was no significant difference in overall rating distribution between the ChatGPT and Bard models (Figure 1 [Fig. 1]).
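The reported comparison of the overall rating distributions can be sketched as a chi-square test of homogeneity. The counts below are reconstructed from the reported percentages assuming 50 ratings per model (5 raters × 10 questions); this sample-size assumption and the resulting integer counts are illustrative, not the study's raw data:

```python
# Chi-square test of homogeneity for two rating distributions.
# Counts are reconstructed from the abstract's percentages assuming
# 50 ratings per model (5 raters x 10 questions) -- an assumption,
# not the study's raw data.

categories = ["excellent", "minimal", "moderate", "substantial/unsat."]
chatgpt = [31, 16, 1, 2]   # 62%, 32%, 2%, 4% of 50 ratings
bard    = [33, 12, 4, 1]   # 66%, 24%, 8%, 2% of 50 ratings

def chi_square(rows):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    col_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    grand = sum(row_totals)
    stat = 0.0
    for row, rt in zip(rows, row_totals):
        for obs, ct in zip(row, col_totals):
            exp = rt * ct / grand          # expected count under homogeneity
            stat += (obs - exp) ** 2 / exp
    return stat

stat = chi_square([chatgpt, bard])
# df = (2 - 1) * (4 - 1) = 3; critical value at alpha = 0.05 is 7.815
print(f"chi2 = {stat:.3f}, significant at 0.05: {stat > 7.815}")
```

Under these assumed counts the statistic falls well below the critical value, consistent with the abstract's finding of no significant difference between the two models' rating distributions.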

For three specific questions (Q3, Q4, and Q5), both models struggled to provide detailed, personalized responses. Q3 involved complex surgical risks, Q4 focused on variable success rates, and Q5 dealt with the intricacies of selecting the optimal surgical approach. Inter-rater reliability was poor (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = −0.040, p = 0.601). Both models scored high on the Likert scale for understanding and empathy. However, Bard was rated slightly lower in empathy and professionalism by some raters, although this was not statistically significant.
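A multi-rater agreement coefficient such as Fleiss' kappa can be sketched as follows; the rating counts here are hypothetical (not the study's data), chosen only to show how kappa can sit near zero even when most raters pick the same grade:

```python
# Sketch of Fleiss' kappa for multiple raters on a fixed grading scale.
# The counts below are hypothetical, not the study's raw ratings: each
# row is one question, each column counts how many of the 5 raters
# chose that grade (excellent .. substantial clarification/unsatisfactory).

def fleiss_kappa(table):
    """Fleiss' kappa for an items-by-categories table of rating counts."""
    n_items = len(table)
    n_raters = sum(table[0])
    total = n_items * n_raters
    # Mean per-item observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the marginal category proportions
    p_e = sum((sum(col) / total) ** 2 for col in zip(*table))
    return (p_bar - p_e) / (1 - p_e)

counts = [
    [4, 1, 0, 0],  # hypothetical question 1: 4 of 5 raters said "excellent"
    [5, 0, 0, 0],
    [3, 2, 0, 0],
    [4, 0, 1, 0],
]
print(f"kappa = {fleiss_kappa(counts):.3f}")
```

Because one category ("excellent") dominates the marginals, chance agreement is high and kappa lands near zero despite substantial raw agreement; this well-known property of kappa is one plausible reason the study observed poor inter-rater reliability alongside mostly excellent ratings.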

Conclusion: In the majority of cases, ChatGPT 3.5 and Bard provided accurate, comprehensive, and accessible answers to frequently asked questions about lumbar spine fusion. The role of LLMs in patient education requires further investigation and appropriate training to ensure high quality and reliability in healthcare communications.