Analyzing large language models’ responses to common lumbar spine fusion surgery questions: A comparison between ChatGPT and Bard
Published: October 21, 2024
Objectives: In this digital era, patients often seek information about lumbar spine fusion surgery across numerous websites of variable quality. Large language models (LLMs) such as ChatGPT may offer transformative approaches to patient education, but they should be studied carefully before widespread use. The goal of our study was to evaluate the ability of OpenAI's ChatGPT 3.5 and Google's Bard to respond to common patient questions regarding lumbar spine fusion surgery.
Methods: A Google search yielded 158 frequently asked patient questions, which were narrowed down to the 10 most critical; these were given as prompts to both chatbots (Table 1). Responses were rated for clarity, accuracy, and comprehensiveness by 5 blinded spine surgeons on a 4-point scale ranging from ‘unsatisfactory’ to ‘excellent’. Additionally, raters evaluated the clarity and professionalism of both answer sets on a 5-point Likert scale.
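The abstract does not state how the prompts were submitted; the sketch below assumes programmatic access via the OpenAI Python client purely for illustration (Bard had no comparable public API at the time, and the study may well have used the web interfaces). The listed questions are hypothetical stand-ins, not the study's Table 1 items.

```python
# Illustrative sketch: submitting patient questions to ChatGPT 3.5 via
# the OpenAI Python client. Assumes OPENAI_API_KEY is set in the
# environment; the questions are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

questions = [
    "What is lumbar spine fusion surgery?",
    "How long is the recovery after lumbar fusion?",
    # ... the study's remaining prompts ...
]

responses = {}
for q in questions:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": q}],
    )
    responses[q] = completion.choices[0].message.content
```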
Results: Across both language models and all 10 questions, 97% of responses were rated satisfactory or better. The majority of LLM responses were either excellent (64%) or required only minimal clarification (28%), while 5% required moderate clarification and 3% were rated substantial clarification/unsatisfactory. ChatGPT 3.5 produced 62% excellent, 32% minimal clarification, 2% moderate clarification, and 4% substantial clarification/unsatisfactory responses. Bard produced 66% excellent, 24% minimal clarification, 8% moderate clarification, and 2% substantial clarification/unsatisfactory responses. There was no significant difference in the overall rating distribution between the ChatGPT and Bard models (Figure 1).
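The abstract does not name the test behind the "no significant difference" finding; a chi-square test of independence on the two rating distributions is one plausible choice. The counts below are reconstructed from the reported percentages under the assumption of 50 ratings per model (5 raters × 10 questions); with cells this small, Fisher's exact test on a collapsed table would be a common alternative.

```python
# Hedged sketch: chi-square comparison of the two rating distributions.
# Counts reconstructed from the reported percentages, assuming 50
# ratings per model (5 raters x 10 questions) -- an assumption, not a
# figure stated in the abstract.
from scipy.stats import chi2_contingency

#           excellent, minimal, moderate, substantial/unsatisfactory
chatgpt = [31, 16, 1, 2]   # 62%, 32%, 2%, 4% of 50
bard    = [33, 12, 4, 1]   # 66%, 24%, 8%, 2% of 50

chi2, p, dof, expected = chi2_contingency([chatgpt, bard])
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # expect a non-significant p
```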
For three specific questions (Q3, Q4, and Q5), both models struggled to provide detailed, personalized responses. Q3 involved complex surgical risks, Q4 focused on variable success rates, and Q5 dealt with the intricacies of selecting the optimal surgical approach. Inter-rater reliability was poor (ChatGPT: κ=0.041, p=0.622; Bard: κ=-0.040, p=0.601). Both models scored high on the Likert scale for understanding and empathy. However, Bard was rated slightly lower in empathy and professionalism by some raters, although this difference was not statistically significant.
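With five raters and more than two rating categories, the reported coefficients are consistent with Fleiss' kappa, though the abstract does not name the statistic. A minimal sketch, assuming Fleiss' kappa and using an illustrative rating matrix rather than the study's data:

```python
# Minimal sketch of a Fleiss' kappa computation for 5 raters rating
# 10 questions on a 4-point scale. The ratings below are randomly
# generated placeholders, not the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = 10 questions, columns = 5 raters; values = category 0-3
ratings = np.random.default_rng(0).integers(0, 4, size=(10, 5))

table, _ = aggregate_raters(ratings)   # per-question category counts
kappa = fleiss_kappa(table)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Kappa values near zero, as reported here, indicate agreement no better than chance, which is worth keeping in mind when interpreting the aggregate rating percentages.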
Conclusion: In the majority of cases, ChatGPT 3.5 and Bard provided accurate, comprehensive, and accessible information in response to frequently asked questions about lumbar spine fusion. The role of LLMs in medical patient education requires further investigation and appropriate training to ensure high quality and reliability in healthcare communications.
