Article
Do large language models save resources when applying common-evidence appraisal tools?
Published: March 12, 2024
Background/research question: It is unknown whether large language models (LLMs) can facilitate time- and resource-intensive text-related processes in evidence appraisal. Our aim was to quantify the agreement of LLMs with human raters in appraisal tasks of different levels of complexity: assessing the reporting (PRISMA) and methodological rigor (AMSTAR) of systematic reviews, and the degree of pragmatism of clinical trials (PRECIS-2).
Methods: Three state-of-the-art LLMs (OpenAI’s GPT-3.5 and GPT-4; Anthropic’s Claude-2) assessed 112 systematic reviews (SRs) against PRISMA and AMSTAR, and 56 randomized controlled trials (RCTs) against PRECIS-2. Corresponding ratings from two independent human raters and their consensus were available. We quantified accuracy as agreement between the human consensus and (1) individual human raters, (2) individual LLMs, (3) combined LLMs, and (4) human-AI collaboration. A rating was marked as deferred (undecided) when the combined LLMs disagreed with each other, or when the human rater and the LLM disagreed.
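The combination and deferral rules amount to a simple decision procedure. Below is a minimal sketch in Python of how accuracy and the share of deferred items could be computed under those definitions; the data layout, function names, and the choice to score accuracy over decided items only are illustrative assumptions, not the authors' actual analysis code.

```python
from typing import Optional, Sequence, Tuple

DEFER = None  # marker for a deferred (undecided) item rating


def combine(rating_a: str, rating_b: str) -> Optional[str]:
    """Combine two ratings of the same item (two LLMs, or a human rater and an LLM).
    If both agree, keep the shared rating; if they disagree, defer the item."""
    return rating_a if rating_a == rating_b else DEFER


def accuracy_and_deferral(ratings: Sequence[Optional[str]],
                          consensus: Sequence[str]) -> Tuple[float, float]:
    """Accuracy against the human consensus (over decided items only, an assumption)
    and the share of deferred items."""
    decided = [(r, c) for r, c in zip(ratings, consensus) if r is not DEFER]
    deferred_share = 1 - len(decided) / len(consensus)
    acc = sum(r == c for r, c in decided) / len(decided) if decided else float("nan")
    return acc, deferred_share


# Hypothetical yes/no ratings for four checklist items of one systematic review
consensus = ["yes", "no", "yes", "yes"]
llm_1     = ["yes", "no", "no",  "yes"]
llm_2     = ["yes", "yes", "no", "yes"]
combined  = [combine(a, b) for a, b in zip(llm_1, llm_2)]  # combined-LLM rating
print(accuracy_and_deferral(combined, consensus))          # -> (0.666..., 0.25)
```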
Results: Individual human rater accuracy was 89% (2,686/3,024) for PRISMA (27 items × 112 SRs), 89% (1,096/1,232) for AMSTAR (11 items × 112 SRs), and 75% (379/504) for PRECIS-2 (9 items × 56 RCTs). Individual LLM accuracy was lower, ranging from 63% (GPT-3.5) to 70% (Claude-2) for PRISMA, from 53% (GPT-3.5) to 70% (GPT-4) for AMSTAR, and from 38% (GPT-4) to 55% (GPT-3.5) for PRECIS-2. Combined LLM ratings reached accuracies of 76–85% for PRISMA (9–67% of items inconsistent and thus deferred), 70–83% for AMSTAR (14–76% deferred), and 61–74% for PRECIS-2 (55–96% deferred). Combining a human rater with an individual LLM yielded the highest accuracies: 89–96% for PRISMA (25–41% deferred), 91–96% for AMSTAR (27–52% deferred), and 80–86% for PRECIS-2 (64–75% deferred).
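As a quick arithmetic check, the item totals and individual human rater accuracies quoted above follow directly from the stated counts (a short verification sketch, not part of the original analysis):

```python
# Denominators: 27 PRISMA items x 112 SRs, 11 AMSTAR items x 112 SRs, 9 PRECIS-2 items x 56 RCTs
assert 27 * 112 == 3024 and 11 * 112 == 1232 and 9 * 56 == 504

# Individual human rater accuracies, as reported
print(round(2686 / 3024 * 100))  # 89 (PRISMA)
print(round(1096 / 1232 * 100))  # 89 (AMSTAR)
print(round(379 / 504 * 100))    # 75 (PRECIS-2)
```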
Conclusion: Current LLMs alone appraised evidence substantially worse than humans. Pairing a first human rater with an LLM in a human-AI collaboration may reduce the second human rater's workload for assessing reporting (PRISMA) and methodological quality (AMSTAR), but not for more complex tasks such as rating the pragmatism of clinical trials (PRECIS-2).
Competing interests: RC2NB (Research Center for Clinical Neuroimmunology and Neuroscience Basel) is supported by Foundation Clinical Neuroimmunology and Neuroscience Basel. All authors declare no competing interests.