Refining AI-assisted abstract screening: ChatGPT’s performance and the impact of structured eligibility criteria
Published: 27 March 2025
Background/research question: Scoping reviews are labor-intensive, particularly during the initial screening phase. Given the exponential growth in the number of publications, the time spent conducting scoping reviews is expected to increase. While systematic reviews are frequently supported by machine learning tools, tools specifically tailored to the screening processes of scoping reviews are lacking. Large language models such as ChatGPT, however, have shown promising results in accelerating this process. This study analyses the impact of the structure and phrasing of eligibility criteria on ChatGPT's performance in screening abstracts for a scoping review on digital tools supporting interprofessional interactions.
Methods: We conducted a thematic analysis of ChatGPT 4.0’s explanations from an abstract classification exercise of the aforementioned scoping review (15,307 abstracts) and developed three refined sets of eligibility criteria (narrow, wide, and balanced). Using ChatGPT 4.0 with these three sets, we reran the abstract classification exercise and calculated performance metrics such as sensitivity, specificity, and accuracy against the gold standard of two independent human reviewers. Additionally, we combined decisions from the different eligibility sets using majority voting and human conflict resolution to assess their performance.
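To make the evaluation concrete, the sketch below shows how the reported metrics (sensitivity, specificity, accuracy) and a majority vote over the three criteria sets could be computed. This is an illustrative implementation under simple assumptions (binary include/exclude decisions), not the authors' actual code; the function names are hypothetical.

```python
def screening_metrics(predicted, gold):
    """Compare binary screening decisions (True = include abstract)
    against the human gold standard and return the three metrics."""
    tp = sum(p and g for p, g in zip(predicted, gold))          # correctly included
    tn = sum(not p and not g for p, g in zip(predicted, gold))  # correctly excluded
    fp = sum(p and not g for p, g in zip(predicted, gold))      # wrongly included
    fn = sum(not p and g for p, g in zip(predicted, gold))      # wrongly excluded
    sensitivity = tp / (tp + fn)      # share of relevant abstracts kept
    specificity = tn / (tn + fp)      # share of irrelevant abstracts dropped
    accuracy = (tp + tn) / len(gold)  # overall agreement with reviewers
    return sensitivity, specificity, accuracy


def majority_vote(narrow, wide, balanced):
    """Include an abstract if at least two of the three criteria sets agree."""
    return sum([narrow, wide, balanced]) >= 2
```

With decisions stored as lists of booleans per abstract, `screening_metrics` can be applied to each criteria set and to the majority-vote combination alike.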
Results: The analysis of the eligibility criteria revealed challenges in each category of the population, concept, and context framework: the complexity of healthcare provider interactions (population), ambiguities in defining digital tools (concept), and difficulties in identifying the healthcare setting (context). These findings informed three refined sets of eligibility criteria: the wide set achieved the highest sensitivity, while the narrow set had the highest specificity. Combining sets effectively balanced sensitivity and specificity, flagging ambiguous abstracts for manual review. The combination of the wide and narrow sets led to a lower overall workload than the other combinations.
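The wide-plus-narrow combination described above can be sketched as a simple conflict-resolution rule: abstracts on which both sets agree are decided automatically, and only disagreements are flagged for human review. This is a hypothetical illustration of the logic, not the study's implementation.

```python
def combine_wide_narrow(wide_decision, narrow_decision):
    """Return (decision, needs_manual_review) for one abstract.

    If the wide and narrow criteria sets agree, accept that decision
    automatically; otherwise flag the abstract for a human reviewer."""
    if wide_decision == narrow_decision:
        return wide_decision, False
    return None, True
```

Because the narrow set rarely includes what the wide set excludes, most disagreements are abstracts the wide set includes and the narrow set rejects, which is what keeps the manual-review workload low.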
Conclusion: Refining and structuring the eligibility criteria improved ChatGPT’s accuracy, with further gains from combining different sets. This highlights the importance of well-defined and structured eligibility criteria when employing large language models for screening purposes. However, further research is needed to address human oversight, trust, and transparency before their full integration into the review process.
Competing interests: N/A