gms | German Medical Science

Information Retrieval Meeting (IRM 2022)

10.06. - 11.06.2022, Cologne

When can we stop screening studies? A cross-institutional simulation study

Meeting Abstract

  • Ashley Elizabeth Muller - Norwegian Institute of Public Health, Norway
  • Kathryn Hopkins - National Institute for Health and Care Excellence, United Kingdom
  • Evangelos Kanoulas - University of Amsterdam, The Netherlands
  • Iain Marshall - King's College London, United Kingdom
  • Emma McFarlane - National Institute for Health and Care Excellence, United Kingdom
  • Alison O’Mara-Eves - University College London, United Kingdom
  • Mark Stevenson - University of Sheffield, United Kingdom
  • corresponding author James Thomas - University College London, United Kingdom

Information Retrieval Meeting (IRM 2022). Cologne, 10.-11.06.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. Doc22irm23

doi: 10.3205/22irm23, urn:nbn:de:0183-22irm236

Published: June 8, 2022

© 2022 Muller et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.



Introduction: Screening studies is one of the most resource-intensive phases of a systematic review: two reviewers independently assess potentially thousands of studies, though only a small fraction are relevant. Some review software packages contain a ranking algorithm that pushes the most relevant unscreened studies to the top of a reviewer’s screening list. If priority screening works effectively, relevant studies are identified and screened earliest, while irrelevant studies to be excluded are seen last (Figure 1). To optimise the use of priority screening and reduce manual screening workload, reviewers would ideally stop screening once the list transitions from relevant to irrelevant studies. A number of approaches have been proposed to identify a suitable “stopping point” in the ranking that minimises both the reviewing workload and the risk of missing relevant studies, but they have not been evaluated on a wide range of reviews. We evaluated whether these statistical stopping criteria can identify a point at which the trade-off between human effort and loss of recall is optimal.
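To make the idea concrete, the sketch below simulates priority screening on synthetic data. The abstract does not describe the ranking model or the stopping criteria under evaluation, so this toy stands in a made-up relevance score and a simple “stop after K consecutive exclusions” heuristic; every name and parameter here is illustrative only.

```python
import random

# Illustrative sketch only: the ranking model and stopping criteria studied
# in the paper are not specified in the abstract, so this toy uses a
# synthetic relevance score and a naive consecutive-exclusions rule.

def simulate_priority_screening(n_studies=1000, n_relevant=50,
                                k_stop=100, seed=0):
    rng = random.Random(seed)
    # Relevant studies receive higher scores on average, mimicking an
    # effective prioritisation model that ranks most includes near the top.
    studies = [(rng.gauss(1.0, 0.5), True) for _ in range(n_relevant)]
    studies += [(rng.gauss(0.0, 0.5), False)
                for _ in range(n_studies - n_relevant)]
    ranked = sorted(studies, key=lambda s: s[0], reverse=True)

    found = 0                 # relevant studies identified so far
    consecutive_excludes = 0  # run length of irrelevant studies seen
    screened = 0              # manual screening effort expended
    for _score, is_relevant in ranked:
        screened += 1
        if is_relevant:
            found += 1
            consecutive_excludes = 0
        else:
            consecutive_excludes += 1
        if consecutive_excludes >= k_stop:  # heuristic stopping point
            break

    recall = found / n_relevant    # fraction of includes identified
    effort = screened / n_studies  # fraction of the list screened
    return recall, effort

if __name__ == "__main__":
    recall, effort = simulate_priority_screening()
    print(f"recall={recall:.3f} after screening {effort:.0%} of the list")
```

With an effective ranking, recall approaches 1.0 well before the full list is screened; the gap between those two points is exactly what a principled stopping criterion tries to exploit.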

Methods: The National Institute for Health and Care Excellence, the Norwegian Institute of Public Health, University College London/EPPI Centre, and researchers from King's College London, the University of Sheffield, and the University of Amsterdam contributed datasets from 159 completed reviews. These reviews screened between 43 and more than 75,000 studies and covered a range of topics, from medical interventions to welfare policies. Using the prioritisation algorithm deployed in EPPI-Reviewer, we simulated the operation of priority screening after randomly sampling seed documents from included and excluded studies, with 10 simulated iterations per review to control for differences in seed documents. Six statistical stopping criterion algorithms developed by University College London, the University of Amsterdam, and the University of Sheffield were then applied to the simulations. Eight experimental binary settings were tested (see Table 1). Performance was assessed at four levels of target recall (0.90, 0.95, 0.99, and 1.0) across the following outcome metrics: recall, F-measure, work saved over sampling, and risk.
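For reference, recall, F-measure, and work saved over sampling (WSS) have conventional definitions in the screening-automation literature; the sketch below follows the common formulation WSS@r = (TN + FN)/N − (1 − r), treating everything below the stopping point as predicted irrelevant. The “risk” metric is not defined in this abstract and is omitted, and these functions are a standard reading rather than the authors’ exact implementation.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of relevant (included) studies found before stopping."""
    return tp / (tp + fn)

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over screening decisions."""
    precision = tp / (tp + fp)
    r = recall(tp, fn)
    return 2 * precision * r / (precision + r)

def wss(tn: int, fn: int, n_total: int, target_recall: float) -> float:
    """Work saved over sampling at recall level r: the fraction of the
    list left unscreened, minus the (1 - r) fraction that random-order
    screening would leave unscreened at the same recall."""
    return (tn + fn) / n_total - (1 - target_recall)

# Example: stopping after screening 3,000 of 10,000 studies, having found
# 95 of 100 includes (tp=95, fp=2905, fn=5, tn=6995):
#   wss(tn=6995, fn=5, n_total=10000, target_recall=0.95) -> 0.65
```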

Preliminary results and conclusion: In ongoing analyses, the stopping criterion algorithms perform slightly differently under the various experimental settings, but all have the potential to substantially reduce manual effort; the extent of the reduction depends on the target recall level. We will indicate the conditions under which each algorithm may be most appropriate.

Keywords: screening prioritisation, semi-automated screening, statistical stopping criteria, stopping rules