gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

Simulation-based selection of a propensity-score weighting approach to adjust for unexpected self-selection in a cohort study – results from the PRAIM study

Meeting Abstract

  • Nora Eisemann - Universität zu Lübeck, Lübeck, Germany
  • Stefan Bunk - Vara, Berlin, Germany
  • Hannah Baltus - Universität zu Lübeck, Lübeck, Germany
  • Christian Leibig - Vara, Berlin, Germany
  • Trasias Mukama - Vara, Berlin, Germany
  • Alexander Katalinic - Universität zu Lübeck, Lübeck, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 245

doi: 10.3205/24gmds067, urn:nbn:de:0183-24gmds0676

Published: 6 September 2024

© 2024 Eisemann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.



Text

Introduction: The PRAIM study is an observational non-inferiority study comparing the performance of AI-supported double reading with standard double reading (without AI) among women aged 50–69 years undergoing organised mammography screening at twelve screening sites in Germany. In the viewer, the AI system provided (1) “normal” tags marking a subset of examinations most likely to be unsuspicious and (2) localisation prompts that triggered reassessment of highly suspicious examinations that the radiologists had judged unsuspicious. The primary endpoint was the screen-detected breast cancer rate.

During data collection, it emerged that radiologists sometimes chose the viewer for the final report of an examination depending on the triaging prediction (normal/not-normal), which is visible in advance. Because the consensus conference is more convenient if all participating radiologists have used the same viewer, some radiologists tended to report only normal-tagged cases in the AI-supported viewer and switched to other viewers otherwise. This self-selection induces confounding, was not anticipated in the original study design, and could introduce severe bias.

Methods: A simulation study, reflecting the expected real situation as closely as possible, was conducted to identify a statistical method that successfully corrects for the self-selection bias.

Five scenarios with different degrees of self-selection bias (i.e. different probabilities of radiologists finalising the examination in the AI-supported viewer, given the AI normal-prediction) were considered.
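As a rough illustration of what one such scenario could look like, the following sketch draws a normal tag for each examination and then lets the probability of using the AI-supported viewer depend on that tag. The function names and all probabilities are hypothetical and are not taken from the study.

```python
import random

def simulate_viewer_choice(n, p_normal, p_ai_if_normal, p_ai_if_not_normal, seed=0):
    """Simulate, per examination, the AI normal tag and the viewer used.

    Returns a list of (is_normal, used_ai_viewer) boolean pairs.
    All probabilities are hypothetical inputs, not values from the study.
    """
    rng = random.Random(seed)
    exams = []
    for _ in range(n):
        is_normal = rng.random() < p_normal
        # Self-selection: the viewer choice depends on the visible tag.
        p_ai = p_ai_if_normal if is_normal else p_ai_if_not_normal
        used_ai = rng.random() < p_ai
        exams.append((is_normal, used_ai))
    return exams

def share_ai(exams, normal):
    """Share of examinations read in the AI viewer within one tag stratum."""
    stratum = [used_ai for is_normal, used_ai in exams if is_normal == normal]
    return sum(stratum) / len(stratum)

# Hypothetical scenario: the AI viewer is favoured for normal-tagged cases.
exams = simulate_viewer_choice(20000, p_normal=0.6,
                               p_ai_if_normal=0.7, p_ai_if_not_normal=0.3)
print(round(share_ai(exams, True), 2), round(share_ai(exams, False), 2))
```

Varying the two conditional probabilities changes the direction and strength of the self-selection; setting them equal corresponds to a scenario without self-selection.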

Four regression methods, all with a quasibinomial error distribution, were applied: regression without adjustment, with inverse propensity score weighting (IPW), with IPW after trimming of extreme propensity scores (PS), and with overlap PS weighting. The PS were estimated from the minimal adjustment set identified in a causal graph analysis (the radiologists assessing the examination and the normal-prediction). For each scenario, 1000 data sets were simulated. Risk ratios (RR) and power to detect non-inferiority were calculated.
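The three weighting schemes can be sketched as follows. This is a minimal illustration that assumes the propensity scores are already known and that replaces the quasibinomial regression by a plain ratio of weighted risks; the function names, the trimming threshold `alpha` and the toy records are hypothetical.

```python
def ipw_weight(z, e):
    """Inverse propensity weight: 1/e if z == 1, else 1/(1 - e)."""
    return 1.0 / e if z == 1 else 1.0 / (1.0 - e)

def trimmed_ipw_weight(z, e, alpha=0.1):
    """IPW weight, set to 0 when e lies outside [alpha, 1 - alpha]."""
    if e < alpha or e > 1.0 - alpha:
        return 0.0
    return ipw_weight(z, e)

def overlap_weight(z, e):
    """Overlap weight: 1 - e if z == 1, else e."""
    return 1.0 - e if z == 1 else e

def weighted_risk_ratio(records, weight_fn):
    """Ratio of weighted risks (z == 1 versus z == 0).

    records: iterable of (z, y, e) with z = 1 for the AI-supported viewer,
    y = 1 for a detected cancer and e = propensity score P(z = 1 | covariates).
    """
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for z, y, e in records:
        w = weight_fn(z, e)
        num[z] += w * y
        den[z] += w
    return (num[1] / den[1]) / (num[0] / den[0])

# Toy data (hypothetical): two strata with the same risk in both arms
# within each stratum, but very different chances of using the AI viewer.
records = (
    [(1, 1, 0.8)] * 4 + [(1, 0, 0.8)] * 4    # stratum A, AI viewer
    + [(0, 1, 0.8)] * 1 + [(0, 0, 0.8)] * 1  # stratum A, other viewer
    + [(1, 0, 0.2)] * 2                      # stratum B, AI viewer
    + [(0, 0, 0.2)] * 8                      # stratum B, other viewer
)

print(weighted_risk_ratio(records, lambda z, e: 1.0))  # no adjustment
print(weighted_risk_ratio(records, ipw_weight))
print(weighted_risk_ratio(records, trimmed_ipw_weight))
print(weighted_risk_ratio(records, overlap_weight))
```

In this toy data the unweighted ratio is about 4 although the stratum-specific risks are identical in both arms, while the weighted ratios are all about 1. With `alpha=0.1` no record is trimmed here, so the trimmed and untrimmed IPW results coincide.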

Results: The regression with overlap PS weighting estimated the RR without relevant bias on average. Its power to reject inferiority of AI was at least 79·7% across all considered scenarios (94·6% for the reading behaviour observed in preliminary usage data), independent of the direction and extent of the simulated self-selection bias. The IPW trimming approach was also nearly unbiased but had wider confidence intervals and thus lower power. Both other approaches were clearly biased.
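Power in a simulation of this kind is typically read off as the share of simulated data sets in which inferiority is rejected. The sketch below assumes that inferiority is rejected when the lower confidence bound of the RR exceeds a non-inferiority margin; the margin, the direction of the test and the example bounds are hypothetical and are not stated in the abstract.

```python
def noninferiority_power(lower_bounds, margin):
    """Share of simulated data sets whose lower CI bound for the RR exceeds the margin."""
    rejected = sum(1 for lb in lower_bounds if lb > margin)
    return rejected / len(lower_bounds)

# Hypothetical lower confidence bounds from four simulated data sets.
example_bounds = [0.95, 0.85, 0.92, 0.99]
print(noninferiority_power(example_bounds, margin=0.9))
```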

Discussion: IPW is a standard method to deal with bias in intervention allocation in cohort studies. If overlap in the baseline variables of the intervention groups is poor, PS weights can be extreme and result in biased estimates. An ad-hoc solution is the trimming of extreme PS. A more recent development is overlap weighting, which targets the population with the most overlap on observed covariates.

However, the approaches target different estimands: the unadjusted and the IPW regressions target the average treatment effect in the entire population (ATE), the trimmed IPW regression targets the average treatment effect in the trimmed population, and the overlap-weighted regression targets it in the overlap population (ATO).
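One common way to write these target populations down is through a tilting function h(X) applied to the covariate distribution. The notation below is an assumption for illustration and does not appear in the abstract: Z = 1 denotes the AI-supported viewer, e(X) = P(Z = 1 | X) the propensity score, Y(z) the potential outcome under viewer z, and α a trimming threshold.

```latex
% Estimand on the risk-ratio scale and the weights that target it
\[
  \mathrm{RR}_h = \frac{\mathbb{E}[\,h(X)\,Y(1)\,]}{\mathbb{E}[\,h(X)\,Y(0)\,]},
  \qquad
  w(Z, X) = \frac{h(X)\,Z}{e(X)} + \frac{h(X)\,(1 - Z)}{1 - e(X)}
\]
% Choice of tilting function for each weighting approach
\[
  h(X) =
  \begin{cases}
    1 & \text{IPW (ATE)} \\
    \mathbf{1}\{\alpha \le e(X) \le 1 - \alpha\} & \text{trimmed IPW} \\
    e(X)\,(1 - e(X)) & \text{overlap weighting (ATO)}
  \end{cases}
\]
```

For the overlap choice the weights simplify to 1 − e(X) for examinations in the AI-supported viewer and e(X) for the others, so they are bounded between 0 and 1 and cannot become extreme.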

Conclusion: The overlap weighting approach had the smallest bias and the largest power and was chosen for the main analysis.

Competing interests: The study was funded by Vara. SB, TM, and CL are current employees of Vara with stock options as part of the standard compensation package. AK received a payment from Vara for general consulting and speaker fees.

The authors declare that a positive ethics committee vote has been obtained.