gms | German Medical Science

Gesundheit – gemeinsam (Health – together). Joint conference of the German Society for Medical Informatics, Biometry and Epidemiology (GMDS), the German Society for Social Medicine and Prevention (DGSMP), the German Society for Epidemiology (DGEpi), the German Society for Medical Sociology (DGMS), and the German Society for Public Health (DGPH)

08.09. - 13.09.2024, Dresden

Multiple imputation in clinical prediction modeling: a systematic review of methodology

Meeting Abstract

  • Sinclair Awounvo - Institute of Medical Biometry, University of Heidelberg, Heidelberg, Germany
  • Manuel Feißt - Institute of Medical Biometry, University of Heidelberg, Heidelberg, Germany
  • Meinhard Kieser - Institute of Medical Biometry, University of Heidelberg, Heidelberg, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 819

doi: 10.3205/24gmds098, urn:nbn:de:0183-24gmds0987

Published: September 6, 2024

© 2024 Awounvo et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Background: Missing data are common in medical data sets. Multiple imputation (MI) has become increasingly popular for handling data that are considered to be missing completely at random (MCAR) or missing at random (MAR). However, applying MI when developing a clinical prediction model can be challenging, because clinical prediction modeling (CPM) should also include an internal model validation (IMV) step. In the CPM context, IMV is needed to obtain a first assessment of how well a model generalizes. This raises the question of how best to combine MI with IMV with respect to predictive performance, complexity, and time requirements.
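The two ways of combining MI with IMV that the review distinguishes (MI prior to IMV, and MI within the IMV process) can be sketched as follows. This is a minimal illustration under strong assumptions, not the procedure of any reviewed article: the toy data, the function names, the single predictor, and the use of k-fold cross-validation as the IMV method are all hypothetical, and the "imputation" is a crude random draw from observed values that ignores the outcome, standing in for one draw of a real MI procedure.

```python
import random
import statistics

def observed(rows):
    """Observed (non-missing) predictor values; the donor pool for imputation."""
    return [x for x, _ in rows if x is not None]

def impute(rows, donors, rng):
    """Fill each missing x (None) with a random draw from the donor pool.
    Crude stand-in for one draw of a real MI procedure (assumption)."""
    return [(x if x is not None else rng.choice(donors), y) for x, y in rows]

def fit(rows):
    """Least-squares line y = a + b*x; returns (a, b)."""
    mx = statistics.fmean([x for x, _ in rows])
    my = statistics.fmean([y for _, y in rows])
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    b = sum((x - mx) * (y - my) for x, y in rows) / sxx
    return my - b * mx, b

def mse(model, rows):
    """Mean squared prediction error of the fitted line on the given rows."""
    a, b = model
    return statistics.fmean([(y - (a + b * x)) ** 2 for x, y in rows])

def folds(n, k):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    for f in range(k):
        yield ([i for i in range(n) if i % k != f],
               [i for i in range(n) if i % k == f])

def mi_prior_to_imv(rows, m, k, rng):
    """Strategy 1: impute the full data set m times, then validate each copy."""
    scores = []
    for _ in range(m):
        # Donor pool includes rows that later serve as test rows.
        complete = impute(rows, observed(rows), rng)
        for train, test in folds(len(rows), k):
            model = fit([complete[i] for i in train])
            scores.append(mse(model, [complete[i] for i in test]))
    return statistics.fmean(scores)

def mi_within_imv(rows, m, k, rng):
    """Strategy 2: inside every fold, impute using training rows only."""
    scores = []
    for train, test in folds(len(rows), k):
        train_rows = [rows[i] for i in train]
        test_rows = [rows[i] for i in test]
        donors = observed(train_rows)  # learned on the training part only
        for _ in range(m):
            model = fit(impute(train_rows, donors, rng))
            scores.append(mse(model, impute(test_rows, donors, rng)))
    return statistics.fmean(scores)

# Hypothetical toy data: y depends linearly on x, about 20% of x set to missing.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.gauss(0, 1)
    y = 2 * x + rng.gauss(0, 1)
    data.append((None if rng.random() < 0.2 else x, y))

print("MI prior to IMV, mean CV error:", mi_prior_to_imv(data, m=10, k=5, rng=rng))
print("MI within IMV,   mean CV error:", mi_within_imv(data, m=10, k=5, rng=rng))
```

In the first function the imputation sees the whole data set once per imputation, which is simpler and needs only m imputation runs. In the second, the imputation is redone in every validation iteration, so test rows never inform the imputation applied to the training rows, at the cost of more computation.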

Methods: We searched PubMed, Web of Science, and MathSciNet for “multiple imputation” and “validation” and reviewed all articles published up to December 2023. We categorized them by the strategy used to combine MI with IMV in the CPM context.

Results: Of the 683 articles identified and reviewed, 553 were excluded as irrelevant and 22 mainly because they lacked information about the strategy employed. In 92 of the remaining 108 articles (85%), authors chose to perform MI prior to IMV. MI was performed within the IMV process in only 16 of 108 articles (15%). Further, the average minimum and maximum missing rates over all variables in a data set were 0.8% (IQR: 0.3-2.89%) and 29% (IQR: 14-45%), respectively. On average, authors imputed data sets 10 times (IQR: 10-20), and in only 7 of 108 cases (6.5%) were the data sets directly combined after MI. Moreover, inclusion of the outcome in MI was not clearly reported in 66 of the 108 articles (61%). In the remaining articles, authors more often included the outcome in MI (36 of 108; 33%) than not (6 of 108; 5.6%). Regarding IMV methods, the most frequently used were bootstrapping (62 of 110; 56%), cross-validation (CV; 27 of 110; 25%), and sample splitting (20 of 110; 18%). Authors often opted for 200, 500 or 1000 iterations when using the bootstrap (42 of 62; 68%), 5 or 10 iterations when using CV (19 of 27; 70%), and a train:test split ratio of 66.7%, 70% or 80% when using sample splitting (11 of 20; 55%). The systematic review was completed with a quality control of the results.
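The article flow and the reported proportions can be recomputed from the counts given in the Results. The short check below uses only those counts; the variable names and the helper pct are illustrative.

```python
def pct(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Article flow: identified minus the two exclusion groups.
included = 683 - 553 - 22

# Counts as reported in the Results, with their denominators.
reported = {
    "MI prior to IMV": (92, included),
    "MI within IMV": (16, included),
    "data sets directly combined": (7, included),
    "outcome inclusion not reported": (66, included),
    "outcome included in MI": (36, included),
    "outcome not included in MI": (6, included),
    "bootstrapping": (62, 110),
    "cross-validation": (27, 110),
    "sample split": (20, 110),
}

print("included articles:", included)
for label, (part, whole) in reported.items():
    print(f"{label}: {part}/{whole} = {pct(part, whole)}%")
```

The recomputed values agree with the rounded percentages in the text. The three listed IMV methods account for 109 of the 110 method uses; the abstract does not itemize the remaining one, and the denominator of 110 rather than 108 presumably reflects articles that used more than one IMV method.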

Conclusions: Our review revealed that 85% of the articles performed MI prior to IMV. Possible reasons are the simplicity of this strategy and its potentially lower time expenditure compared with applying MI within the IMV process. However, it remains unclear whether authors are aware of the potential influence of the chosen strategy on the predictive performance of downstream models. Given the scarcity of literature on this topic, this question will be addressed in a future comprehensive simulation study.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

