Multiple imputation in clinical prediction modeling: a systematic review of methodology
Published: 6 September 2024
Background: Missing data are common in medical data sets. Multiple imputation (MI) has become increasingly popular for handling data that are considered missing completely at random (MCAR) or missing at random (MAR). However, applying MI when developing a clinical prediction model can be challenging, because clinical prediction modeling (CPM) should also include an internal model validation (IMV) step. In the CPM context, IMV provides a first estimate of how well a model generalizes. This raises the question of which strategy for combining MI with IMV gives the best results in terms of predictive performance, complexity, and time resources.
Methods: We searched PubMed, Web of Science, and MathSciNet for “multiple imputation” and “validation” and reviewed all articles published until 12/23. We categorized them by the strategy used to combine MI with IMV in the CPM context.
Results: Of the 683 articles identified and reviewed, 553 were excluded as irrelevant and 22 mainly because they lacked information about the strategy used. In 92 of the remaining 108 articles (85%), the authors performed MI prior to IMV. MI was performed within the IMV process in only 16 of 108 articles (15%). Further, the average minimum and maximum missing rates over all variables in a data set were 0.8% (IQR: 0.3-2.89%) and 29% (IQR: 14-45%), respectively. On average, authors imputed data sets 10 times (IQR: 10-20), and in only 7 of 108 cases (6.5%) were the data sets directly combined after MI. Moreover, whether the outcome was included in MI was not clearly reported in 66 of the 108 articles (61%). In the remaining articles, authors more often included the outcome in MI (36 of 108 (33%)) than not (6 of 108 (5.6%)). The most frequently used IMV methods were bootstrapping (62 of 110 (56%)), cross-validation (CV; 27 of 110 (25%)), and sample split (20 of 110 (18%)). Authors often opted for 200, 500 or 1000 iterations when using bootstrap (42 of 62 (68%)), 5 or 10 iterations when using CV (19 of 27 (70%)), and a train:test split ratio of 66.7%, 70% or 80% when using sample split (11 of 20 (55%)). The systematic review was completed with a quality control of the results.
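The article flow and the reported proportions can be re-derived from the counts given above. The following is only a small arithmetic check in Python; the variable names and the helper `pct` are illustrative and not part of the review. Note that the IMV method counts use a denominator of 110 rather than 108. The abstract does not explain this; one possible reading (an assumption) is that some articles used more than one IMV method.

```python
# Counts taken from the Results paragraph; names are illustrative.
identified = 683
excluded_irrelevant = 553
excluded_unclear_strategy = 22
included = identified - excluded_irrelevant - excluded_unclear_strategy

def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole

print(included)                                  # 108
print(round(pct(92, included)))                  # 85  (MI prior to IMV)
print(round(pct(16, included)))                  # 15  (MI within IMV)
print(round(pct(7, included), 1))                # 6.5 (data sets combined directly)
print(round(pct(66, included)))                  # 61  (outcome inclusion unclear)

# IMV methods are reported against 110, not 108 (reason not stated in the abstract).
imv_counts = {"bootstrap": 62, "cv": 27, "sample_split": 20}
imv_pct = {name: round(pct(n, 110)) for name, n in imv_counts.items()}
print(imv_pct)  # {'bootstrap': 56, 'cv': 25, 'sample_split': 18}
```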
Conclusions: Our review revealed that 85% of the articles opted for performing MI prior to IMV. Possible reasons are the simplicity of this strategy and its potentially lower time expenditure compared to applying MI within the IMV process. However, it remains unclear whether the authors are aware of the potential influence of the chosen strategy on the predictive performance of downstream models. Given the scarcity of literature on this topic, this question will be the objective of a future comprehensive simulation study.
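To make the structural difference between the two strategies concrete, here is a minimal Python sketch. It rests on several assumptions that do not come from the review: a single numeric variable, a stochastic draw from a normal distribution as a stand-in for a real MI procedure, k-fold CV as the IMV method, and hypothetical function names throughout. It also assumes that "MI within IMV" means the imputation model is estimated from the training part of each split only. It counts imputation runs to illustrate why imputing within IMV may cost more time.

```python
import random
import statistics

def fit_imputer(values):
    """Estimate mean and SD from the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return statistics.mean(observed), statistics.stdev(observed)

def impute(values, params, rng):
    """Replace each None with a random draw from N(mean, sd) (toy stand-in for MI)."""
    mean, sd = params
    return [rng.gauss(mean, sd) if v is None else v for v in values]

def k_folds(n, k):
    """Yield (train_idx, test_idx) pairs for k interleaved folds."""
    for f in range(k):
        test = list(range(f, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def mi_prior_to_imv(x, m, k, rng):
    """Impute the full data set m times, then split each completed copy."""
    splits, runs = [], 0
    for _ in range(m):
        # The imputer sees all rows, including rows later used for testing.
        completed = impute(x, fit_imputer(x), rng)
        runs += 1
        for train, test in k_folds(len(x), k):
            splits.append(([completed[i] for i in train],
                           [completed[i] for i in test]))
    return splits, runs

def mi_within_imv(x, m, k, rng):
    """Split first; fit the imputer on the training rows only, m times per fold."""
    splits, runs = [], 0
    for train, test in k_folds(len(x), k):
        x_train = [x[i] for i in train]
        x_test = [x[i] for i in test]
        for _ in range(m):
            params = fit_imputer(x_train)  # test rows never inform the imputer
            splits.append((impute(x_train, params, rng),
                           impute(x_test, params, rng)))
            runs += 1
    return splits, runs

rng = random.Random(0)
# Simulated variable with roughly 20% of values missing (illustrative data only).
x = [rng.gauss(50, 10) if rng.random() > 0.2 else None for _ in range(100)]
splits_a, runs_a = mi_prior_to_imv(x, m=10, k=5, rng=rng)
splits_b, runs_b = mi_within_imv(x, m=10, k=5, rng=rng)
print(runs_a, runs_b)  # 10 50
```

Both functions yield the same number of train/test pairs, but in this sketch imputing within IMV needs k times as many imputation runs (50 versus 10). With bootstrapping at 200 to 1000 iterations, the multiplier would be correspondingly larger. The sketch says nothing about which strategy gives better predictive performance.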
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.