gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Choosing an inclusion frequency for variable selection procedures performed separately on multiple imputed data sets

Meeting Abstract

  • Markus Böhm - Universitätsklinikum Jena, Jena, Germany
  • Thomas Lehmann - Universitätsklinikum Jena, Jena, Germany
  • Peter Schlattmann - Universitätsklinikum Jena, Jena, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 182

doi: 10.3205/20gmds293, urn:nbn:de:0183-20gmds2932

Published: February 26, 2021

© 2021 Böhm et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Selecting an appropriate subset from a given set of potentially clinically relevant predictor variables is challenging. The task becomes even more demanding when the data set contains missing values [1]. A complete-case analysis has disadvantages for prediction and inference, and it discards much information compared with the available alternatives [2], [3]. The established alternatives can be grouped into three major classes: “Majority”, “Stack” and “Wald” [4]. We investigate the “Majority” approach, in which the final model is chosen on the basis of an inclusion frequency. The inclusion frequency describes how often a variable is selected by the chosen method across the imputed data sets. The threshold is often set either to the minimal observed frequency (every occurrence counts) or fixed at 50% or 100% of the imputed data sets. In this work we investigate different choices of this inclusion frequency.
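The role of the inclusion frequency in the “Majority” approach can be illustrated with a minimal sketch. The variable names and the per-imputation selections below are hypothetical and serve only to show how the three common thresholds (every occurrence, 50%, 100%) lead to different final models; they are not taken from the study.

```python
from collections import Counter


def inclusion_frequencies(selections):
    """Share of imputed data sets in which each variable was selected."""
    m = len(selections)
    counts = Counter(v for sel in selections for v in set(sel))
    return {v: c / m for v, c in counts.items()}


def majority_model(selections, threshold):
    """Variables whose inclusion frequency is at least `threshold`."""
    freq = inclusion_frequencies(selections)
    return sorted(v for v, f in freq.items() if f >= threshold)


# Hypothetical selections from m = 4 imputed data sets (illustration only).
selections = [
    {"age", "bmi", "sbp"},
    {"age", "bmi"},
    {"age", "sbp", "crp"},
    {"age", "bmi"},
]

freq = inclusion_frequencies(selections)
# "Every occurrence counts": threshold equals the minimal observed frequency.
any_occurrence = majority_model(selections, min(freq.values()))
half = majority_model(selections, 0.5)      # selected in at least 50% of sets
all_sets = majority_model(selections, 1.0)  # selected in every imputed set
```

In this toy example the 100% threshold keeps only `age`, the 50% threshold adds `bmi` and `sbp`, and the minimal threshold keeps all four variables.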

Methods: To this end, we simulate data sets, each with one dependent variable and several independent variables. In each such “original” data set we introduce missing values in the predictor variables under a missing at random (MAR) nonresponse mechanism [5]. The resulting incomplete data set is handled with multiple imputation by chained equations [4], which yields a previously fixed number of imputed data sets. For every simulated data set, we fit an elastic net (EN) regression on each imputed data set. The tuning parameters (α, λ) of each EN regression are chosen by cross-validation on the respective imputed data set. The variables with non-zero coefficients across all EN regressions are summarized in a frequency table, which yields several “levels” of the inclusion frequency mentioned above. We compare these levels as follows. The “recalibration by selected predictors” proposed in [1] essentially means refitting the model by unpenalized maximum likelihood on each imputed data set. Accordingly, for each level we refit the corresponding model on every imputed data set. We then choose the frequency level with the minimal mean BIC and compare it with the standard levels mentioned above, using the MSE and the coverage rates of the coefficients in each simulation setting. Finally, keeping Occam's razor in mind, we suggest a preferable frequency level.
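The level comparison by mean BIC can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the outcome is assumed to be continuous, so that the unpenalized maximum likelihood refit is an ordinary least squares fit with a Gaussian BIC; the imputation and elastic net steps are not carried out, and the "imputed" data sets and the selection matrix `selected` are hypothetical stand-ins. The function names `gaussian_bic` and `choose_level` are likewise made up for this sketch.

```python
import numpy as np


def gaussian_bic(X, y):
    """BIC (up to an additive constant) of an unpenalized linear model with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus error variance
    return n * np.log(rss / n) + k * np.log(n)


def choose_level(imputed, selected):
    """Pick the inclusion-frequency level with the minimal mean BIC.

    imputed:  list of (X, y) pairs, one per imputed data set
    selected: boolean array of shape (m, p); True where the EN coefficient is non-zero
    """
    freq = selected.mean(axis=0)           # inclusion frequency per variable
    levels = sorted(set(freq[freq > 0]))   # observed frequency levels
    mean_bic = {}
    for level in levels:
        cols = np.flatnonzero(freq >= level)  # variables kept at this level
        bics = [gaussian_bic(X[:, cols], y) for X, y in imputed]
        mean_bic[float(level)] = float(np.mean(bics))
    best = min(mean_bic, key=mean_bic.get)
    return best, mean_bic


# Toy data (assumption): three true predictors, two pure noise variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=200)

# Stand-in for m = 3 imputed data sets: slightly perturbed copies, not real imputations.
imputed = [(X + 0.05 * rng.normal(size=X.shape), y) for _ in range(3)]

# Hypothetical EN selections per imputed data set (rows) and variable (columns).
selected = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
], dtype=bool)

best, mean_bic = choose_level(imputed, selected)
```

In a full pipeline, `imputed` would come from a chained-equations imputation and `selected` from cross-validated elastic net fits on each imputed data set. The chosen level could then be compared with the fixed 50% and 100% levels via the MSE and coverage rates of the refitted coefficients.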

Results: Our preliminary results show a tendency towards the most parsimonious model, i.e. a frequency level of 100%. Starting with small samples and only univariate missing data, we considered 500 data sets, each with 300 observations, 32 variables and a total of 4% missing values. Because of the small number of missing values, the MSEs and coverage rates remain comparable across the frequency levels.

Conclusion: There are first indications that the most parsimonious model with respect to the obtained frequency levels is often preferable. We are currently investigating further simulation scenarios in order to give general recommendations on which frequency level to choose.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Thao LTP, Geskus R. A comparison of model selection methods for prediction in the presence of multiply imputed data. Biom J. 2019 Mar;61(2):343-356. DOI: 10.1002/bimj.201700232
2. Chen Q, Wang S. Variable selection for multiply-imputed data with application to dioxin exposure study. Stat Med. 2013 Sep 20;32(21):3646-59. DOI: 10.1002/sim.5783
3. Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med. 2008 Jul 30;27(17):3227-46. DOI: 10.1002/sim.3177
4. van Buuren S. Flexible Imputation of Missing Data. 2nd edition. Taylor & Francis; 2018.
5. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd edition. Wiley & Sons; 2002.