gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

A hybrid random forest variable selection approach

Meeting Abstract

  • Cesaire Joris Kuete Fouodo - Universität zu Lübeck, Lübeck, Germany
  • Inke R. König - Universität zu Lübeck, Lübeck, Germany
  • Silke Szymczak - Universität zu Lübeck, Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 69

doi: 10.3205/22gmds081, urn:nbn:de:0183-22gmds0812

Published: August 19, 2022

© 2022 Kuete Fouodo et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Random forests (RFs) have been shown to perform well for both classification and regression problems in high-dimensional settings and are also effective for identifying relevant predictor variables. The Vita testing procedure is a fast RF variable selection method that yields a P-value for each predictor variable. However, depending on the chosen significance threshold, results can be unstable or contain an increased number of false-positive findings. An alternative is the iterative Boruta approach, which is more powerful but time-consuming, especially with high-dimensional data sets. One of the main differences between Boruta and Vita is that the Boruta approach requires extending the original data set with so-called shadow variables, obtained by permuting the original predictor variables. In contrast, the Vita method works with the original data set only: it mirrors the non-positive variable importance estimates around zero to create an empirical null distribution, which is then used to estimate the P-values of the predictor variables.
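The contrast between the two null constructions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names add_shadow_variables and mirrored_null_p_values are hypothetical, the importance scores are assumed to come from an already fitted random forest, and taking the P-value as the fraction of null values at least as large as the observed importance is an assumption about how the empirical null distribution is used.

```python
import numpy as np


def add_shadow_variables(X, rng):
    """Boruta-style extension (sketch): append one permuted copy per predictor.

    Each column is shuffled independently, which breaks its association
    with the outcome while keeping its marginal distribution.
    """
    X = np.asarray(X)
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    return np.hstack([X, shadows])  # twice as many columns as X


def mirrored_null_p_values(importances):
    """Vita-style P-values (sketch) from variable importance estimates.

    The null sample consists of the negative importances, the zero
    importances, and the negative importances mirrored around zero.
    """
    imp = np.asarray(importances, dtype=float)
    negative = imp[imp < 0]
    null = np.concatenate([negative, imp[imp == 0], -negative])
    if null.size == 0:
        raise ValueError("no non-positive importances: null is empty")
    null_sorted = np.sort(null)
    # Assumed definition: share of null values >= the observed importance.
    n_at_least = null.size - np.searchsorted(null_sorted, imp, side="left")
    return n_at_least / null.size
```

The first function doubles the number of columns the forest has to process, whereas the second only post-processes importance scores that are already available, which is where the difference in runtime comes from.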

We propose a hybrid approach combining the Vita and Boruta methods. To reduce the runtime of the Boruta method, we avoid extending the original data set by applying the idea of the Vita approach at each iteration. We conduct simulation studies based on both theoretical and experimental data sets to compare the three procedures. The first simulation setting, based on theoretical distributions, aims to mimic common genomic data sets. It includes several thousand predictor variables that are correlated in a block structure, such that only a small number of correlated predictor variables are predictive of the quantitative outcome. In the second setting, simulations are based on experimental gene expression data with newly simulated dichotomous outcomes. The three RF testing procedures (Vita, Boruta and hybrid) are compared using different evaluation criteria. For illustration, the three methods are also applied to the experimental data sets using the original target variables.
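To make the first simulation setting more concrete, the following Python sketch generates block-correlated predictors with a quantitative outcome that depends on a few correlated variables. It only illustrates the general design: the function name simulate_block_data, the normal distributions, the linear outcome model, and all parameter values (block size, correlation, number of predictive variables, effect size) are assumptions and are not taken from the abstract. The defaults are kept small; a setting with several thousand predictors would need a larger n_blocks.

```python
import numpy as np


def simulate_block_data(n_samples=100, n_blocks=50, block_size=10,
                        rho=0.8, n_causal=5, effect=1.0, seed=0):
    """Simulate predictors correlated within blocks and a quantitative outcome.

    Variables in the same block share a latent factor, giving a pairwise
    correlation of about rho; variables in different blocks are independent.
    Only the first n_causal variables of the first block affect the outcome.
    """
    if n_causal > block_size:
        raise ValueError("n_causal must not exceed block_size")
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_samples, n_blocks, 1))
    noise = rng.standard_normal((n_samples, n_blocks, block_size))
    # sqrt(rho) * shared + sqrt(1 - rho) * noise has unit variance and
    # within-block correlation rho.
    X = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    X = X.reshape(n_samples, n_blocks * block_size)
    causal = np.arange(n_causal)  # column indices of predictive variables
    y = effect * X[:, causal].sum(axis=1) + rng.standard_normal(n_samples)
    return X, y, causal
```

Because the predictive variables sit in one correlated block, the remaining variables of that block are associated with the outcome without entering the outcome model, which is one reason such designs are demanding for variable selection methods.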

Results show that the hybrid approach is a good compromise between the two underlying methods. It is more stable than Vita, considerably faster than Boruta, and leads to fewer false-positive findings. However, power is reduced compared to the optimal approach. Applying the three methods to the experimental data sets confirms our findings.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.