Article
Efficient permutation testing of variable importance measures in machine learning
Published: September 15, 2023
Text
Introduction: Variable importance measures (VIMPs) are a popular means of assessing the relevance of a predictor variable in a prediction model. They are particularly useful for gaining insight into machine learning models, which are often referred to as black boxes. There have also been many attempts to assess the statistical significance of VIMPs through hypothesis testing, e.g. to perform variable selection or to identify prognostic and predictive factors. Especially for Random Forests (RF), which serve as the application example here, this topic remains a subject of ongoing research. Heuristic approaches to parametric testing have been proposed, but they often rely on distributional assumptions derived from empirical evidence. More recently, formal tests have been derived analytically, but these can be computationally expensive or even infeasible in practice. In our own earlier work, nonparametric permutation testing was proposed as a very general and distribution-free approach that can be applied to any type of model and VIMP [1]. However, it shares the problem of limited computational feasibility, especially with computationally expensive prediction models or VIMPs, or with Big Data.
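As a rough illustration of the conventional permutation test mentioned above, the following Python sketch permutes one predictor a fixed number of times and compares the resulting importance values with the observed one. It is a simplified sketch under stated assumptions, not the procedure of [1]: the importance measure `abs_correlation` is a cheap stand-in chosen so the example runs with the standard library only, and the function names, the sample data, and the default of 500 permutations are all illustrative. In a realistic setting, each evaluation of the importance would involve refitting the prediction model, which is where the computational cost arises.

```python
import random


def abs_correlation(x, y):
    """Stand-in importance measure (assumption): absolute Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, importance, n_perm=500, seed=0):
    """Monte Carlo permutation test with a fixed number of permutations."""
    rng = random.Random(seed)
    observed = importance(x, y)
    exceed = 0
    x_perm = list(x)
    for _ in range(n_perm):
        rng.shuffle(x_perm)  # break any association between x and y
        if importance(x_perm, y) >= observed:
            exceed += 1
    # The +1 terms count the observed value as one of the permutations.
    return (exceed + 1) / (n_perm + 1)


# Illustrative data: one informative predictor and one pure-noise predictor.
data_rng = random.Random(1)
x_signal = [data_rng.gauss(0, 1) for _ in range(100)]
y = [2.0 * v + data_rng.gauss(0, 1) for v in x_signal]
x_noise = [data_rng.gauss(0, 1) for _ in range(100)]

p_signal = permutation_p_value(x_signal, y, abs_correlation)
p_noise = permutation_p_value(x_noise, y, abs_correlation)
print(p_signal, p_noise)
```

Every call spends all `n_perm` evaluations regardless of how clear the outcome already is, which is the cost that a sequential procedure tries to avoid.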
Methods: To address the feasibility issue of conventional permutation testing, we propose the use of sequential permutation testing and sequential p-value estimation [2]. To demonstrate the practicality and relevance of our approach, we use RF and its popular permutation VIMP, both of which are computationally expensive. Several simulation studies were performed to investigate whether the theoretical properties of statistical tests hold when sequential methods are applied. The Pima Indians Diabetes Database was used to investigate the numerical stability of the methods in a well-known setting. An additional application to data from a SARS-CoV-2 diagnostic study was used to illustrate the potentially large savings in computational cost.
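The idea of sequential p-value estimation can be sketched in a few lines of Python. The example below follows the classical early-stopping scheme of Besag and Clifford (1991) as one well-known instance of the technique; whether it matches the procedures of [2] in every detail is an assumption. The function name `sequential_p_value`, the threshold `h=8`, the cap of 500 permutations, and the standard normal null distribution are illustrative choices. The callable `draw_null_stat` stands in for "permute the predictor, refit the model, and recompute the VIMP".

```python
import random


def sequential_p_value(observed, draw_null_stat, h=8, max_perm=500):
    """Sequential Monte Carlo p-value with early stopping.

    Draws null statistics one at a time and stops as soon as h of them
    are at least as large as the observed statistic, or after max_perm
    draws. Returns (p_value_estimate, permutations_used).
    """
    exceed = 0
    for used in range(1, max_perm + 1):
        if draw_null_stat() >= observed:
            exceed += 1
            if exceed == h:
                # Early stop: h exceedances among the first `used` draws.
                return h / used, used
    # Cap reached: usual Monte Carlo estimate with the +1 correction.
    return (exceed + 1) / (max_perm + 1), max_perm


# Illustrative null distribution (assumption): standard normal statistics.
rng = random.Random(42)


def draw():
    return rng.gauss(0.0, 1.0)


# An unremarkable observed value stops early with a large p-value estimate.
p_null, used_null = sequential_p_value(0.0, draw)
# An extreme observed value runs to the cap and yields a small estimate.
p_alt, used_alt = sequential_p_value(6.0, draw)
print(p_null, used_null, p_alt, used_alt)
```

The savings come from the non-significant case: when the observed value is unremarkable, exceedances accumulate quickly and only a small fraction of the maximum number of permutations is ever computed.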
Results: The theoretical properties of the tests held in the simulation studies: the type-I error probability was controlled at the nominal level, and high power was maintained (≥97% relative to conventional permutation testing). Considerably fewer permutations were required (e.g. ≤40 instead of a maximum of 500 under H0 in the simulation studies, and 18.6% of 500 in the application study). The numerical stability of the results was problematic for variables of “borderline” significance, but could be improved by reducing the additional variability introduced by model building and VIMP estimation.
Discussion: The sequential methods showed error control and power that were almost as good as those of conventional permutation testing. They can therefore be recommended for assessing the statistical significance of VIMPs at considerably reduced computational cost. When RF is used as the prediction model, a large number of trees should be used to obtain stable results. Although the permutation VIMP of RF has been used here as a relevant example, the proposed methods can be applied to any kind of prediction model or VIMP.
Conclusion: Theoretically sound sequential p-value estimation and permutation testing of VIMPs are possible and less computationally expensive than conventional permutation testing. For complex prediction models or VIMPs, and for Big Data, the proposed methods can lead to considerable savings. An implementation is provided in the R package ‘rfvimptest’ on the Comprehensive R Archive Network (CRAN).
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Hapfelmeier A, Ulm K. A new variable selection approach using random forests. Computational Statistics & Data Analysis. 2013;60:50-69.
2. Hapfelmeier A, Hornung R, Haller B. Efficient permutation testing of variable importance measures by the example of random forests. Computational Statistics & Data Analysis. 2023. pii: 107689.