Article
Predictions by Random Forests – Confidence Intervals and their Coverage Probabilities
Published: February 26, 2021
Background: Random forests are a popular supervised learning method, first proposed by Breiman [1]. Their main purpose is the robust prediction of an outcome based on a learned set of rules. To evaluate the precision of predictions, their variability and distribution are important. To quantify this, 95 % confidence intervals for the predictions can be constructed using suitable variance estimators. However, these variance estimators may under- or overestimate the true variance, so that the resulting confidence intervals are either too narrow or too wide. This can be assessed by estimating coverage probabilities through simulations.
Methods: The aim of our study was to examine coverage probabilities for two variance estimators for predictions made by random forests: the infinitesimal jackknife according to Wager et al. [2] and the fixed-point based variance estimator according to Mentch and Hooker [4]. To our knowledge, this is the first comparative simulation study of these variance estimators. In our comparison we considered different scenarios with varying sample sizes and signal-to-noise ratios. Furthermore, we analysed the performance of both variance estimators on a real dataset: a stroke dataset from the German Stroke Study Collaboration [3], for which we evaluated the coverage probabilities for different forest sizes and predictions.
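To make the simulation design concrete, the following Python sketch implements a minimal version of the infinitesimal-jackknife variance estimator of Wager et al. [2] for a small bagged ensemble of regression trees, and estimates the empirical coverage of the resulting 95 % intervals by Monte Carlo repetition. This is not the authors' code: the data-generating process, tree depth, sample size, and forest size are illustrative assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees, rng):
    """Fit bagged regression trees, recording the bootstrap count N_bi
    (how often training point i appears in the sample of tree b)."""
    n = len(y)
    trees, counts = [], np.zeros((n_trees, n))
    for b in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        counts[b] = np.bincount(idx, minlength=n)
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return trees, counts

def ij_variance(trees, counts, x0):
    """Bias-corrected infinitesimal-jackknife variance of the ensemble
    prediction at x0, following Wager et al. [2]."""
    preds = np.array([t.predict(x0.reshape(1, -1))[0] for t in trees])
    B, n = counts.shape
    # per-training-point covariance between bootstrap counts and tree predictions
    cov = ((counts - counts.mean(axis=0)) * (preds - preds.mean())[:, None]).mean(axis=0)
    raw = np.sum(cov ** 2)
    # Monte Carlo bias correction for a finite number of trees B
    correction = n / B ** 2 * np.sum((preds - preds.mean()) ** 2)
    return preds.mean(), max(raw - correction, 0.0)

def coverage_probability(reps=40, n=60, n_trees=100, seed=1):
    """Fraction of repetitions in which the 95 % interval at x0 = 0.5
    covers the true regression value (here 0.5, under a toy linear model)."""
    rng = np.random.default_rng(seed)
    x0, hits = np.array([0.5]), 0
    for _ in range(reps):
        X = rng.uniform(0, 1, (n, 1))
        y = X[:, 0] + rng.normal(0, 0.2, n)           # assumed toy DGP
        trees, counts = fit_bagged_trees(X, y, n_trees, rng)
        mean, var = ij_variance(trees, counts, x0)
        half = 1.96 * np.sqrt(var)
        hits += int(mean - half <= 0.5 <= mean + half)
    return hits / reps

cov_hat = coverage_probability()
print(f"Estimated coverage of nominal 95 % intervals: {cov_hat:.2f}")
```

As the article reports, such empirical coverage can deviate from the nominal 95 % for small datasets and small forests; varying `n`, `n_trees`, and the noise level reproduces the kind of scenarios compared in the study.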
Results: Our results show that the coverage probabilities based on the infinitesimal jackknife fall below the desired 95 % for small datasets and small random forests, whereas the variance estimator according to Mentch and Hooker [4] leads to overestimated coverage probabilities. For both methods, however, a growing number of trees yields decreasing coverage probabilities. Similar behaviour was observed on the real dataset, where the composition of the data and the number of trees influence the coverage probabilities.
Conclusion: We observed that the relative performance of the two variance estimation methods depends on the hyperparameters used to train the random forest. Likewise, the coverage probabilities can be used to assess how well the hyperparameters were chosen and whether the dataset requires further pre-processing.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Breiman L. Random Forests. Mach Learn. 2001;45:5–32.
2. Wager S, Hastie T, Efron B. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J Mach Learn Res. 2014;15:1625–1651.
3. German Stroke Study Collaboration. Predicting Outcome after Acute Ischemic Stroke: An External Validation of Prognostic Models. Neurology. 2004;62:581–585.
4. Mentch L, Hooker G. Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J Mach Learn Res. 2016;17:1–41.