Article
Predictions by Random Forests – Confidence Intervals and their Coverage Probabilities
Published: February 26, 2021
Background: Random forests are a popular supervised learning method, first proposed by Breiman [1]. Their main purpose is the robust prediction of an outcome based on a learned set of rules. To evaluate the precision of predictions, their variability and distribution are important. To quantify this, 95 % confidence intervals for the predictions can be constructed using suitable variance estimators. However, these variance estimators may under- or overestimate the true variance, so that the resulting confidence intervals are either too narrow or too wide. This can be assessed by estimating coverage probabilities through simulations.
Methods: The aim of our study was to examine coverage probabilities for two variance estimators for predictions made by random forests: the infinitesimal jackknife according to Wager et al. [2] and the fixed-point based variance estimator according to Mentch and Hooker [4]. To our knowledge, this is the first comparative simulation study of these variance estimators. In our comparison we considered different scenarios with varying sample sizes and signal-to-noise ratios. Furthermore, we analysed the performance of both variance estimators on a real dataset: a stroke dataset from the German Stroke Study Collaboration [3], for which we evaluated the coverage probabilities for different forest sizes and predictions.
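To make the simulation design concrete, the following Python sketch implements a minimal version of the infinitesimal-jackknife variance estimator of Wager et al. [2] for a small bagged ensemble of regression trees, and estimates the empirical coverage of the resulting 95 % intervals by Monte Carlo repetition. This is not the authors' code: the data-generating process, tree depth, sample size, and forest size are illustrative assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees, rng):
    """Fit bagged regression trees, recording the bootstrap count N_bi
    (how often training point i appears in the sample of tree b)."""
    n = len(y)
    trees, counts = [], np.zeros((n_trees, n))
    for b in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        counts[b] = np.bincount(idx, minlength=n)
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return trees, counts

def ij_variance(trees, counts, x0):
    """Bias-corrected infinitesimal-jackknife variance of the ensemble
    prediction at x0, following Wager et al. [2]."""
    preds = np.array([t.predict(x0.reshape(1, -1))[0] for t in trees])
    B, n = counts.shape
    # per-training-point covariance between bootstrap counts and tree predictions
    cov = ((counts - counts.mean(axis=0)) * (preds - preds.mean())[:, None]).mean(axis=0)
    raw = np.sum(cov ** 2)
    # Monte Carlo bias correction for a finite number of trees B
    correction = n / B ** 2 * np.sum((preds - preds.mean()) ** 2)
    return preds.mean(), max(raw - correction, 0.0)

def coverage_probability(reps=40, n=60, n_trees=100, seed=1):
    """Fraction of repetitions in which the 95 % interval at x0 = 0.5
    covers the true regression value (here 0.5, under a toy linear model)."""
    rng = np.random.default_rng(seed)
    x0, hits = np.array([0.5]), 0
    for _ in range(reps):
        X = rng.uniform(0, 1, (n, 1))
        y = X[:, 0] + rng.normal(0, 0.2, n)           # assumed toy DGP
        trees, counts = fit_bagged_trees(X, y, n_trees, rng)
        mean, var = ij_variance(trees, counts, x0)
        half = 1.96 * np.sqrt(var)
        hits += int(mean - half <= 0.5 <= mean + half)
    return hits / reps

cov_hat = coverage_probability()
print(f"Estimated coverage of nominal 95 % intervals: {cov_hat:.2f}")
```

As the article reports, such empirical coverage can deviate from the nominal 95 % for small datasets and small forests; varying `n`, `n_trees`, and the noise level reproduces the kind of scenarios compared in the study.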
Results: Our results show that the coverage probabilities based on the infinitesimal jackknife fall below the desired 95 % for small datasets and small random forests, whereas the variance estimator according to Mentch and Hooker [4] leads to overestimated coverage probabilities. For both methods, however, a growing number of trees yields decreasing coverage probabilities. Similar behaviour was observed on the real dataset, where the composition of the data and the number of trees influence the coverage probabilities.
Conclusion: We observed that the relative performance of the two variance estimation methods depends on the hyperparameters used to train the random forest. Likewise, the coverage probabilities can be used to assess how well the hyperparameters were chosen and whether the dataset requires further pre-processing.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Breiman L. Random Forests. Mach Learn. 2001;45:5–32.
2. Wager S, Hastie T, Efron B. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J Mach Learn Res. 2014;15:1625–1651.
3. German Stroke Study Collaboration. Predicting Outcome after Acute Ischemic Stroke: An External Validation of Prognostic Models. Neurology. 2004;62:581–585.
4. Mentch L, Hooker G. Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J Mach Learn Res. 2016;17:1–41.