gms | German Medical Science

66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

26. - 30.09.2021, online

Identification of representative trees in random forests based on a new tree-based distance measure

Meeting Abstract

  • Björn-Hergen Laabs - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
  • Ana Westenberger - Institute of Neurogenetics, Universität zu Lübeck, Lübeck, Germany
  • Inke R. König - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 174

doi: 10.3205/21gmds008, urn:nbn:de:0183-21gmds0089

Published: September 24, 2021

© 2021 Laabs et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Introduction: In the life sciences, random forests are often used to train predictive models. However, gaining explanatory insight into the mechanisms that lead to a specific outcome is difficult, which impedes the adoption of random forests in clinical practice. Typically, variable importance measures are used to evaluate the relevance of variables for the outcome. However, they can neither explain how a variable influences the outcome nor detect interactions between variables; furthermore, they completely ignore the tree structure in the forest. A different approach is to select a single tree or a small set of trees from the ensemble that best represents the forest. Reducing a complex ensemble of decision trees to a few representative trees is assumed to make it possible to observe common tree structures, the importance of specific features, and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants.

Methods: Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. The tree-based distance metrics proposed so far [1] compare trees by their predictions, by the clustering of observations in the terminal nodes, or by the variables used for splitting. Therefore, they either need an additional data set for calculating the distances or capture only a small proportion of the tree architecture. We therefore developed a new tree-based distance measure that does not need an additional data set and incorporates more of the tree structure: it evaluates not only whether a certain variable was used for splitting in the tree, but also where in the tree it was used. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set of patients with X-linked dystonia-parkinsonism.
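
The selection idea described in the Methods can be illustrated with a minimal Python sketch. It is not the timbR implementation, and the abstract does not give the exact form of the new distance measure. The weighting 2^(-depth), the tree encoding as (variable index, depth) pairs, and all function names below are assumptions made purely for illustration. With weighted=False, the distance only records whether a variable was used for splitting, in the spirit of the splitting-variable metric in [1]; with weighted=True, a split near the root counts more than a split deep in the tree.

```python
from itertools import combinations


def usage_scores(splits, n_vars, weighted=True):
    """Score each variable by how it is used in one tree.

    splits: list of (variable_index, depth) pairs, root split at depth 0.
    weighted=False: score 1.0 if the variable is used at all, else 0.0.
    weighted=True: ASSUMED weighting 2 ** (-depth), taking the shallowest
    use of each variable (hypothetical, not the authors' definition).
    """
    scores = [0.0] * n_vars
    for var, depth in splits:
        score = 2.0 ** (-depth) if weighted else 1.0
        scores[var] = max(scores[var], score)
    return scores


def tree_distance(splits_a, splits_b, n_vars, weighted=True):
    """Mean absolute difference of the per-variable usage scores."""
    a = usage_scores(splits_a, n_vars, weighted)
    b = usage_scores(splits_b, n_vars, weighted)
    return sum(abs(x - y) for x, y in zip(a, b)) / n_vars


def most_representative(forest, n_vars, k=1, weighted=True):
    """Indices of the k trees with the smallest summed distance to all others."""
    total = [0.0] * len(forest)
    for i, j in combinations(range(len(forest)), 2):
        d = tree_distance(forest[i], forest[j], n_vars, weighted)
        total[i] += d
        total[j] += d
    return sorted(range(len(forest)), key=lambda i: total[i])[:k]


# Toy forest over three variables (indices 0, 1, 2); made-up data.
forest = [
    [(0, 0), (1, 1)],          # tree 0: root splits on 0, then 1
    [(0, 0), (1, 1), (2, 2)],  # tree 1: as tree 0, plus a deep split on 2
    [(0, 0), (2, 1)],          # tree 2: root splits on 0, then 2
]
# Same variables as tree 0, but with root and child split swapped.
swapped = [(1, 0), (0, 1)]

print(most_representative(forest, n_vars=3, k=1))
print(tree_distance(forest[0], swapped, 3, weighted=False))
print(tree_distance(forest[0], swapped, 3, weighted=True))
```

In this toy example, the unweighted distance between tree 0 and the swapped tree is zero because both use the same variables, whereas the position-aware variant separates them. This mirrors the motivation stated above, under the assumed weighting.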

Results: We show that our new distance metric is superior in depicting the differences in tree structures, while it is also one of the best measures for representing the prediction of the complete random forest. Furthermore, we found that the most representative tree selected by our method has the best prediction performance on independent validation data compared to the trees selected by other metrics. Additionally, we observed that in most scenarios, a subset of the three to five most representative trees gives more accurate predictions than the complete ensemble.

Discussion: In our simulation study, we were able to show the advantages of our weighted splitting variable approach and observed that removing poorly grown trees could be a further use case of tree-based distance measures. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study [2], but can also give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the package timbR (https://github.com/imbs-hl/timbR).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.

This contribution has already been published: [3]


References

1. Banerjee M, Ding Y, Noone A-M. Identifying representative trees from ensembles. Stat Med. 2012;31(15):1601-16.
2. Laabs BH, Klein C, Pozojevic J, Domingo A, Brüggemann N, Grütz K, et al. Identifying genetic modifiers of age-associated penetrance in X-linked dystonia-parkinsonism. Nat Commun. 2021;12:3216. DOI: 10.1038/s41467-021-23491-4
3. Laabs BH, König IR. Identification of representative trees in random forests based on a new tree-based distance measure. In: 67. Biometrisches Kolloquium; 2021; Münster. p. 24. Available from: https://www.biometrisches-kolloquium2021.de/wp-content/uploads/2021/03/BK2021_Book_of_Abstracts_updated.pdf