
66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

26-30 September 2021, online


A new resampling strategy improves proximity estimation with the unsupervised random forest algorithm

Meeting Abstract


Cesaire Joris Kuete Fouodo - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
Silke Szymczak - Universität zu Lübeck, Lübeck, Germany
Inke R. König - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 139

doi: 10.3205/21gmds089, urn:nbn:de:0183-21gmds0898

Published: 24 September 2021

© 2021 Kuete Fouodo et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license details see http://creativecommons.org/licenses/by/4.0/.


Text

Random forests (RF) are fast and perform well in high-dimensional classification problems. In precision medicine, another use of large-scale data is to stratify individuals into homogeneous subgroups. In this unsupervised learning setting, unsupervised random forests (URF) can be used to compute dissimilarities between individuals [1], which then serve as input for clustering algorithms. The crucial step of URF is the synthesis of an artificial data set by resampling the original values of the individuals. The original and artificial data sets are combined, and the standard RF algorithm is used to classify observations as original or artificial. Proximities between each pair of individuals are obtained by counting how often the two end up in the same terminal node across the forest, and dissimilarities are derived from these proximities.
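The two data-handling steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`synthesize_marginal`, `label_combined`, `proximity_matrix`, `dissimilarity_matrix`) are hypothetical, and the artificial data are drawn by sampling each variable independently from its observed values, which is only one possible resampling scheme. The forest itself is left out: the terminal-node identifiers in `leaves` are invented toy values standing in for the output of any RF implementation (for example, scikit-learn's `RandomForestClassifier.apply`). The transformation sqrt(1 - proximity) is assumed here as one common way to turn proximities into dissimilarities.

```python
import math
import random


def synthesize_marginal(data, rng):
    """Create an artificial data set with as many rows as `data`.

    Each variable is sampled independently, with replacement, from its
    observed values, so marginal distributions are roughly preserved
    while the dependence between variables is broken.
    """
    n = len(data)
    columns = list(zip(*data))
    synthetic_columns = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(row) for row in zip(*synthetic_columns)]


def label_combined(original, synthetic):
    """Stack both data sets; label 1 = original, 0 = artificial."""
    rows = [list(r) for r in original] + [list(r) for r in synthetic]
    labels = [1] * len(original) + [0] * len(synthetic)
    return rows, labels


def proximity_matrix(leaves):
    """Proximity = fraction of trees in which two observations share a leaf.

    `leaves[i][t]` is the terminal-node id of observation i in tree t.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def dissimilarity_matrix(prox):
    """Assumed transformation: dissimilarity = sqrt(1 - proximity)."""
    return [[math.sqrt(1.0 - p) for p in row] for row in prox]


# Toy usage: three individuals, two variables.
rng = random.Random(0)
original = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
synthetic = synthesize_marginal(original, rng)
rows, labels = label_combined(original, synthetic)

# Hypothetical leaf ids of the three original individuals in three trees;
# in practice these come from the forest trained on (rows, labels).
leaves = [[0, 3, 1], [0, 3, 2], [4, 5, 2]]
prox = proximity_matrix(leaves)
diss = dissimilarity_matrix(prox)
```

The resulting matrix `diss` is what would be passed on to a clustering algorithm. Only the original individuals are scored, since the artificial observations exist solely to give the forest a classification task.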

We review the resampling approaches proposed by Shi and Horvath [1], explain their limitations, and propose a new, intuitive strategy based on the low-density regions (LDR) of the marginal distribution of each variable. We perform a simulation study to compare the different approaches. The results show that resampling original data from LDR improves the quality of the dissimilarities between individuals and leads to more homogeneous clusters.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Shi T, Horvath S. Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. 2006;15(1):118–38.
