gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

A novel dissimilarity measure based on unsupervised random forests

Meeting Abstract

  • Cesaire Joris Kuete Fouodo - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
  • Inke R. König - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
  • Marvin N. Wright - Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 478

doi: 10.3205/20gmds365, urn:nbn:de:0183-20gmds3657

Published: February 26, 2021

© 2021 Kuete Fouodo et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Genome-wide association studies (GWAS) have successfully identified associations with well-defined phenotypes and have supported supervised classification models based on univariable analyses. For multivariable genome-wide analyses, random forests (RF) have been shown to be fast and to achieve good predictive performance in high-dimensional classification problems. In precision medicine, genome-wide data without well-defined phenotypes may also be used to identify genetically similar individuals. For such cluster analyses, unsupervised random forests (URF) have been proposed to evaluate dissimilarities between individuals. Compared with standard distance measures, URF-based dissimilarities have the advantage of easily handling both continuous and categorical high-dimensional variables. In this work, we review how URF-based dissimilarities are computed, show their limitations, and propose a new distance approach based on the depth of trees. Using the adjusted Rand index and silhouette scores to evaluate cluster quality, we show in simulations that our approach performs better than the original URF-based approach. We also compare the two approaches using real data.
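
To make the quantities named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not give formulas, so the sketch assumes the proximity-based convention commonly described for URF: the proximity of two individuals is the share of trees in which they fall into the same terminal node, and the dissimilarity is the square root of one minus that proximity. It does not implement the depth-based distance proposed in this work. The function names and the toy leaf assignments are hypothetical; in practice the leaf assignments would come from a forest fitted to separate observed from synthetic data.

```python
from collections import Counter
from math import comb, sqrt


def proximity_dissimilarity(leaf_ids):
    """Proximity-based dissimilarity (assumed convention, see lead-in).

    leaf_ids[t][i] is the terminal node reached by individual i in tree t.
    Returns a symmetric matrix d with d[i][j] = sqrt(1 - proximity(i, j)).
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Count the trees in which i and j end in the same terminal node.
            shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            d[i][j] = d[j][i] = sqrt(1.0 - shared / n_trees)
    return d


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same individuals."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    # Pair counts within contingency cells and within each partition.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical leaf assignments: 3 trees, 4 individuals.
    toy_leaves = [
        [1, 1, 2, 2],
        [3, 3, 3, 4],
        [5, 6, 7, 7],
    ]
    for row in proximity_dissimilarity(toy_leaves):
        print([round(value, 3) for value in row])
    # Same grouping with swapped label names: agreement is perfect.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The resulting matrix can be passed to any clustering method that accepts precomputed dissimilarities, and the adjusted Rand index then compares the recovered clusters with the simulated ground truth. Because it is corrected for chance, values near zero indicate agreement no better than random labelling.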

The authors declare that they have no competing interests.

The authors declare that a positive ethics committee vote has been obtained.