A novel dissimilarity measure based on unsupervised random forests
Published: 26 February 2021
Genome-wide association studies (GWAS) have successfully identified associations with well-defined phenotypes and have supported supervised classification models based on univariable analyses. For multivariable genome-wide analyses, random forests (RF) have been shown to be fast and to achieve good predictive performance in high-dimensional classification problems. In precision medicine, genome-wide data can also be used without a well-defined phenotype, for example to identify genetically similar individuals. For such cluster analyses, unsupervised random forests (URF) have been proposed as a way to evaluate dissimilarities between individuals. Compared with standard distance measures, URF-based dissimilarities have the advantage of easily handling both continuous and categorical high-dimensional variables. In this work, we review how URF-based dissimilarities are computed, show their limitations, and propose a new distance measure based on the depth of trees. Using the adjusted Rand index and silhouette scores to evaluate cluster quality, we show in simulations that our approach performs better than the original URF-based approach. We also compare the two approaches on real data.
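For orientation, the following is a minimal sketch of how a URF-based dissimilarity is commonly computed and then evaluated with the adjusted Rand index and a silhouette score. It assumes the widely used construction in which synthetic contrast data are created by permuting each variable independently, a forest is trained to separate observed from synthetic rows, and the dissimilarity is the square root of one minus the proximity (the share of trees in which two individuals fall into the same leaf). This may differ in detail from the computation reviewed in this work, and it does not implement the proposed depth-based distance, which the abstract does not specify. The helper name `urf_dissimilarity`, the simulated genotype data, and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score, silhouette_score


def urf_dissimilarity(X, n_trees=200, seed=0):
    """Hypothetical helper: proximity-based URF dissimilarity, sqrt(1 - proximity)."""
    rng = np.random.default_rng(seed)
    # Synthetic contrast data: permute every column independently. This keeps
    # each marginal distribution but destroys dependence between variables.
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(len(X), dtype=int), np.zeros(len(X), dtype=int)])

    # The forest learns to tell observed rows (1) from synthetic rows (0).
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # leaves[i, t] is the index of the leaf that observed row i reaches in tree t.
    leaves = forest.apply(X)
    # Proximity: fraction of trees in which two rows end up in the same leaf.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    D = np.sqrt(np.clip(1.0 - proximity, 0.0, None))
    np.fill_diagonal(D, 0.0)
    return D


# Simulated genotype-like data (assumption): 60 individuals, 20 variants coded
# 0/1/2, with two groups that differ in allele frequency.
rng = np.random.default_rng(42)
truth = np.repeat([0, 1], 30)
allele_freq = np.where(truth[:, None] == 0, 0.1, 0.6)
X = rng.binomial(2, allele_freq, size=(60, 20))

D = urf_dissimilarity(X)

# Average-linkage hierarchical clustering on the dissimilarity matrix, cut into
# two clusters (the clustering method is an assumption for this sketch).
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

ari = adjusted_rand_score(truth, labels)
sil = silhouette_score(D, labels, metric="precomputed")
print(f"adjusted Rand index: {ari:.3f}, silhouette score: {sil:.3f}")
```

The adjusted Rand index requires known group labels, so it is only available in simulations like this one, whereas the silhouette score uses the dissimilarity matrix alone and can also be computed on real data.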
The authors declare that they have no competing interests.
The authors declare that a positive ethics committee vote has been obtained.