A new resampling strategy improves proximity estimation with the unsupervised random forest algorithm
Published: 24 September 2021
Text
Random forests (RF) are fast and perform well in high-dimensional classification problems. In precision medicine, another use of large-scale data is to stratify individuals into homogeneous subgroups. In this unsupervised learning setting, unsupervised random forests (URF) can be used to compute dissimilarities between individuals [1], which can then serve as input for clustering algorithms. The crucial step of URF is the generation of an artificial data set by resampling the original values of the individuals. The original and artificial data sets are combined, and the standard RF algorithm is trained to classify observations as original or artificial. The proximity between each pair of individuals is obtained by counting how often they end up in the same terminal node across the forest, and the dissimilarity is derived from this proximity.
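The URF procedure described above can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, generates the artificial data by resampling each variable independently from its observed values (one of the resampling schemes described in [1], not the LDR strategy proposed here), and converts proximity to dissimilarity as sqrt(1 - proximity), which is one common choice. The function name `urf_dissimilarity` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def urf_dissimilarity(X, n_trees=500, seed=0):
    """Sketch of URF dissimilarities (hypothetical helper, not the authors' code)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Artificial data: resample every column independently from its observed
    # values, which keeps the marginals but breaks the dependence structure.
    X_syn = np.column_stack(
        [rng.choice(X[:, j], size=n, replace=True) for j in range(p)]
    )

    # Combine both data sets; label 1 = original, 0 = artificial.
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # Terminal-node index of every original observation in every tree.
    leaves = forest.apply(X)  # shape (n, n_trees)

    # Proximity: fraction of trees in which two observations share a leaf.
    prox = np.zeros((n, n))
    for t in range(n_trees):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= n_trees

    # Assumed conversion from proximity to dissimilarity.
    return np.sqrt(1.0 - prox)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any clustering method that accepts precomputed dissimilarities, for example hierarchical clustering or partitioning around medoids.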
We review the resampling approaches proposed by Shi and Horvath [1], explain their limitations, and propose a new, intuitive strategy based on the low-density regions (LDR) of the marginal distribution of each variable. We perform a simulation study to compare the different approaches. The results show that resampling original data from LDR improves the quality of the dissimilarities between individuals and leads to more homogeneous clusters.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Shi T, Horvath S. Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. 2006;15(1):118–38.