gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Validation and Data Dredging in Cluster Analysis

Meeting Abstract

  • Theresa Ullmann - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University Munich, München, Germany
  • Anne-Laure Boulesteix - Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University Munich, München, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 81

doi: 10.3205/20gmds148, urn:nbn:de:0183-20gmds1487

Published: 26 February 2021

© 2021 Ullmann et al.
This article is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Clustering is an important tool for class discovery in biology and medicine. For example, researchers cluster patients with a certain illness in order to detect different subtypes of a disease, which may ultimately be useful for precision medicine.

During cluster analysis, the researcher is confronted with an overwhelming number of different algorithms and methodological choices. Often, it is not clear a priori which choices should be made for the analysis, and even once a choice is made, it may remain unclear how good the resulting clustering is.

These problems have motivated the development of “cluster validation” techniques. The literature distinguishes between internal validation (where the clustering is evaluated based on internal properties such as compactness and separateness of the clusters) and external validation (where the clustering is evaluated by comparing the clusters with respect to a variable that was not used for clustering, e.g. a survival time or a true class membership).
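The distinction between internal and external validation can be illustrated with a small sketch (our illustration, not code from the abstract), assuming scikit-learn and synthetic data: the silhouette coefficient serves as an internal criterion, and the adjusted Rand index against a known class membership serves as an external criterion.

```python
# Illustrative sketch of internal vs. external cluster validation
# on synthetic data (assumed tooling: scikit-learn).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic data with a known class membership
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal validation: compactness and separateness of the clusters,
# judged from the clustered data alone
internal = silhouette_score(X, labels)

# External validation: agreement with a variable that was not used
# for clustering (here, the known class membership)
external = adjusted_rand_score(true_labels, labels)

print(f"silhouette: {internal:.2f}, adjusted Rand index: {external:.2f}")
```

Note that external validation requires an additional variable (class labels, survival times), whereas internal validation is always available but cannot detect agreement with external structure.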

However, less attention has been given to validating cluster results on a separate validation dataset, for which we introduce a systematic framework. While clustering is often considered “exploratory analysis”, a framework for the replicability of cluster results is particularly important in medicine. We discuss which part of the clustering process “validation” refers to, which clustering properties can be validated, and how. A validation dataset may be a part of the original dataset that was set aside before the analysis began, or a new dataset obtained, for example, from a different study centre. While biomedical researchers frequently “validate” their cluster results on new data, these approaches have never been systematically structured and evaluated, in contrast to supervised learning models, where validation on new data is routine.

Notably, cluster validation on a validation dataset may also detect “data dredging” effects: when researchers try different clustering algorithms or parameters during the analysis, they can use classical internal and external validation methods to choose a single clustering from among the results. However, when many different methods are tried, the accumulation of results can produce an overoptimistic finding through multiple-comparison effects. Trying to replicate the result on validation data can help detect such effects.
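A minimal sketch of this multiple-comparison effect, assuming scikit-learn and pure noise data (our illustration, not the authors' analysis): even when the data contain no cluster structure at all, selecting the best internal score over many tried methods yields a seemingly favourable result.

```python
# Sketch of "data dredging" in clustering: try many algorithms and
# parameters on structureless noise, then keep the best internal score
# (assumed tooling: scikit-learn).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # pure noise, no real cluster structure

best = -1.0
for k in range(2, 9):
    for algo in (KMeans(n_clusters=k, n_init=10, random_state=0),
                 AgglomerativeClustering(n_clusters=k)):
        score = silhouette_score(X, algo.fit_predict(X))
        best = max(best, score)

# The maximum over 14 tried configurations looks better than any single
# pre-specified analysis would, even though the data are pure noise.
print(f"best silhouette over all tried methods: {best:.3f}")
```

Refitting the selected method on a held-out validation set would reveal that the apparent structure does not replicate.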

After introducing our validation framework, we explain how it differs from related work, such as stability methods for cluster model selection. Moreover, we illustrate our framework through data analyses with a gene expression dataset from The Cancer Genome Atlas. We repeatedly split the dataset into a training and a validation set, apply different clustering algorithms (e.g. k-means, hierarchical, spectral, model-based) and parameters (e.g. number of clusters) to the training set, choose a clustering, and then perform different validation analyses using the validation set. In doing so, we also illustrate data dredging effects: for example, after randomly permuting the survival times of cancer patients and then applying many methods to cluster patients based on gene expression, we can easily find a method that leads to significant survival differences between clusters.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.