gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Data dredging in ranking analyses

Meeting Abstract


  • Christina Nießl - Institute for Medical Information Processing, Biometry, and Epidemiology at Ludwig Maximilian University Munich, Munich, Germany
  • Anne-Laure Boulesteix - Institute for Medical Information Processing, Biometry, and Epidemiology at Ludwig Maximilian University Munich, Munich, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 234

doi: 10.3205/20gmds303, urn:nbn:de:0183-20gmds3030

Published: February 26, 2021

© 2021 Nießl et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.



Text

Variable rankings play an important role in biomedical studies that investigate high-dimensional molecular data. To generate a variable ranking, the researcher has to make several decisions regarding the analysis approach. These decisions concern not only the ranking criterion itself but also, for example, data preparation steps and the choice of tuning parameters.

Although the multitude of possible analysis approaches is an important issue in all research fields, it is particularly relevant in ranking analyses involving high-dimensional data. This is because small modifications of the ranking procedure can lead to a completely different ordering of the variables.
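The sensitivity of a ranking to the chosen procedure can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the analysis used in this work: the data are simulated, the two ranking criteria (absolute Welch t-statistic and absolute mean difference) are illustrative choices, and all names (`rank_desc`, `t_scores`, `mean_diff_scores`) are hypothetical.

```python
import numpy as np

def rank_desc(scores):
    """Return ranks (1 = largest score) for a 1-D array of scores."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def t_scores(x, y):
    """Absolute Welch t-statistic per variable (rows = samples)."""
    num = x.mean(axis=0) - y.mean(axis=0)
    den = np.sqrt(x.var(axis=0, ddof=1) / x.shape[0]
                  + y.var(axis=0, ddof=1) / y.shape[0])
    return np.abs(num / den)

def mean_diff_scores(x, y):
    """Absolute difference in group means per variable."""
    return np.abs(x.mean(axis=0) - y.mean(axis=0))

# Simulated data (assumption): 10 samples per group, 1000 variables
# with heterogeneous variances; the first 20 variables truly differ.
rng = np.random.default_rng(0)
n_per_group, n_vars = 10, 1000
sd = rng.uniform(0.5, 2.0, size=n_vars)
x = rng.normal(0.0, sd, size=(n_per_group, n_vars))
y = rng.normal(0.0, sd, size=(n_per_group, n_vars))
y[:, :20] += 1.0

ranks_t = rank_desc(t_scores(x, y))
ranks_md = rank_desc(mean_diff_scores(x, y))

# Compare the two rankings: overlap of the top-50 lists and the
# largest change in rank experienced by any single variable.
top_k = 50
top_t = set(np.flatnonzero(ranks_t <= top_k))
top_md = set(np.flatnonzero(ranks_md <= top_k))
overlap = len(top_t & top_md)
max_shift = int(np.abs(ranks_t - ranks_md).max())
print(f"top-{top_k} overlap: {overlap}, largest rank shift: {max_shift}")
```

Both criteria are defensible, yet they generally order the same variables differently because only one of them accounts for the variance of each variable.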

As a consequence, researchers might be tempted to apply several ranking procedures until one of them returns a satisfying result, for example a small rank (that is, a position near the top of the list) for a specific variable that the researcher expects to be relevant based on biological knowledge. The practice of choosing the analysis approach depending on its results is generally referred to as data dredging or a fishing expedition (or fishing for significance in the case of testing) and can lead to a substantial optimistic bias. Most researchers who engage in data dredging presumably do so subconsciously or without being fully aware of the consequences.
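The optimistic bias described here can be made concrete with a simulation in which no variable is truly relevant. The sketch below is an illustration under stated assumptions rather than the framework developed in this work: the six candidate procedures (two data preparation steps combined with three ranking criteria), the simulation settings, and all function names are hypothetical. For a fixed target variable, it compares the rank obtained with one prespecified procedure to the smallest rank obtained across all procedures.

```python
import numpy as np

def rank_of_target(scores, target):
    """Rank of one variable (1 = largest score)."""
    return int(1 + np.sum(scores > scores[target]))

def abs_t(x, y):
    """Absolute Welch t-statistic per variable."""
    num = x.mean(axis=0) - y.mean(axis=0)
    den = np.sqrt(x.var(axis=0, ddof=1) / x.shape[0]
                  + y.var(axis=0, ddof=1) / y.shape[0])
    return np.abs(num / den)

def abs_mean_diff(x, y):
    """Absolute difference in group means per variable."""
    return np.abs(x.mean(axis=0) - y.mean(axis=0))

def abs_median_diff(x, y):
    """Absolute difference in group medians per variable."""
    return np.abs(np.median(x, axis=0) - np.median(y, axis=0))

def identity(a):
    """No data preparation."""
    return a

def winsorize(a):
    """Clip each variable to its 10th-90th percentile range."""
    lo, hi = np.percentile(a, [10, 90], axis=0)
    return np.clip(a, lo, hi)

# Candidate analysis approaches: preparation step x ranking criterion.
procedures = [(t, s)
              for t in (identity, winsorize)
              for s in (abs_t, abs_mean_diff, abs_median_diff)]

# Null simulation (assumption): no variable differs between groups.
rng = np.random.default_rng(1)
n_rep, n_per_group, n_vars, target = 200, 10, 500, 0
ranks = np.empty((n_rep, len(procedures)), dtype=int)
for r in range(n_rep):
    data = rng.normal(size=(2 * n_per_group, n_vars))
    for j, (transform, score) in enumerate(procedures):
        z = transform(data)
        s = score(z[:n_per_group], z[n_per_group:])
        ranks[r, j] = rank_of_target(s, target)

prespecified = ranks[:, 0]    # procedure fixed in advance
dredged = ranks.min(axis=1)   # best rank over all procedures
print(f"mean rank, prespecified procedure: {prespecified.mean():.1f}")
print(f"mean rank, after data dredging:    {dredged.mean():.1f}")
```

Because the dredged rank is a minimum over all procedures, it can never exceed the prespecified rank; the gap between the two means gives a rough sense of how much an irrelevant variable can be moved up the list purely by selecting the analysis approach after seeing the results.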

To raise awareness of data dredging in ranking analyses, it would be useful to give researchers a concrete idea of how unstable ranking results are with respect to the analysis approach and of how strongly this instability can be exploited through data dredging. For this purpose, we formalize data dredging in the context of ranking analyses and develop a framework to quantify data dredging effects. The framework is illustrated using methods for differential expression analysis on high-dimensional gene expression data sets.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Boulesteix AL, Hornung R, Sauerbrei W. On fishing for significance and statistician's degree of freedom in the era of big molecular data. In: Pietsch W, Wernecke J, Ott M, editors. Berechenbarkeit der Welt? Wiesbaden: Springer; 2017. p. 155-170.
2. Boulesteix AL, Slawski M. Stability and aggregation of ranked gene lists. Briefings in Bioinformatics. 2009;10(5):556-568.
3. Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8):e124.
4. Klau S, Martin-Magniette ML, Boulesteix AL, Hoffmann S. Sampling uncertainty versus method uncertainty: A general framework with applications to omics biomarker selection. Biometrical Journal. 2019;62(3):670-687. DOI: 10.1002/bimj.201800309