gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Genome-wide conditional independence testing with machine learning

Meeting Abstract

  • Marvin N. Wright - Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany
  • David S. Watson - University of Oxford, Oxford, United Kingdom

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 50

doi: 10.3205/20gmds069, urn:nbn:de:0183-20gmds0698

Published: February 26, 2021

© 2021 Wright et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

In genetic epidemiology, we face extremely high-dimensional data and complex patterns such as gene-gene or gene-environment interactions. Machine learning is therefore a promising alternative to classical statistical methods for analyzing such data. However, most methods for statistical inference with machine learning test against a marginal null hypothesis and therefore cannot properly handle correlated predictor variables.

Building on the knockoff framework of Candès et al. [1], we propose the conditional predictive impact (CPI), a provably consistent and unbiased estimator of a variable's association with a given outcome, conditional on a reduced set of predictor variables. The method works in conjunction with any supervised learning algorithm and loss function. Simulations confirm that our inference procedures successfully control type I error and achieve nominal coverage probability with greater power than alternative variable importance measures and other nonparametric tests of conditional independence. We apply our method to a gene expression dataset on breast cancer. Further, we propose a modification that avoids the computation of the high-dimensional knockoff matrix and is computationally feasible on data from genome-wide association studies.
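The general idea can be illustrated with a small sketch. The abstract does not state a formula, so the following assumes that the CPI of a variable is the mean increase in per-sample test loss when that variable is replaced by a synthetic copy drawn conditionally on the remaining predictors, followed by a one-sided test of whether that mean exceeds zero. Everything in it is illustrative rather than the authors' implementation: the function names (`fit_ols`, `predict`, `cpi`, `simulate`) are hypothetical, the learner is a two-predictor least-squares fit, the loss is squared error, a simple Gaussian conditional sampler stands in for a proper model-X knockoff generator, and a normal approximation replaces the t distribution.

```python
import random
from statistics import NormalDist, mean, stdev


def fit_ols(X, y):
    """Least-squares fit with two predictors and no intercept (toy learner)."""
    s11 = sum(a * a for a, _ in X)
    s12 = sum(a * b for a, b in X)
    s22 = sum(b * b for _, b in X)
    t1 = sum(a * yi for (a, _), yi in zip(X, y))
    t2 = sum(b * yi for (_, b), yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    # Solve the 2x2 normal equations in closed form.
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


def predict(beta, row):
    return beta[0] * row[0] + beta[1] * row[1]


def cpi(X_train, y_train, X_test, y_test, j, rng):
    """Assumed CPI: mean test-loss increase after replacing feature j."""
    beta = fit_ols(X_train, y_train)
    k = 1 - j  # index of the other predictor
    # Stand-in for a knockoff sampler (an assumption of this sketch):
    # model x_j given x_k as Gaussian with a linear mean, fitted on training data.
    slope = sum(r[j] * r[k] for r in X_train) / sum(r[k] ** 2 for r in X_train)
    resid_sd = (
        sum((r[j] - slope * r[k]) ** 2 for r in X_train) / len(X_train)
    ) ** 0.5
    deltas = []
    for row, yi in zip(X_test, y_test):
        substituted = list(row)
        substituted[j] = slope * row[k] + rng.gauss(0, resid_sd)
        loss_original = (yi - predict(beta, row)) ** 2
        loss_substituted = (yi - predict(beta, substituted)) ** 2
        deltas.append(loss_substituted - loss_original)
    estimate = mean(deltas)
    std_error = stdev(deltas) / len(deltas) ** 0.5
    # One-sided test of H0: CPI <= 0, using a normal approximation.
    p_value = 1.0 - NormalDist().cdf(estimate / std_error)
    return estimate, std_error, p_value


def simulate(n, rng):
    """x1 and x2 are correlated; y depends on x1 only."""
    X, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.8 * x1 + 0.6 * rng.gauss(0, 1)
        X.append((x1, x2))
        y.append(x1 + rng.gauss(0, 1))
    return X, y


rng = random.Random(0)
X_train, y_train = simulate(1000, rng)
X_test, y_test = simulate(1000, rng)
for j, name in enumerate(["x1", "x2"]):
    est, se, p = cpi(X_train, y_train, X_test, y_test, j, rng)
    print(f"{name}: CPI={est:.3f}, SE={se:.3f}, p={p:.3g}")
```

In this toy setting, x2 is correlated with the outcome only through x1, so a marginal test would flag it, whereas the conditional substitution leaves the loss almost unchanged. Because the procedure only needs predictions and a per-sample loss, the linear fit could in principle be swapped for any supervised learner.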

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Candès E, Fan Y, Janson L, Lv J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J Royal Stat Soc Ser B Methodol. 2018;80:551–577.