gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Biomarker selection in a federated setting with data protection constraints

Meeting Abstract

  • Daniela Zöller - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany
  • Daniel Amsel - Institute of Neuropathology, Justus-Liebig-University Gießen, Medical Faculty, Gießen, Germany
  • Stefan Lenz - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany
  • Patrick Fischer - Institute of Medical Informatics, Justus-Liebig-University Gießen – Medicine Faculty, Gießen, Germany
  • Henning Schneider - Institute of Medical Informatics, Justus-Liebig-University Gießen – Medicine Faculty, Gießen, Germany
  • Hildegard Dohmen - Institute of Neuropathology, Justus-Liebig-University Gießen, Medical Faculty, Gießen, Germany
  • Till Acker - Institute of Neuropathology, Justus-Liebig-University Gießen, Medical Faculty, Gießen, Germany
  • Harald Binder - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 312

doi: 10.3205/20gmds310, urn:nbn:de:0183-20gmds3100

Published: February 26, 2021

© 2021 Zöller et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Genetic biomarkers are becoming increasingly important for the clinical management of patients, but discovering new ones requires large cohorts. In many countries, data protection constraints forbid the exchange of individual-level data between research institutes, although researchers would like to share the information these data contain. In principle, it is still possible to exchange non-disclosive summary statistics, which is often done manually and requires explicit permission before transfer. The DataSHIELD framework enables automatic exchange in iterative calls, but methods for more complex tasks such as variable selection are missing.

Methods: We propose a multivariable regression modeling approach that identifies biomarkers by automatic variable selection, based solely on non-disclosive aggregated data obtained from different institutions in iterative calls. The approach should be applicable to high-dimensional data with complex correlation structures, as encountered in consortia. This also implies that the amount of transferred data and the number of data calls should be limited, so that compliance with data protection constraints can be confirmed manually.
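The kind of non-disclosive aggregation described above can be illustrated with a minimal sketch. It is not the DataSHIELD implementation: the function names `site_aggregates` and `pool`, the choice of sums as the exchanged quantities, and the toy data are all assumptions made for illustration. Each site returns only its sample size and sums, from which the pooled covariance matrix of the covariates and the pooled covariances with the endpoint can be reconstructed.

```python
# Illustrative sketch only; names and data are hypothetical, not DataSHIELD API.

def site_aggregates(X, y):
    """Computed locally at one site; only these sums would leave the site."""
    n, p = len(X), len(X[0])
    return {
        "n": n,
        "sum_x": [sum(row[j] for row in X) for j in range(p)],
        "sum_y": sum(y),
        "sum_xx": [[sum(row[j] * row[k] for row in X) for k in range(p)]
                   for j in range(p)],
        "sum_xy": [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)],
    }


def pool(aggregates):
    """Combine site-level sums into pooled covariances (divisor n)."""
    n = sum(a["n"] for a in aggregates)
    p = len(aggregates[0]["sum_x"])
    mean_x = [sum(a["sum_x"][j] for a in aggregates) / n for j in range(p)]
    mean_y = sum(a["sum_y"] for a in aggregates) / n
    cov = [[sum(a["sum_xx"][j][k] for a in aggregates) / n
            - mean_x[j] * mean_x[k] for k in range(p)] for j in range(p)]
    cov_xy = [sum(a["sum_xy"][j] for a in aggregates) / n
              - mean_x[j] * mean_y for j in range(p)]
    return n, cov, cov_xy


# Toy data for two hypothetical sites (two covariates each).
X_a, y_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0]], [1.0, 2.0, 4.0]
X_b, y_b = [[0.0, 1.0], [4.0, 3.0]], [0.5, 3.5]

# Federated route: each site contributes only its aggregates.
n_pooled, cov_pooled, cov_xy_pooled = pool(
    [site_aggregates(X_a, y_a), site_aggregates(X_b, y_b)]
)

# Reference route: the same quantities computed on the stacked individual data.
n_ref, cov_ref, cov_xy_ref = pool([site_aggregates(X_a + X_b, y_a + y_b)])
```

In this sketch the federated and the reference route agree up to floating-point rounding, because the sums are additive across sites. Whether a given set of sums counts as non-disclosive depends on the local data protection rules and is not addressed here.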

We propose a regularized regression approach based on componentwise likelihood-based boosting, which requires only the univariate effect estimates obtained from a linear regression for the endpoint of interest and the covariance matrix of the covariates. Additionally, we present a heuristic version of the approach that aims to reduce the number of data calls.
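To make the idea concrete, the following is a minimal sketch of componentwise boosting that operates only on summary statistics. It is a simplified stand-in, assuming a Gaussian endpoint and plain componentwise least-squares updates with a fixed step size, and it does not reproduce the penalized likelihood-based updates or the heuristic of the proposed method. The function name `componentwise_boosting`, the parameters `steps` and `nu`, and the toy inputs are hypothetical. The input `r` holds the covariances between each covariate and the endpoint (equal to the univariate effect estimates when covariates are standardized), and `C` is the covariance matrix of the covariates.

```python
# Illustrative sketch only: plain componentwise least-squares boosting on
# summary statistics, not the authors' likelihood-based implementation.

def componentwise_boosting(r, C, steps=50, nu=0.1):
    """Return coefficient estimates using only r (covariate-endpoint
    covariances) and C (covariate covariance matrix)."""
    p = len(r)
    beta = [0.0] * p
    # score[k] tracks r[k] - sum_j C[k][j] * beta[j], i.e. the covariance
    # between covariate k and the current residual.
    score = list(r)
    for _ in range(steps):
        # Select the covariate giving the largest reduction in squared error.
        j = max(range(p), key=lambda k: score[k] ** 2 / C[k][k])
        # Take a damped least-squares step for that single covariate.
        delta = nu * score[j] / C[j][j]
        beta[j] += delta
        # Update all scores through the covariance matrix; no individual-level
        # residuals are needed.
        for k in range(p):
            score[k] -= delta * C[k][j]
    return beta


# Toy example with standardized covariates: covariate 1 is correlated with
# covariate 0, but only covariate 0 has a true effect.
C = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
r = [1.0, 0.5, 0.0]  # implied by a true coefficient vector (1, 0, 0)

beta = componentwise_boosting(r, C)
selected = [k for k, b in enumerate(beta) if b != 0.0]
```

In this toy example only covariate 0 is ever updated, although covariate 1 also has a non-zero univariate association with the endpoint. The score update via `C` plays the role that updating the residuals plays in an individual-level analysis, which is why the covariance matrix and the univariate estimates suffice.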

Results: Assuming globally standardized data, the analysis is mathematically equivalent to the analogous individual-level analysis. In a simulation study, the information loss introduced by local standardization is seen to be minimal. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3, rendering manual data releases feasible. Furthermore, we demonstrate how the approach can be used to grant protected access to the data of a single site, in an application with genome-wide DNA methylation biomarker data. Specifically, we apply the approach to data from brain tumor specimens retrieved from the archive of the Institute of Neuropathology, Justus Liebig University, Gießen. We show how DNA methylation biomarkers associated with the histopathological classification can be identified without access to the individual-level data.

Conclusion: Gradient-based methods can be adapted easily to a federated setting under data protection constraints. The method presented here can be used in this setting to perform automatic variable selection and can thus aid in identifying new biomarkers. We provide an implementation of the heuristic version in the DataSHIELD framework.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.