Article
Biomarker selection in a federated setting with data protection constraints
Published: February 26, 2021
Text
Background: Genetic biomarkers are becoming increasingly important for the clinical management of patients, but discovering new ones requires large cohorts. In many countries, data protection constraints forbid the exchange of individual-level data between research institutes, although researchers would still like to share the information those data contain. In principle, non-disclosive summary statistics can still be exchanged, but this is often done manually and requires explicit permission before each transfer. The DataSHIELD framework enables automatic exchange in iterative calls, but it lacks methods for more complex tasks such as variable selection.
Methods: We propose a multivariable regression modeling approach that identifies biomarkers by automatic variable selection, based solely on non-disclosive aggregated data obtained from different institutions in iterative calls. The approach should be applicable in consortia with high-dimensional data and complex correlation structures. This also implies that the amount of transferred data and the number of data calls should be limited, so that compliance with data protection constraints can be confirmed manually.
We propose a regularized regression approach based on componentwise likelihood-based boosting that requires only the univariate effect estimates obtained from a linear regression for the endpoint of interest and the covariance matrix of the covariates. Additionally, we present a heuristic version of the approach that aims to reduce the number of data calls.
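To illustrate why these two aggregates are sufficient, the following minimal sketch runs a plain componentwise least-squares boosting loop on summary statistics only and compares it with the same loop on individual-level data. It is not the authors' implementation: the function names, the step size `nu`, the fixed number of steps, and the simulated data are all assumptions made for illustration, and the heuristic version is not shown.

```python
import numpy as np


def boost_from_aggregates(xty, cov, n_steps=50, nu=0.1):
    """Componentwise boosting from aggregates only (illustrative sketch).

    xty : (p,) vector X^T y / n, i.e. univariate effects for standardized X
    cov : (p, p) matrix X^T X / n of the covariates
    """
    cov = np.asarray(cov, dtype=float)
    score = np.asarray(xty, dtype=float).copy()  # X^T (y - X beta) / n
    beta = np.zeros(score.shape[0])
    var = np.diag(cov)
    for _ in range(n_steps):
        # Pick the covariate whose univariate fit to the residuals helps most.
        j = int(np.argmax(score**2 / var))
        step = nu * score[j] / var[j]
        beta[j] += step
        # Updating the residual correlations needs only column j of cov.
        score -= step * cov[:, j]
    return beta


def boost_individual_level(X, y, n_steps=50, nu=0.1):
    """The same procedure with access to individual-level data, for comparison."""
    beta = np.zeros(X.shape[1])
    resid = y.astype(float).copy()
    ss = (X**2).sum(axis=0)
    for _ in range(n_steps):
        fits = X.T @ resid / ss
        j = int(np.argmax(fits**2 * ss))
        beta[j] += nu * fits[j]
        resid -= nu * fits[j] * X[:, j]
    return beta


# Simulated example (assumed data): two informative covariates out of ten.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=200)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # global standardization
y = y - y.mean()

beta_agg = boost_from_aggregates(X.T @ y / len(y), X.T @ X / len(y))
beta_ind = boost_individual_level(X, y)
```

Because each update touches only one column of the covariance matrix, a federated version could request columns on demand rather than transferring the full matrix at once.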
Results: Assuming globally standardized data, the analysis is mathematically equivalent to the analogous individual-level analysis. In a simulation study, the information loss introduced by local standardization is seen to be minimal. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3, rendering manual data releases feasible. Furthermore, we demonstrate how the approach can grant protected access to a single site, in an application to genome-wide DNA methylation biomarker data. Specifically, we apply the approach to data from brain tumor specimens retrieved from the archive of the Institute of Neuropathology, Justus Liebig University, Gießen. We show how DNA methylation biomarkers associated with the histopathological classification can be identified without access to the individual-level data.
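The equivalence under global standardization can be checked numerically. The sketch below assumes, hypothetically, that each site returns its sample size, column sums, and cross-product matrices; the `site_summary` and `pool` helpers and the simulated three-site data are illustrative and are not part of DataSHIELD or of the authors' implementation.

```python
import numpy as np


def site_summary(X, y):
    """Aggregates one site would release (hypothetical interface)."""
    return {"n": X.shape[0], "sx": X.sum(axis=0), "sy": y.sum(),
            "xtx": X.T @ X, "xty": X.T @ y}


def pool(summaries):
    """Combine site aggregates into globally standardized statistics."""
    n = sum(s["n"] for s in summaries)
    mean_x = sum(s["sx"] for s in summaries) / n
    mean_y = sum(s["sy"] for s in summaries) / n
    cov = sum(s["xtx"] for s in summaries) / n - np.outer(mean_x, mean_x)
    cxy = sum(s["xty"] for s in summaries) / n - mean_x * mean_y
    sd = np.sqrt(np.diag(cov))
    # Univariate effects and covariance matrix of the standardized covariates.
    return cxy / sd, cov / np.outer(sd, sd)


# Three simulated sites (assumed data) with different covariate shifts.
rng = np.random.default_rng(1)
sites = []
for shift in (0.0, 0.5, -0.3):
    Xs = rng.normal(loc=shift, size=(80, 5))
    ys = Xs[:, 0] - 0.5 * Xs[:, 2] + rng.normal(size=80)
    sites.append((Xs, ys))

xty_pooled, corr_pooled = pool([site_summary(Xs, ys) for Xs, ys in sites])

# Reference: the same quantities computed from the stacked individual-level data.
X_all = np.vstack([Xs for Xs, _ in sites])
y_all = np.concatenate([ys for _, ys in sites])
Z = (X_all - X_all.mean(axis=0)) / X_all.std(axis=0)
xty_direct = Z.T @ (y_all - y_all.mean()) / len(y_all)
corr_direct = Z.T @ Z / len(y_all)
```

Here the pooled and directly computed quantities agree up to floating-point error. Standardizing locally at each site instead would use site-specific means and standard deviations, which is the source of the small information loss mentioned above.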
Conclusion: Gradient-based methods can be adapted easily to a federated setting under data protection constraints. The method presented here can be used in this setting to perform automatic variable selection and can thus aid in identifying new biomarkers. We provide an implementation of the heuristic version in the DataSHIELD framework.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.