gms | German Medical Science

63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

02. - 06.09.2018, Osnabrück

Architectures for distributed privacy-preserving deep learning

Meeting Abstract


  • Stefan Lenz - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Daniela Zöller - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Moritz Hess - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
  • Harald Binder - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 207

doi: 10.3205/18gmds097, urn:nbn:de:0183-18gmds0978

Published: August 27, 2018

© 2018 Lenz et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.



Text

Introduction: Distributed privacy-preserving analysis is needed for the joint analysis of data that are spread across several sites and must not leave the boundaries of those sites due to privacy protection laws and security considerations. In established meta-analytic approaches for such settings, only aggregate statistics from the different data sites are pooled. As an alternative, certain statistical models can be estimated by performing several iterations with interim statistics exchanged across the sites. This is the approach implemented for generalized linear regression models in the DataSHIELD software platform [1]. In the context of the MIRACUM consortium [2], we plan to realize more complex approaches such as deep learning techniques, which have not been practically implemented in such a setting thus far.
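
The exchange of interim statistics can be illustrated with a small R sketch (our own illustration, not the DataSHIELD API; all function names are hypothetical): a logistic regression is fitted across two simulated sites by pooling only site-level score vectors and information matrices in each Fisher-scoring iteration, so individual-level records never leave the sites.

    # Minimal sketch of iterative estimation from aggregate statistics only
    # (hypothetical function names, not the DataSHIELD API)
    site_contribution <- function(X, y, beta) {
      eta <- as.vector(X %*% beta)
      mu  <- 1 / (1 + exp(-eta))            # fitted probabilities
      w   <- mu * (1 - mu)                  # IRLS weights
      list(score = crossprod(X, y - mu),    # t(X) %*% (y - mu)
           info  = crossprod(X * w, X))     # t(X) %*% diag(w) %*% X
    }

    fit_distributed_glm <- function(sites, p, n_iter = 25) {
      beta <- rep(0, p)
      for (i in seq_len(n_iter)) {
        parts <- lapply(sites, function(s) site_contribution(s$X, s$y, beta))
        score <- Reduce(`+`, lapply(parts, `[[`, "score"))   # pool aggregates
        info  <- Reduce(`+`, lapply(parts, `[[`, "info"))
        beta  <- beta + solve(info, score)                    # Fisher-scoring step
      }
      beta
    }

    # toy example with two simulated "sites"
    set.seed(1)
    make_site <- function(n) {
      X <- cbind(1, matrix(rnorm(n * 2), n, 2))
      y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))
      list(X = X, y = y)
    }
    sites <- list(make_site(200), make_site(300))
    fit_distributed_glm(sites, p = 3)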

Methods: We investigate how partitioned deep Boltzmann machines (DBMs) [3] can be adapted for distributed privacy-preserving analysis. The basis for estimation is stochastic gradient descent, where the gradient of the target function is computed for each sample or for batches of samples, and the model parameters are then updated by a small step. Because the data are processed in small chunks during training, the algorithm lends itself naturally to training on distributed data. Only the interim models must be communicated, and they do not reveal individual data. We specifically consider different ways of communicating these interim models, aiming to minimize information loss. We also consider how these approaches can be implemented with a software platform such as DataSHIELD, which already provides the infrastructure for distributed privacy-preserving analysis. Its implementation is based on data frames from the statistical environment R, which is not ideal for the high-volume molecular measurement data to be analyzed with DBMs. Thus, we also investigate a corresponding extension.
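
The communication pattern of such distributed training can be sketched in R as follows (an illustration under our assumptions, not the actual implementation): in every round the current parameters are sent to each site, each site computes a gradient on a local mini-batch, and only these gradients are returned and pooled. A simple linear-model gradient stands in for the DBM target function to keep the sketch self-contained.

    # Minimal sketch of joint stochastic gradient descent across sites
    # (illustrative names; the DBM gradient is replaced by a linear-model gradient)
    distributed_sgd <- function(sites, grad_fn, theta0,
                                n_rounds = 100, batch_size = 32, lr = 0.05) {
      theta <- theta0
      for (r in seq_len(n_rounds)) {
        grads <- lapply(sites, function(s) {
          idx <- sample(nrow(s$X), batch_size)               # local mini-batch
          grad_fn(theta, s$X[idx, , drop = FALSE], s$y[idx])
        })
        # only gradients travel; individual records never leave the sites
        theta <- theta - lr * Reduce(`+`, grads) / length(sites)
      }
      theta
    }

    # example gradient (squared-error loss) to make the sketch executable
    lm_grad <- function(theta, X, y) {
      as.vector(crossprod(X, X %*% theta - y)) / nrow(X)
    }

    set.seed(2)
    sites <- lapply(1:3, function(i) {
      X <- cbind(1, matrix(rnorm(400), 200, 2))
      list(X = X, y = as.vector(X %*% c(1, 2, -1) + rnorm(200, sd = 0.1)))
    })
    distributed_sgd(sites, lm_grad, theta0 = rep(0, 3))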

Results: Using simulated data that mimics structures found in single nucleotide polymorphism (SNP) DNA measurements, we present performance regarding unsupervised detection of patterns. Specifically, the performance criterion is whether DBMs trained by the different distributed approaches can generate artificial data similar to the true patterns. First, we present results for joint gradient descent across sites. As a second approach, separate models are trained at the different sites, and a final model is trained on data generated from these separate models. We highlight how both approaches can be efficiently implemented for the DataSHIELD platform, using matrices in addition to R data frames for analysis and transfer.
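
The second approach can be sketched as follows (a deliberately simplified R illustration in which a per-variable Bernoulli model stands in for a DBM; the actual partitioned DBM training [3] is considerably more involved): each site fits its own generative model locally, only synthetic samples leave the sites as plain matrices, and the final model is trained on the pooled synthetic data.

    # Sketch of "train locally, share generated data" (placeholder model, not a DBM)
    fit_site_model <- function(X) {           # stands in for local DBM training
      colMeans(X)                             # per-variable Bernoulli frequencies
    }
    generate_samples <- function(model, n) {  # stands in for sampling from a DBM
      vapply(model, function(p) rbinom(n, 1, p), numeric(n))
    }

    set.seed(3)
    # two "sites" with binary SNP-like data, held as matrices rather than data frames
    site_data <- list(matrix(rbinom(500 * 10, 1, 0.3), 500, 10),
                      matrix(rbinom(800 * 10, 1, 0.4), 800, 10))

    site_models <- lapply(site_data, fit_site_model)
    synthetic   <- do.call(rbind, lapply(site_models, generate_samples, n = 1000))

    # the final model sees only generated data, never the original observations
    final_model <- fit_site_model(synthetic)
    final_model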

Discussion: Our results show the feasibility of distributed privacy-preserving training of DBMs and their ability to generate artificial data that can be used for statistical modeling in place of the originally observed data. In addition, we present a viable software implementation strategy that allows such approaches to be used in practice.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Budin-Ljøsne I, Burton P, Isaeva J, Gaye A, Turner A, Murtagh MJ, et al. DataSHIELD: An Ethically Robust Solution to Multiple-Site Individual-Level Data Analysis. Public Health Genomics. 2015;18(2):87–96.
2. MIRACUM – Medical Informatics in Research and Medicine. Available from: http://www.miracum.org/
3. Hess M, Lenz S, Blätte TJ, Bullinger L, Binder H. Partitioned learning of deep Boltzmann machines for SNP data. Bioinformatics. 2017 Oct 15;33(20):3173–80.