gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Implementing deep learning with Boltzmann machines in DataSHIELD

Meeting Abstract

  • Stefan Lenz - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany
  • Harald Binder - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center – University of Freiburg, Freiburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 133

doi: 10.3205/19gmds062, urn:nbn:de:0183-19gmds0627

Published: September 6, 2019

© 2019 Lenz et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Data protection imposes strong restrictions on the use of medical data. Data sharing at the level of individual patient data is often not possible. Therefore, techniques for privacy-preserving analysis are needed when data is distributed among several sites and cannot be pooled. DataSHIELD [1] is a software tool used by many multicentre studies in such settings. Many standard statistical analyses can be performed via DataSHIELD without individual data leaving the sites. This is made possible by reformulated algorithms that rely solely on aggregated statistics and do not expose information on individuals. Another approach to distributed analyses becomes possible with so-called generative models. Once trained on a data set, a generative model can generate new, synthetic data that preserve the structure of the original data but do not contain records linked to real individuals. These synthetic data can then be shared across the sites for joint analyses.
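
The aggregated-statistics principle can be illustrated with a toy example (a minimal sketch in Julia, not actual DataSHIELD code; the site data and the pooled mean are purely hypothetical):

    # Toy example of the aggregated-statistics principle (not DataSHIELD code):
    # each site returns only a sum and a count, and the client combines these
    # aggregates into a pooled mean without seeing any individual value.

    # hypothetical per-site data; in DataSHIELD this would never be transmitted
    site_data = [rand(120), rand(85), rand(200)]

    # non-disclosive aggregates reported by each site
    site_aggregates = [(sum(x), length(x)) for x in site_data]

    # client-side combination of the aggregates
    pooled_mean = sum(first.(site_aggregates)) / sum(last.(site_aggregates))
    println("pooled mean: ", pooled_mean)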

Methods: We have particularly investigated deep Boltzmann machines, a special class of neural networks, as generative models. Boltzmann machines have proved to be a viable machine learning technique even in settings with low sample sizes [2], [3], which are common in medical research. We developed a package for the Julia programming language that provides a user-friendly interface for training and evaluating deep Boltzmann machines. To make deep learning with Boltzmann machines possible in a distributed setting, we employ DataSHIELD, which allows developers to add new functionality via R functions.
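
As a rough illustration of the kind of model involved (a conceptual sketch only, not the code of our Julia package), a binary restricted Boltzmann machine, the building block of a deep Boltzmann machine, can be trained with one-step contrastive divergence and afterwards used to generate synthetic records by Gibbs sampling:

    # Conceptual sketch: a binary restricted Boltzmann machine trained with
    # one-step contrastive divergence (CD-1). Not the authors' package,
    # just an illustration of the type of generative model involved.
    using Random, LinearAlgebra

    sigmoid(x) = 1.0 ./ (1.0 .+ exp.(-x))

    function train_rbm(X::Matrix{Float64}; nhidden::Int = 8,
                       epochs::Int = 20, lr::Float64 = 0.05)
        nsamples, nvisible = size(X)
        W = 0.01 .* randn(nvisible, nhidden)   # weight matrix
        a = zeros(nvisible)                    # visible biases
        b = zeros(nhidden)                     # hidden biases
        for epoch in 1:epochs, i in 1:nsamples
            v0 = X[i, :]
            ph0 = sigmoid(W' * v0 .+ b)             # P(h | v0)
            h0 = Float64.(rand(nhidden) .< ph0)     # sampled hidden states
            v1 = Float64.(rand(nvisible) .< sigmoid(W * h0 .+ a))  # reconstruction
            ph1 = sigmoid(W' * v1 .+ b)
            W .+= lr .* (v0 * ph0' .- v1 * ph1')    # CD-1 update
            a .+= lr .* (v0 .- v1)
            b .+= lr .* (ph0 .- ph1)
        end
        return W, a, b
    end

    # Gibbs sampling from the trained model yields synthetic records
    function sample_rbm(W, a, b, nsamples; burnin = 200)
        nvisible, nhidden = size(W)
        v = Float64.(rand(nvisible) .< 0.5)
        out = Matrix{Float64}(undef, nsamples, nvisible)
        for s in 1:(burnin + nsamples)
            h = Float64.(rand(nhidden) .< sigmoid(W' * v .+ b))
            v = Float64.(rand(nvisible) .< sigmoid(W * h .+ a))
            s > burnin && (out[s - burnin, :] = v)
        end
        return out
    end

    # usage with random binary toy data
    X = Float64.(rand(100, 12) .< 0.3)
    W, a, b = train_rbm(X; nhidden = 6)
    synthetic = sample_rbm(W, a, b, 100)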

Results: We extend DataSHIELD with an R package that can be used to train deep Boltzmann machines and then generate synthetic data from the trained models. To make the features of our Julia package available in DataSHIELD, we have implemented a generic interface that allows calling Julia functions from R. In addition to the technical challenges, some methodological adjustments are needed for using Boltzmann machines – or neural networks in general – in a privacy-preserving setting. Since the number of parameters in the models can be higher than the number of measurements in the data, there may be concerns about potential information disclosure via the model parameters. Therefore, we decided that models are not allowed to leave the sites. An additional challenge, common to neural networks, is that training requires extensive hyperparameter tuning. In this setting, one must therefore be able to evaluate models without direct access to them. For this purpose, our software provides different metrics to assess model quality during and after training. This monitoring output can be displayed to the DataSHIELD client without privacy issues, even if the number of training attempts is high, because it does not contain information about individual patient data. After successful training, the final model can then be used to generate synthetic data that is handed to the researcher.
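
To illustrate the kind of monitoring output that can be returned to the client (a minimal sketch under the same binary-model assumptions as above, not the actual API of our software), an aggregate quality metric such as a mean reconstruction error summarises model fit in a single number per training attempt:

    # Sketch of the monitoring idea (an illustration, not the actual package
    # API): model and data stay on the server; only aggregate metrics and,
    # finally, synthetic samples are returned to the DataSHIELD client.
    using Random, LinearAlgebra

    sigmoid(x) = 1.0 ./ (1.0 .+ exp.(-x))

    # aggregate metric: mean reconstruction error over all records,
    # a single number that does not expose any individual patient
    function reconstruction_error(W, a, b, X)
        total = 0.0
        for i in 1:size(X, 1)
            v = X[i, :]
            h = sigmoid(W' * v .+ b)       # hidden activations
            vrec = sigmoid(W * h .+ a)     # reconstruction of the input
            total += sum(abs.(v .- vrec))
        end
        return total / size(X, 1)
    end

    # illustration with random binary data and a random (untrained) model
    X = Float64.(rand(50, 10) .< 0.3)
    W, a, b = 0.01 .* randn(10, 4), zeros(10), zeros(4)
    println("mean reconstruction error: ", reconstruction_error(W, a, b, X))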

Discussion: We present the specific implementation as well as the general approach for using generative models in a distributed setting. The resulting software is planned to be used for the joint analysis of patient data in the MIRACUM consortium [4].

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Budin-Ljøsne I, Burton P, Isaeva J, Gaye A, Turner A, Murtagh MJ, et al. DataSHIELD: An Ethically Robust Solution to Multiple-Site Individual-Level Data Analysis. Public Health Genomics. 2015;18(2):87–96.
2. Hess M, Lenz S, Blätte TJ, Bullinger L, Binder H. Partitioned learning of deep Boltzmann machines for SNP data. Bioinformatics. 2017 Oct 15;33(20):3173–80.
3. Lenz S, Zöller D, Hess M, Binder H. Architectures for distributed privacy-preserving deep learning. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS), editor. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. [Accessed 16 July 2019]. Available from: https://www.egms.de/static/en/meetings/gmds2018/18gmds097.shtml
4. MIRACUM – Medical Informatics in Research and Medicine. [Accessed 16 July 2019]. Available from: http://www.miracum.org/