gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Perturbation of proteomic biomarker data for sharing: gaining privacy whilst preserving utility?

Meeting Abstract

  • Jeppe Christensen - Medical University of Vienna, Vienna, Austria
  • Harald Mischak - Mosaiques Diagnostics & Therapeutics AG, Hannover, Germany
  • Georg Heinze - Medical University of Vienna, Vienna, Austria

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 449

doi: 10.3205/20gmds363, urn:nbn:de:0183-20gmds3630

Published: February 26, 2021

© 2021 Christensen et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Sharing medical data is important for science, but it can compromise patient privacy, which is a major concern. One possible approach is to release privacy-preserving data sets, in which the original values are changed so that patients become harder to identify (increasing ‘privacy’) while analyses still give approximately the same results (keeping ‘utility’).

Different approaches preserve different levels of utility and privacy, and since these two qualities counteract each other, they need to be balanced. With prediction models built from proteomic biomarker data, this is further complicated by zero-inflated and asymmetric distributions. In this paper we investigate to what extent data perturbation methods can preserve utility, i.e. the performance of such prediction models, while guaranteeing a certain level of privacy. We exemplify the trade-off between privacy and utility by means of a biomedical study in which a binary outcome, progression of chronic kidney disease, is to be predicted from proteomic biomarkers.

Methods: We consider a perturbation method that consists of three steps: (1) transforming the data to principal components, (2) permuting the principal components that are not needed to achieve a specified threshold of variance explained, and (3) reversing the transformation. We consider thresholds ranging from 60% to 99%, and expect that this will yield different levels of privacy.
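The three steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `perturb_by_pc_permutation`, the use of an SVD on mean-centred data, and the choice to permute each non-retained component independently across patients are all assumptions that the abstract does not specify.

```python
import numpy as np

def perturb_by_pc_permutation(X, threshold=0.9, seed=0):
    """Hypothetical sketch: permute the principal components that are not
    needed to reach `threshold` (share of variance explained)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean  # centre each variable (assumption: no scaling)

    # (1) Transform to principal components: Xc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s  # principal-component scores, one row per patient

    # Smallest number of leading components whose cumulative share of
    # variance reaches the threshold
    explained = np.cumsum(s**2) / np.sum(s**2)
    n_keep = min(int(np.searchsorted(explained, threshold)) + 1, len(s))

    # (2) Permute every remaining component across patients
    # (assumption: each component is shuffled independently)
    for j in range(n_keep, scores.shape[1]):
        scores[:, j] = rng.permutation(scores[:, j])

    # (3) Reverse the transformation
    return scores @ Vt + mean, n_keep
```

With this construction the variable means and the total variance of the data are unchanged; only the assignment of the low-variance components to patients is shuffled. A lower threshold keeps fewer components intact and therefore perturbs the data more.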

We define the ‘utility’ of a data set as the performance of a LASSO prediction model that is trained on it. In particular, we use the Brier score and the AUROC and compare the values obtained from the original data set with those achieved after perturbation. We also measure the average agreement on selected biomarkers, expressed by true and false positives.
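For illustration, the utility measures can be computed as follows. The Brier score and AUROC follow their standard definitions; the helper `selection_agreement` is hypothetical and assumes that the biomarkers selected on the original data serve as the reference set when counting true and false positives, which the abstract does not state explicitly.

```python
import numpy as np

def brier_score(y, p):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

def auroc(y, p):
    """Probability that a random case receives a higher prediction than a
    random non-case; ties count one half (Mann-Whitney formulation)."""
    y = np.asarray(y).astype(bool)
    p = np.asarray(p, dtype=float)
    diff = p[y][:, None] - p[~y][None, :]  # all case/non-case pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def selection_agreement(selected_original, selected_perturbed):
    """Hypothetical helper: biomarkers selected on the original data are
    treated as the reference set."""
    orig, pert = set(selected_original), set(selected_perturbed)
    return {
        "true_positives": len(orig & pert),   # selected in both models
        "false_positives": len(pert - orig),  # selected only after perturbation
    }
```

Comparing these values between a model trained on the original data and one trained on a perturbed copy, over repeated data splits, gives both the average utility and its variability.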

To measure ‘privacy’ we consider two quantities based on the expected proximity (in terms of Euclidean distance) of a patient's original data to that patient's perturbed data.
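The abstract does not name the two quantities, so the following sketch uses two plausible stand-ins, both assumptions: the mean Euclidean distance between each patient's original and perturbed record, and the share of patients whose perturbed record lies closer to their own original record than to any other patient's. The function name `privacy_measures` is hypothetical.

```python
import numpy as np

def privacy_measures(X, X_pert):
    """Hypothetical privacy measures based on Euclidean proximity.
    Row i of X and row i of X_pert must belong to the same patient."""
    X = np.asarray(X, dtype=float)
    Xp = np.asarray(X_pert, dtype=float)

    # D[i, j] = distance between perturbed record i and original record j
    D = np.linalg.norm(Xp[:, None, :] - X[None, :, :], axis=2)

    # Average distance of a perturbed record to the same patient's original
    mean_own_distance = float(np.diag(D).mean())

    # Share of perturbed records whose nearest original record is their own
    relinked = float(np.mean(np.argmin(D, axis=1) == np.arange(len(X))))
    return mean_own_distance, relinked
```

Under these assumed definitions, a larger mean distance and a smaller relinked share would both indicate more privacy.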

Results: On average, the models trained on the privacy-preserving data sets yield AUROC values almost as high as those trained on the original data. They also select the same variables as the original models fairly often. However, models trained on the privacy-preserving data sets appear to be much more sensitive to the data splitting: the lower the threshold of variance explained, the more volatile the AUROC values and the variable selections become. More stability is achieved if the proteomic variables are pre-transformed to avoid outliers.

Conclusion: Privacy was acceptable only for high levels of perturbation. Nevertheless, irrespective of the level of perturbation, with appropriately pre-transformed proteomic variables, the expected values of utility (AUROC and Brier score) did not differ from the corresponding values in the original sample. However, the more we perturbed the data, the more variability in utility measures was observed.

While it seems that the demands of privacy and utility cannot be easily reconciled, we think that our perturbation method based on principal components, coupled with a sensible pre-transformation of the data, is a step in the right direction.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.