Anonymize and Synthesize – Enhanced Privacy-Preserving Methods for Heart Failure Score Analytics
Published: 6 September 2024
Introduction: Modern data-driven medical research depends on data availability, but sharing sensitive medical data raises privacy concerns. Two families of methods offer privacy protection: traditional anonymization, such as k-anonymity achieved through generalisation and outlier suppression, and newer synthetic data generation, using either generative AI models (Generative Adversarial Networks and Variational Autoencoders) or classical statistical models (Gaussian Copula). Their effectiveness and the utility of the resulting data still need evaluation. We investigate both aspects using a real-world dataset of cardiology patients' health records. We also combine anonymization with synthesis to draw on the strengths of both approaches.
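To make the anonymization idea concrete, the following is a minimal sketch of k-anonymity by generalisation and suppression. It is not the ARX algorithm used in this work; the function names, the ten-year age bands, and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact age by a band such as '60-69' (generalisation)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


# Hypothetical toy records; not taken from the study dataset.
patients = [
    {"age": 61, "sex": "f", "lvef": 35},
    {"age": 64, "sex": "f", "lvef": 40},
    {"age": 67, "sex": "f", "lvef": 30},
    {"age": 72, "sex": "m", "lvef": 45},
    {"age": 78, "sex": "m", "lvef": 25},
    {"age": 93, "sex": "m", "lvef": 50},  # outlier: unique even after generalisation
]

generalised = [{**p, "age": generalise_age(p["age"])} for p in patients]
released = suppress_small_groups(generalised, ["age", "sex"], k=2)
```

In this sketch every released record shares its (age band, sex) combination with at least one other record, and the single outlier is suppressed rather than forcing coarser bands on everyone.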
Methods: We used the tabular health record dataset of the HiGHmed Use Case Cardiology [1], which comprises 2441 records with 34 features, 18 of which were used to calculate the heart failure risk scores Barcelona BioHF and MAGGIC. We applied anonymization [2], [3] and synthesis [4], [5] techniques separately and in combination. We assessed the utility of the resulting datasets by calculating heart failure risk scores in line with the original study. Each feature was tested for statistical equivalence with the original data. For univariate distributions, we used the Kolmogorov-Smirnov test for continuous variables and the χ² test for categorical and boolean variables. Furthermore, the distributions of the heart failure risk scores, as derived quantities, were compared with those in the original population by means of their cumulative distribution functions. Finally, re-identification risk was assessed using membership inference, attribute inference and singling-out attacks.
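As an illustration of the univariate comparison for continuous variables, the sketch below computes the two-sample Kolmogorov-Smirnov statistic as the largest gap between two empirical cumulative distribution functions, together with an asymptotic p-value. It uses only the Python standard library, is not the authors' evaluation code, and does not cover the χ² test for categorical variables; the sample values and variable names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution.

    This is an approximation intended for large samples; for small
    samples an exact method is preferable.
    """
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The alternating series converges poorly near zero; p is ~1 here.
        return 1.0
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical values of one continuous feature; not study data.
original = [30, 35, 40, 45, 50, 55]
synthetic = [32, 36, 41, 44, 52, 58]

d = ks_statistic(original, synthetic)
p = ks_p_value(d, len(original), len(synthetic))
```

A small statistic with a large p-value gives no evidence that the two samples differ in distribution. The same empirical-CDF comparison can be applied to derived quantities such as risk scores.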
Results: Both anonymization and synthesis methods preserved statistical properties with minimal deviations from the original dataset. Anonymization followed by synthesis showed significant differences in the distribution of only 1 of the 18 variables used for the risk scores. Re-identification risks were low for both the anonymized and the synthetic datasets. Combining both techniques lowered these risks further, with a reasonable trade-off in utility.
Conclusion: Anonymization and synthesis methods offer effective privacy protection while preserving data utility for heart failure risk score computations. Combining these methods provides enhanced privacy guarantees and scalability for sharing medical data. Further research is needed to address limitations and to improve privacy risk estimation methods, particularly for small training datasets.
The generated datasets will be shared with the scientific community under a use and access agreement.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Sommer KK, Amr A, Bavendiek U, Beierle F, Brunecker P, Dathe H, Eils J, Ertl M, Fette G, Gietzelt M, Heidecker B, Hellenkamp K, Heuschmann P, Hoos JDE, Kesztyüs T, Kerwagen F, Kindermann A, Krefting D, Landmesser U, Marschollek M, Meder B, Merzweiler A, Prasser F, Pryss R, Richter J, Schneider P, Störk S, Dieterich C. Structured, Harmonized, and Interoperable Integration of Clinical Routine Data to Compute Heart Failure Risk Scores. Life (Basel). 2022 May 18;12(5):749. DOI: 10.3390/life12050749
2. Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX: Current status and challenges ahead. Softw Pract Exp. 2020 Jul;50:1277–1304.
3. Prasser F, Bild R, Eicher J, Spengler H, Kohlmayer F, Kuhn KA. Lightning: Utility-Driven Anonymization of High-Dimensional Data. Trans Data Priv. 2016;9:161–185.
4. Johann TI, Wilhelmi H. ASyH - Anonymous Synthesizer for Health Data. GitHub; 2023. Available from: https://github.com/dieterich-lab/ASyH
5. Patki N, Wedge R, Veeramachaneni K. The Synthetic Data Vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016 Oct 17-19; Montreal, Canada.