gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

Anonymize and Synthesize – Enhanced Privacy-Preserving Methods for Heart Failure Score Analytics

Meeting Abstract

  • Tim Ingo Johann - University Hospital Heidelberg, Heidelberg, Germany
  • Karen Otte - Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
  • Fabian Prasser - Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
  • Christoph Dieterich - Klaus Tschira Institute for Integrative Computational Cardiology, Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research (DZHK), Partner Site Heidelberg/Mannheim, Heidelberg, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 805

doi: 10.3205/24gmds152, urn:nbn:de:0183-24gmds1524

Published: 6 September 2024

© 2024 Johann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Modern data-driven medical research depends on data availability, but sharing sensitive medical data raises privacy concerns. Traditional anonymization methods, such as k-anonymity combined with generalization and outlier suppression, and newer synthetic data generation techniques, which use generative AI models (Generative Adversarial Networks and Variational Autoencoders) or classical ML methods (Gaussian copula), offer privacy protection, but their effectiveness and utility need evaluation. We investigate these aspects using a real-world dataset of cardiology patients’ health records. We also combine anonymization with synthesis to benefit from the strengths of both approaches.
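To illustrate what k-anonymity with generalization and outlier suppression means in practice, the following is a minimal sketch in plain Python. It is not the procedure used in this work: the records are invented, the field names (`age`, `sex`, `lvef`), the 10-year age bands, and the helper names `generalise` and `k_anonymise` are assumptions made for the example, and dedicated tools search for a generalization scheme rather than fixing one by hand.

```python
from collections import Counter


def generalise(record):
    """Coarsen the quasi-identifiers: 10-year age bands, sex unchanged (hypothetical rule)."""
    low = record["age"] // 10 * 10
    return (f"{low}-{low + 9}", record["sex"])


def k_anonymise(records, k):
    """Generalise quasi-identifiers, then suppress records whose
    equivalence class contains fewer than k records."""
    keys = [generalise(r) for r in records]
    class_sizes = Counter(keys)
    released = []
    for record, key in zip(records, keys):
        if class_sizes[key] >= k:
            released.append({"age": key[0], "sex": key[1], "lvef": record["lvef"]})
    return released


# Invented toy records, not taken from the study data.
records = [
    {"age": 61, "sex": "f", "lvef": 35},
    {"age": 64, "sex": "f", "lvef": 40},
    {"age": 67, "sex": "f", "lvef": 30},
    {"age": 72, "sex": "m", "lvef": 45},
    {"age": 78, "sex": "m", "lvef": 25},
    {"age": 93, "sex": "m", "lvef": 50},  # outlier: alone in its equivalence class
]

released = k_anonymise(records, k=2)
for row in released:
    print(row)
```

With `k=2`, every released record shares its generalized age band and sex with at least one other record, and the single record in the 90-99 band is suppressed as an outlier.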

Methods: As a basis, we employed the tabular health record dataset of the HiGHmed Use Case Cardiology [1], which consists of 2441 entries with 34 features, 18 of which were used to calculate the heart failure risk scores Barcelona BioHF and MAGGIC. We applied anonymization [2], [3] and synthesis [4], [5] techniques separately and in combination. We assessed the utility of the resulting datasets by calculating heart failure risk scores in line with the original study. Each feature was evaluated for statistical equivalence with the original data. For univariate distributions, we used the Kolmogorov-Smirnov test for continuous variables and the χ² test for categorical/boolean variables. Furthermore, the distributions of the heart failure risk scores, as a derived quantity, were compared to the distribution in the original population by means of the respective cumulative distribution functions. Finally, a re-identification risk assessment was performed using membership inference, attribute inference, and singling-out attacks.
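The univariate comparison described above can be sketched as follows. This is an illustrative, standard-library-only computation of the two test statistics, not the authors' evaluation code: the function names `ks_statistic` and `chi2_statistic` and the toy values are assumptions. In practice, library routines such as `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency` also return the p-values needed to judge significance.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(original, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    gap between the two empirical cumulative distribution functions."""
    a, b = sorted(original), sorted(synthetic)
    # The maximum gap is attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def chi2_statistic(original, synthetic):
    """Pearson chi-squared statistic and degrees of freedom for the
    2 x k contingency table (dataset x category)."""
    counts = [Counter(original), Counter(synthetic)]
    categories = set(original) | set(synthetic)
    totals = [len(original), len(synthetic)]
    grand = sum(totals)
    stat = 0.0
    for c in categories:
        column_total = counts[0][c] + counts[1][c]
        for row in (0, 1):
            expected = totals[row] * column_total / grand
            stat += (counts[row][c] - expected) ** 2 / expected
    return stat, len(categories) - 1


# Invented toy columns, not taken from the study data.
original_lvef = [25, 30, 35, 40, 45, 50]
synthetic_lvef = [27, 31, 36, 41, 44, 52]
original_diabetes = [True, True, False, False, False, True]
synthetic_diabetes = [True, False, False, False, True, True]

print("KS statistic:", ks_statistic(original_lvef, synthetic_lvef))
print("chi-squared statistic, dof:", chi2_statistic(original_diabetes, synthetic_diabetes))
```

A KS statistic near 0 means the empirical distribution functions of the original and processed column nearly coincide; the same function can be applied to the derived risk scores to compare their cumulative distribution functions.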

Results: Both anonymization and synthesis methods preserved statistical properties with minimal deviations from the original dataset. Anonymization followed by synthesis showed a significant difference in the distribution of only 1 of the 18 variables of the dataset. The assessed re-identification risks were low for both anonymized and synthetic datasets. Combining both techniques lowered the risks even further, with a reasonable trade-off in utility.

Conclusion: Anonymization and synthesis methods offer effective privacy protection while preserving data utility for heart failure risk score computations. Combining these methods provides enhanced privacy guarantees and scalability for sharing medical data. Further research is needed to address limitations and improve privacy risk estimation methods, specifically in the context of small training datasets.

The generated datasets will be shared with the scientific community under a use and access agreement.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Sommer KK, Amr A, Bavendiek U, Beierle F, Brunecker P, Dathe H, Eils J, Ertl M, Fette G, Gietzelt M, Heidecker B, Hellenkamp K, Heuschmann P, Hoos JDE, Kesztyüs T, Kerwagen F, Kindermann A, Krefting D, Landmesser U, Marschollek M, Meder B, Merzweiler A, Prasser F, Pryss R, Richter J, Schneider P, Störk S, Dieterich C. Structured, Harmonized, and Interoperable Integration of Clinical Routine Data to Compute Heart Failure Risk Scores. Life (Basel). 2022 May 18;12(5):749. DOI: 10.3390/life12050749
2.
Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX: Current status and challenges ahead. Softw Pract Exp. 2020 Jul;50:1277–1304.
3.
Prasser F, Bild R, Eicher J, Spengler H, Kohlmayer F, Kuhn KA. Lightning: Utility-Driven Anonymization of High-Dimensional Data. Trans Data Priv. 2016;9:161–185.
4.
Johann TI, Wilhelmi H. ASyH - Anonymous Synthesizer for Health Data. GitHub; 2023. Available from: https://github.com/dieterich-lab/ASyH
5.
Patki N, Wedge R, Veeramachaneni K. The synthetic data vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016 Oct 17-19; Montreal, Canada.