NFDI4Health Workflows for Re-Identification Risk Management
Published: September 6, 2024
Text
Introduction: NFDI4Health is establishing a comprehensive infrastructure for sharing epidemiological, public health, and clinical trial data. This includes the development of services and workflows that support data holders in analyzing and managing re-identification risks. Previous work has demonstrated the value of anonymization as a protection mechanism for medical datasets. Anonymization is the process of irreversibly altering data so that it can no longer be linked to an identified or identifiable natural person, or can be linked only by means that are not reasonably likely to be used. In NFDI4Health, selected approaches are being refined, integrated, and made accessible to data holders. More specifically, a workflow and tools have been developed for schema-level risk assessments (i.e., at the level of attributes and their data types), record-level data anonymization, and the analysis of residual risks.
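To make the notion of record-level re-identification risk concrete, the following is a minimal sketch of one common measure: each record's risk is taken as one divided by the number of records that share its combination of quasi-identifying values. The function name, the attribute names, and the toy records are illustrative assumptions. This is not necessarily the risk model configured in the NFDI4Health pipeline.

```python
from collections import Counter

def record_risks(records, quasi_identifiers):
    """Return one risk value per record: 1 / size of its equivalence class.

    An equivalence class is the set of records that share the same
    combination of values in the quasi-identifying attributes.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

# Toy records (hypothetical); age is already generalized into ranges.
records = [
    {"age": "40-49", "sex": "f", "diagnosis": "A"},
    {"age": "40-49", "sex": "f", "diagnosis": "B"},
    {"age": "50-59", "sex": "m", "diagnosis": "A"},
]

risks = record_risks(records, ["age", "sex"])
highest_risk = max(risks)                # driven by the unique third record
average_risk = sum(risks) / len(risks)
```

A risk threshold can then be checked against the highest or the average value, depending on which adversary assumptions a data holder considers relevant.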
Methods: The first step of the developed workflow is a risk assessment at attribute level. It identifies attributes that may increase re-identification risks as well as attributes that, due to their sensitive nature, require additional protection against inference. This step is supported by a spreadsheet in which attributes can be ranked according to different criteria, combined with a threshold approach. Following the assessment, we offer an open-source software pipeline based on the ARX Data Anonymization Tool, which can be configured using the result of the risk assessment and re-identification risk thresholds. ARX was selected after a thorough review of existing tools supporting anonymization processes. The pipeline produces an anonymized dataset and a report with statistics describing the changes made to the dataset. These statistics provide valuable insights into the effect of anonymization without the need to access the actual data directly. As a last step, a tool is provided for quantifying the risk of membership inference, i.e., the likelihood that an adversary could determine whether a specific target individual was included in the dataset. The method used, originally designed for synthesized data, was adapted for use with anonymized datasets, which also supports a comparison of anonymized and synthesized data within NFDI4Health.
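The attribute-level assessment with a threshold could be sketched roughly as follows. The criteria (replicability, availability, distinguishability), the 1-to-3 scale, the threshold value, and the role names are assumptions chosen for illustration. The abstract does not specify the criteria used in the NFDI4Health spreadsheet.

```python
from dataclasses import dataclass

@dataclass
class AttributeScore:
    """Scores for one attribute; criteria and scale are hypothetical."""
    name: str
    replicability: int       # 1 (low) to 3 (high): value is stable over time
    availability: int        # 1 to 3: value is likely known to an adversary
    distinguishability: int  # 1 to 3: value separates individuals well
    sensitive: bool = False  # needs protection against inference

    def total(self) -> int:
        return self.replicability + self.availability + self.distinguishability

def classify(attributes, threshold=7):
    """Assign each attribute a role based on its total score and sensitivity."""
    roles = {}
    for attribute in attributes:
        if attribute.total() >= threshold:
            roles[attribute.name] = "quasi-identifying"
        elif attribute.sensitive:
            roles[attribute.name] = "sensitive"
        else:
            roles[attribute.name] = "insensitive"
    return roles

# Illustrative attributes and scores, not taken from any study dataset.
attributes = [
    AttributeScore("age", 3, 3, 2),
    AttributeScore("sex", 3, 3, 1),
    AttributeScore("diagnosis", 2, 1, 2, sensitive=True),
    AttributeScore("lab_value", 1, 1, 2),
]

roles = classify(attributes)
```

In a setup like this, the resulting roles would serve as the configuration input for the anonymization step: quasi-identifying attributes are transformed to meet the risk thresholds, while sensitive attributes receive additional protection.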
Results: We present an application of this workflow and the developed services to a dataset provided by the EPIC-Potsdam study located at the German Institute of Human Nutrition. Ultimately, we were able to anonymize the dataset effectively, maintaining its usability while significantly reducing privacy risks, even with relatively conservative parameter adjustments.
Discussion and outlook: Our results show that the approach can successfully be employed to anonymize real-world study datasets. Currently, the utility of anonymized datasets is measured using rather generic metrics, which may not fully capture the impact of anonymization. In future work, we plan to adopt more flexible approaches to enable users of the workflow to assess the impact more specifically, e.g. by providing use case-specific plugins.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Fluck J, Lindstädt B, Ahrens W, Beyan O, Buchner B, Darms J, et al. NFDI4Health - Nationale Forschungsdateninfrastruktur für personenbezogene Gesundheitsdaten. Bausteine Forschungsdatenmanag. 2021;2/2021:72-85. DOI: 10.17192/bfdm.2021.2.8331
2. Jakob CE, Kohlmayer F, Meurers T, Vehreschild JJ, Prasser F. Design and evaluation of a data anonymization pipeline to promote Open Science on COVID-19. Sci Data. 2020;7(1):435. DOI: 10.1038/s41597-020-00722-x
3. Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—Current status and challenges ahead. Softw Pract Exper. 2020;50(7):1277-1304. DOI: 10.1002/spe.2782
4. Haber AC, Sax U, Prasser F, NFDI4Health Consortium. Open tools for quantitative anonymization of tabular phenotype data: literature review. Brief Bioinform. 2022;23(6):bbac440. DOI: 10.1093/bib/bbac440
5. Stadler T, Oprisanu B, Troncoso C. Synthetic data–anonymisation groundhog day. In: 31st USENIX Security Symposium (USENIX Security 22). 2022. p. 1451-1468.