Article
Which properties need protection? An application to identify vulnerable patterns in health datasets
Published: September 6, 2024
Text
Introduction: When sharing medical data, anonymization is one of the key strategies for addressing privacy concerns and is often legally mandated, for example for the Center for Cancer Registry Data at the Robert Koch Institute and the Health Data Lab (FDZ Gesundheit) at the Federal Institute for Drugs and Medical Devices. However, good anonymization strategies depend heavily on the context in which data is shared, including factors such as the recipients involved, their potential background knowledge and their technical capabilities. Only when this context has been accurately understood can suitable anonymization strategies be implemented and the corresponding tools be set up correctly.
State of the art: Several methodologies have been suggested for qualitatively and quantitatively analyzing the properties of data that increase re-identification risks, the likely background knowledge of anticipated adversaries, and contextual information such as the security controls implemented by the recipients. However, these methods can be complex to apply, particularly when multiple aspects are to be studied in combination, for example to develop anonymization concepts.
Concept: Our aim was to develop a graphical application that supports a range of common risk assessment methods and allows them to be combined, enabling integrated modelling of re-identification risks for different threat actors.
Implementation: We have developed a web application, using PostgreSQL, Spring Boot and Angular.js, which allows users to perform a structured risk assessment to identify sensitive and potentially identifying variables within a specific data sharing context. Based on a method proposed by Malin et al., several anticipated adversaries can be modelled. The results of this analysis can then be used to derive anonymization methods that protect the data against multiple threats.
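To illustrate the general idea of assessing variables against several modelled adversaries, the following minimal Python sketch may help. It is not the application's implementation and not the method of Malin et al.: the `Adversary` class, the per-variable knowledge scores in [0, 1], the worst-case (maximum) aggregation and the threshold of 0.5 are all assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Adversary:
    """A hypothetical threat actor with assumed background knowledge."""

    name: str
    # Assumed score in [0, 1] per variable: how plausible it is that this
    # adversary already knows the variable's value for a target person.
    knowledge: dict[str, float] = field(default_factory=dict)


def flag_identifying_variables(
    variables: list[str],
    adversaries: list[Adversary],
    threshold: float = 0.5,  # illustrative cut-off, not taken from the source
) -> dict[str, float]:
    """Return variables whose worst-case score over all adversaries
    reaches the threshold, together with that worst-case score."""
    flagged: dict[str, float] = {}
    for var in variables:
        # Variables an adversary has no entry for are scored 0.0.
        worst = max((a.knowledge.get(var, 0.0) for a in adversaries), default=0.0)
        if worst >= threshold:
            flagged[var] = worst
    return flagged


# Invented example data; variable names and scores are placeholders.
adversaries = [
    Adversary("external researcher", {"birth_year": 0.2, "zip_code": 0.1}),
    Adversary("acquaintance", {"birth_year": 0.9, "zip_code": 0.8, "diagnosis": 0.3}),
]
flagged = flag_identifying_variables(
    ["birth_year", "zip_code", "diagnosis"], adversaries
)
print(flagged)  # {'birth_year': 0.9, 'zip_code': 0.8}
```

Taking the maximum over adversaries is one simple way to protect against the most knowledgeable threat actor; other aggregation rules are equally conceivable, and the scores themselves would have to come from expert judgement.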
Lessons learned: While the developed application is useful for several steps in developing data anonymization processes, it currently requires specific expertise to establish the risk scores associated with each variable. In future work, we aim to extend the application with questionnaires that assist users in specifying properties of the data sharing context, to meet the growing need to make anonymization tools available to a wider audience.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Haber AC, Sax U, Prasser F; NFDI4Health Consortium. Open tools for quantitative anonymization of tabular phenotype data: literature review. Brief Bioinform. 2022;23(6):bbac440. DOI: 10.1093/bib/bbac440
2. Malin B, Loukides G, Benitez K, Clayton EW. Identifiability in biobanks: models, measures, and mitigation strategies. Hum Genet. 2011;130(3):383-92. DOI: 10.1007/s00439-011-1042-5
3. Jakob CE, Kohlmayer F, Meurers T, Vehreschild JJ, Prasser F. Design and evaluation of a data anonymization pipeline to promote Open Science on COVID-19. Sci Data. 2020;7(1):435. DOI: 10.1038/s41597-020-00722-x