gms | German Medical Science

66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

26. - 30.09.2021, online

Privacy-preserving duplicate detection of patient data at different sites as a prerequisite for distributed statistical analysis – implementation in the DIFUTURE consortium

Meeting Abstract

  • Adam Mahmoud - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany; DIFUTURE (Data Integration for Future Medicine), Munich, Germany
  • Ulrich Mansmann - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität, Munich, Germany; DIFUTURE (Data Integration for Future Medicine), Munich, Germany
  • Isabel Reinhardt - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität, Munich, Germany; Trusted Third Party of the Faculty of Medicine, Ludwig-Maximilians-Universität, Munich, Germany
  • Daniel Nasseh - Comprehensive Cancer Center, Munich, Germany
  • Fady Albashiti - DIFUTURE (Data Integration for Future Medicine), Munich, Germany; Medical Data Integration Center, Munich, Germany
  • Verena S. Hoffmann - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität, Munich, Germany; DIFUTURE (Data Integration for Future Medicine), Munich, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 148

doi: 10.3205/21gmds069, urn:nbn:de:0183-21gmds0698

Published: September 24, 2021

© 2021 Mahmoud et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The DIFUTURE consortium [1] of the Medical Informatics Initiative (MI-I) unlocks routine clinical data for medical research that requires pooling data across multiple sites, either to reach a critical sample size or to ensure that derived results generalize [2]. To address data privacy issues, e.g. missing consent information, DIFUTURE relies on privacy-preserving distributed analysis, e.g. based on DataSHIELD [3].

The main challenge when analyzing federated data is to detect records of the same individual that are scattered across the participating sources. Removing such duplicates through a privacy-preserving duplicate search is therefore a necessary step. It requires both a technical and an administrative workflow.

State of the Art: Duplicate detection is a critical part of privacy-preserving record linkage. On the data provider side, a hash function or Bloom filters [4] are applied to the data [5]. The record linkage itself is commonly performed by an independent third party [6].
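
The Bloom filter encoding mentioned above can be illustrated with a minimal sketch in the spirit of [4]. It is not the encoding used in this project, which relies on hash values and deterministic matching. The key, the filter size, the number of hash functions, and the use of HMAC-SHA256 for double hashing are assumptions chosen for illustration.

```python
import hashlib
import hmac

# Hypothetical secret shared by the data providers (assumption for this sketch).
SECRET_KEY = b"example-shared-secret"


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into overlapping 2-character tokens."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def keyed_int(label: bytes, token: str) -> int:
    """Keyed hash of a token, returned as an integer."""
    digest = hmac.new(SECRET_KEY, label + token.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def bloom_encode(value: str, size: int = 1000, num_hashes: int = 10) -> set[int]:
    """Return the bit positions that are set in the Bloom filter of `value`."""
    positions: set[int] = set()
    for gram in bigrams(value):
        h1 = keyed_int(b"1", gram)
        h2 = keyed_int(b"2", gram)
        # Double hashing: derive num_hashes positions from two base hashes.
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two sets of bit positions (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Unlike a plain hash, this encoding tolerates small spelling differences: similar names share most bigrams and therefore most bit positions, so `dice(bloom_encode("Meier"), bloom_encode("Meyer"))` is much higher than the score for two unrelated names.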

Concept: We present a web-based, up-to-date implementation of federated privacy-preserving record linkage (PPRL). A harmonized selection of identifiers is defined, validated, and normalized. PPRL builds on hashing the identifying data at the participating sites and linking the hashed records at a trusted third party (TTP). The TTP identifies duplicated entries by comparing the hash values in a deterministic record linkage process. Only the hash values leave each site; these are considered anonymous according to Recital 26 of the EU GDPR. Once the number of duplicates is known, analysis plans and data sets can be adapted accordingly.
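
The site-side step of this concept can be sketched as follows. This is a minimal illustration, not the project's Angular/PHP code: the abstract does not specify the hash function, so the keyed HMAC-SHA256, the shared key, and the normalization rules below are assumptions.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical key agreed between the sites (assumption for this sketch).
SHARED_KEY = b"example-key-agreed-between-sites"


def normalize(value: str) -> str:
    """Lower-case, transliterate German umlauts, drop accents and punctuation."""
    value = unicodedata.normalize("NFC", value.strip().lower())
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        value = value.replace(src, dst)
    decomposed = unicodedata.normalize("NFKD", value)
    # Keep ASCII letters and digits only; combining accents are dropped here.
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum())


def hash_field(value: str, key: bytes = SHARED_KEY) -> str:
    """Return a keyed hash of the normalized value; empty input stays empty."""
    token = normalize(value)
    if not token:
        return ""
    return hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because normalization runs before hashing, `hash_field("Müller")` and `hash_field(" MUELLER ")` yield the same token, so the TTP can detect the match by exact comparison without seeing the name. Leaving empty fields empty keeps missing values recognizable on the TTP side.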

Implementation: Hashing is built into a front-end web application based on Angular and PHP, which is hosted by the TTP. The data set of each site consists of the patient's (multiple) first and last names, birth name, date of birth, sex, and insurance number. Data validation, harmonization, and hashing are performed client-side in a web browser at each site. This includes checking the encoding and column names, deleting superfluous columns, harmonizing spelling, and separating double names. An entry is considered a duplicate if the insurance number matches or if all of the following criteria are met: (1) at least one of the first names or last names matches; (2) the date of birth matches; (3) the sex is equal or empty.
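
The duplicate criteria can be written down as a small predicate. The sketch below is an illustration under assumptions: the `Record` type and the `split_names` helper are hypothetical, criterion (1) is read as "one shared first name or one shared last name suffices", and an empty sex on either side counts as compatible. Plain-text values are used for readability. Since the rule only tests equality, it works the same way on hash values.

```python
import re
from dataclasses import dataclass


def split_names(value: str) -> frozenset[str]:
    """Separate double names such as 'Anna-Maria' into individual lower-cased parts."""
    parts = re.split(r"[\s\-]+", value.strip().lower())
    return frozenset(part for part in parts if part)


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout for this sketch."""
    site: str
    first_names: frozenset[str]
    last_names: frozenset[str]
    date_of_birth: str
    sex: str
    insurance_number: str


def is_duplicate(a: Record, b: Record) -> bool:
    """Apply the deterministic matching rule to two records."""
    # A matching, non-empty insurance number is sufficient on its own.
    if a.insurance_number and a.insurance_number == b.insurance_number:
        return True
    name_match = bool(a.first_names & b.first_names) or bool(a.last_names & b.last_names)
    dob_match = bool(a.date_of_birth) and a.date_of_birth == b.date_of_birth
    sex_ok = not a.sex or not b.sex or a.sex == b.sex
    return name_match and dob_match and sex_ok
```

For example, a record with first names "Anna-Maria" at one site and a record with first name "Anna" at another site match if the date of birth is identical and the sex is equal or missing in one of them.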

The proposed PPRL provides a table with one entry per unique person, indicating the sites visited. This table can subsequently be used to cleanse the data.
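
One way to derive such a table from pairwise duplicate decisions is to group matching records with a union-find structure. The sketch below is an assumption about how this step could look, not the project's implementation. The record layout (dictionaries with a `site` field) and the function name `build_person_table` are hypothetical.

```python
from itertools import combinations
from typing import Callable


def build_person_table(records: list[dict], is_duplicate: Callable[[dict, dict], bool]) -> list[dict]:
    """Group records that match directly or transitively and list the sites per person."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare all pairs; this is quadratic and only meant for small examples.
    for i, j in combinations(range(len(records)), 2):
        if is_duplicate(records[i], records[j]):
            parent[find(j)] = find(i)

    clusters: dict[int, set[str]] = {}
    for index, record in enumerate(records):
        clusters.setdefault(find(index), set()).add(record["site"])

    return [
        {"person": number, "sites": sorted(sites)}
        for number, sites in enumerate(clusters.values(), start=1)
    ]
```

With four records from sites A, B, B, and C, of which the first, third, and fourth belong to the same person, the result contains two entries: one person seen at A, B, and C, and one person seen only at B.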

Lessons Learned: The implemented PPRL approach is secure, has been evaluated on a gold standard data set, and returns useful results on a real data set. Web technology makes it easy to install, maintain, and use. Client-side data validation has proven to be an effective measure for ensuring correct data submission, especially regarding data structure. Beyond the technical implementation, it is also necessary to establish an administrative framework within which the record linkage is performed. Reaching this goal and mastering the respective ELSI aspects was the second achievement of this project.

Acknowledgement: DIFUTURE is funded by the Bundesministerium für Bildung und Forschung (BMBF): 01ZZ1804C.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Prasser F, Kohlbacher O, Mansmann U, Bauer B, Kuhn KA. Data Integration for Future Medicine (DIFUTURE). Methods Inf Med. 2018;57(S 01):e57–65.
2.
Hripcsak G. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–578.
3.
Budin-Ljøsne I, Burton P, Isaeva J, Gaye A, Turner A, Murtagh MJ, et al. DataSHIELD: an ethically robust solution to multiple-site individual-level data analysis. Public Health Genomics. 2015;18(2):87–96.
4.
Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Med Inform Decis Mak. 2009;9(1):41.
5.
Bian J, Loiacono A, Sura A, Mendoza Viramontes T, Lipori G, Guo Y, et al. Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network. JAMIA Open. 2019;2(4):562–9.
6.
Nasseh D, Stausberg J. Evaluation of a binary semi-supervised classification technique for probabilistic record linkage. Methods Inf Med. 2016;55(2):136–43.