gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Assessment of Bloom Filter parameters for Privacy-Preserving Record Linkage

Meeting Abstract

Suche in Medline nach

  • Maximilian Jugl - Institut für Medizinische Informatik, Statistik und Epidemiologie, Universität Leipzig, Leipzig, Germany; Dept. Medical Data Science, Leipzig University Medical Center, Leipzig, Germany
  • Toralf Kirsten - Institut für Medizinische Informatik, Statistik und Epidemiologie, Universität Leipzig, Leipzig, Germany; Dept. Medical Data Science, Leipzig University Medical Center, Leipzig, Germany
  • Ulrich Sax - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Leipzig, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 167

doi: 10.3205/22gmds063, urn:nbn:de:0183-22gmds0636

Veröffentlicht: 19. August 2022

© 2022 Jugl et al.
Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung). Lizenz-Angaben siehe http://creativecommons.org/licenses/by/4.0/.


Gliederung

Text

Introduction: In the absence of an overarching unique identifier [1], Privacy-preserving record linkage (PPRL) is a probabilistic approach to find and link records in distributed data sets referring to the same real world person by considering privacy regulations. The use of Bloom filters [2], [3] has become attractive recently. Bloom filters are a well-known probabilistic data structure that obfuscate the data passed into them while also preserving the similarity between.

They are using hash functions to represent input data in a single bit vector. In this work, we study how the characteristics of input data, the used hash functions, and the configuration of the Bloom filter itself influences the similarity calculations.

Methods: By splitting the values of data records into n-grams and inserting them into a Bloom filter, it is possible to compare the set bit positions between two bit vectors. A set similarity measure can be used to express the likelihood of two bit vectors originating from the same record and, consequentially, referring to the same identity. The quality of identified matches therefore depends on four parameters: n-gram size (n), Bloom filter size (m), amount of hash functions (k) and the type of hash functions used.

We studied the effects of these parameters on a randomly generated dataset consisting of 10,000 records. Every record consists of a first and last name, gender, birth date, street, city and zip code. This dataset is then used to derive a second corrupted dataset, where a series of random mutations is applied. A pair of records is to be considered a match if they are identical or if a single character mutation has been performed on them. Records from both datasets are masked and compared pairwise. Their computed similarities are recorded. As a baseline, SHA-256 is used as a hash function with (n=2), (m=512) and (k=3) to compare against.

Results: We find that lowering (m) drastically lowers the quality of matches while keeping all other parameters constant. Raising (k) causes a degradation in quality as well, but less significantly so. Raising (n) slightly lowers the computed similarities for matches and non-matches alike, though the gap between them increases marginally. Changing hash functions has no noticeable impact on match quality.

Discussion: Selection of parameters is highly dependent on input data. Larger records require larger bloom filters. By consequence, there is always a tradeoff to be made between (m) and (k) in order to guarantee optimal results. Generally, a higher (n) will benefit the matching of identical records, but comes at the cost of error tolerance. It also greatly impacts security, since reidentification attacks against Bloom filters have already been carried out successfully if parameters are not chosen carefully [4], [5]. Ideally, Bloom filters should be configured so that approximately 50 % of bits are set.

Conclusion: In this work, we conducted an analysis of Bloom filter parameters for use in PPRL. Future research should focus on the parameterisation of alternative masking methods such as RBF or CLKRBF [6], [7]. An effort should be made towards making parameterisation less verbose for non-tech-savvy users.

This work was done as part of the NFDI4Health Task Force COVID-19 (https://www.nfdi4health.de). We gratefully acknowledge the financial support of the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project Number 451265285.

The authors declare that an ethics committee vote is not required.


References

1.
Kirsten T, Richter A, Schmidt CO, Drepper J, Kuntz AS, Kusch H, et al. The FAIR Record Linkage Challenge in NFDI4Health. In: 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). 26.-30.09.2021. GMS; 2021. DocAbstr. 166. DOI: 10.3205/21gmds005 Externer Link
2.
Schnell R, Bachteler T, Reiher J. A Novel Error-Tolerant Anonymous Linking Code. Report No.: ID 3549247. Rochester, NY: Social Science Research Network; 2011 Nov [cited 2022 Mar 14]. Available from: https://papers.ssrn.com/abstract=3549247 Externer Link
3.
Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970 Jul 1;13(7):422–6.
4.
Kuzu M, Kantarcioglu M, Durham E, Malin B. A Constraint Satisfaction Cryptanalysis of Bloom Filters in Private Record Linkage. In: Fischer-Hübner S, Hopper N, editors. Privacy Enhancing Technologies. Berlin, Heidelberg: Springer; 2011. (Lecture Notes in Computer Science). p. 226–45.
5.
Christen P, Schnell R, Vatsalan D, Ranbaduge T. Efficient Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage. In: Kim J, Shim K, Cao L, Lee JG, Lin X, Moon YS, editors. Advances in Knowledge Discovery and Data Mining. Cham: Springer International Publishing; 2017. (Lecture Notes in Computer Science). p. 628–40.
6.
Durham EA, Kantarcioglu M, Xue Y, Toth C, Kuzu M, Malin B. Composite Bloom Filters for Secure Record Linkage. IEEE Transactions on Knowledge and Data Engineering. 2014 Dec;26(12):2956–68.
7.
Schnell R, Borgs C. Randomized Response and Balanced Bloom Filters for Privacy Preserving Record Linkage. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). 2016. p. 218–24.