Generation and mutation of realistic personal identification data for the evaluation of record linkage algorithms
Published: 6 September 2024
Text
Personal data is often scattered across various stakeholders because it is collected for different purposes. This leads to a high degree of fragmentation, which necessitates the consolidation of multiple data sources in order to obtain a complete view of natural persons [1]. Linking personal data records is trivial with a globally unique personal identifier, but in most scenarios such an identifier is either not available or out of scope. Algorithms from the field of record linkage are therefore employed instead. They operate on identification data and assign a similarity score to record pairs in order to decide whether they should be merged.
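To make the idea of scoring record pairs concrete, the following is a minimal sketch using only the Python standard library. The field names, weights and threshold are assumptions chosen for illustration; production record linkage algorithms typically use more elaborate comparison functions and decision models.

```python
from difflib import SequenceMatcher

# Hypothetical field names and weights, chosen only for illustration.
FIELD_WEIGHTS = {"given_name": 0.3, "last_name": 0.4, "birth_date": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in FIELD_WEIGHTS.items()
    )


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two records should be merged (threshold is an assumption)."""
    return record_similarity(rec_a, rec_b) >= threshold


# Two spellings of the same person and one unrelated person.
a = {"given_name": "Katharina", "last_name": "Schmidt", "birth_date": "1984-03-12"}
b = {"given_name": "Katarina", "last_name": "Schmitt", "birth_date": "1984-03-12"}
c = {"given_name": "Jonas", "last_name": "Weber", "birth_date": "1991-11-02"}

print(is_match(a, b), is_match(a, c))
```

The threshold trades false merges against missed merges, which is why such algorithms need realistic test data with known true matches for evaluation.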
These record linkage algorithms require testing on realistic data to evaluate their efficacy in real-world situations [2]. However, due to the sensitive nature of identification data, access to real-world test data has in the past been mostly limited to researchers with personal ties to medical institutions [3], [4]. This has led to the creation of tools that generate realistic-looking personal data from publicly available data sources. To the best of our knowledge, all previously published tools are either inactive, unmaintained, closed source or outdated.
We present Gecko, an open-source Python library for the generation and mutation of personal identification data based on public data and error sources. It builds on GeCo, which showed the promise of creating reproducible and shareable scripts to generate data [5]. However, the original library is difficult to integrate into data science applications. Gecko addresses this by reimplementing GeCo’s core features on top of popular data science libraries and by addressing GeCo’s limitations: it allows the generation of arbitrarily complex multivariate data, fine-grained control over its randomized routines, and data mutation across multiple fields instead of single fields.
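The generate-then-mutate workflow can be illustrated with a short standard-library sketch. It does not use Gecko's API; all function names, frequency tables and probabilities below are hypothetical. It shows three of the ideas named above: multivariate generation (the given name depends on the drawn gender), a seeded random number generator for reproducibility, and a mutation that spans two fields.

```python
import random

# Hypothetical frequency tables; real ones would come from public statistics.
GIVEN_NAMES = {
    "f": (["Anna", "Maria", "Sophie"], [5, 3, 2]),
    "m": (["Paul", "Lukas", "Felix"], [4, 4, 2]),
}
LAST_NAMES = (["Müller", "Schmidt", "Fischer"], [5, 3, 2])


def generate_record(rng: random.Random) -> dict:
    """Draw one record; the given name depends on the drawn gender."""
    gender = rng.choice(["f", "m"])
    names, weights = GIVEN_NAMES[gender]
    return {
        "gender": gender,
        "given_name": rng.choices(names, weights=weights, k=1)[0],
        "last_name": rng.choices(LAST_NAMES[0], weights=LAST_NAMES[1], k=1)[0],
    }


def replace_char(value: str, rng: random.Random) -> str:
    """Single-field error: replace one character with a different lowercase letter."""
    pos = rng.randrange(len(value))
    alphabet = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != value[pos].lower()]
    return value[:pos] + rng.choice(alphabet) + value[pos + 1:]


def swap_names(record: dict) -> dict:
    """Multi-field error: given name and last name end up in each other's field."""
    swapped = dict(record)
    swapped["given_name"] = record["last_name"]
    swapped["last_name"] = record["given_name"]
    return swapped


def mutate(record: dict, rng: random.Random,
           p_typo: float = 0.3, p_swap: float = 0.1) -> dict:
    """Apply each error type independently with the given probability."""
    mutated = dict(record)
    if rng.random() < p_typo:
        mutated["last_name"] = replace_char(mutated["last_name"], rng)
    if rng.random() < p_swap:
        mutated = swap_names(mutated)
    return mutated


def make_dataset(n: int, seed: int) -> list:
    """Return (original, mutated) pairs; the same seed yields the same data."""
    rng = random.Random(seed)
    originals = [generate_record(rng) for _ in range(n)]
    return [(rec, mutate(rec, rng)) for rec in originals]


pairs = make_dataset(5, seed=42)
for original, mutated in pairs:
    print(original, "->", mutated)
```

Keeping each original next to its mutated copy preserves the ground truth that an evaluation of a record linkage algorithm needs.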
Gecko makes extensive use of Pandas data frames, which allow generated data to be exported in various interoperable file formats such as CSV. We validated that data generated by Gecko can be imported into E-PIX, which indicates that Gecko is compatible with other tools that can parse CSV. Furthermore, we extensively benchmarked Gecko to verify its performance claims. Despite the lack of comparable test data from other solutions in the field, we estimate that Gecko’s single-core performance is best-in-class by a comfortable margin.
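As a rough illustration of the CSV interoperability that data frames provide, the sketch below writes a small, hand-made data frame to CSV and reads it back. The column names and values are assumptions, and the example uses only plain pandas calls rather than anything specific to Gecko or E-PIX.

```python
import io

import pandas as pd

# Hypothetical records standing in for generated identification data.
df = pd.DataFrame({
    "given_name": ["Anna", "Paul"],
    "last_name": ["Müller", "Schmidt"],
    "birth_date": ["1984-03-12", "1991-11-02"],
})

# Write to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read everything back as strings so that dates are not reinterpreted.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype=str, keep_default_na=False)

print(restored)
```

Reading all columns as strings avoids silent type coercion, which matters for identification data where values such as postal codes or dates must survive a round trip unchanged.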
Gecko’s performance and configurability allow it to generate datasets with millions of records for the validation of record linkage algorithms in reasonable time frames. Its ability to quickly generate data on the fly opens it up for use in other data science applications where realistic identification data may be needed. A publicly available data repository allows library users to quickly test Gecko’s capabilities. We encourage users of Gecko to donate data samples from various regions and languages in order to improve multilingual coverage. Future versions of Gecko aim to provide export facilities for FHIR, as well as support for more complex error classes such as temporal errors and column shifts in data.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Sauleau EA, Paumier JP, Buemi A. Medical record linkage in health information systems by approximate string matching and clustering. BMC Medical Informatics and Decision Making. 2005 Oct 11;5(1).
2. Christen P. Data Matching. Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. 2012. Chapter The Data Matching Process. p. 23–35.
3. Nguyen L, Stoové M, Boyle D, Callander D, McManus H, Asselin J, et al. Privacy-Preserving Record Linkage of Deidentified Records Within a Public Health Surveillance System: Evaluation Study. Journal of Medical Internet Research. 2020 Jun 24;22(6):e16757.
4. Randall SM, Ferrante AM, Boyd JH, Bauer JK, Semmens JB. Privacy-preserving record linkage on large real world datasets. Journal of Biomedical Informatics. 2014 Aug 1;50:205–12. DOI: 10.1016/j.jbi.2013.12.003
5. Christen P, Vatsalan D. Flexible and extensible generation and corruption of personal data. In: CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management; 2013 Oct 27–Nov 1; San Francisco, CA, USA. p. 1165–1168. DOI: 10.1145/2505515.2507815