gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Patient feature substitution with synthetic data for developing open corpora

Meeting Abstract

  • Dustin Thewes - FH Aachen, Aachen, Germany; Medizinische Fakultät der RWTH Aachen, Aachen, Germany
  • Henri Werth - FH Aachen, Jülich, Germany
  • Jan Wienströer - Institute of Medical Informatics - RWTH Aachen University, Aachen, Germany
  • Saskia von Stillfried und Rattonitz - Medizinische Fakultät der RWTH Aachen, Aachen, Germany
  • Oliver Schmidts - FH Aachen, Jülich, Germany
  • Ines Siebigteroth - FH Aachen, Jülich, Germany
  • Peter Boor - Medizinische Fakultät der RWTH Aachen, Aachen, Germany
  • Rainer Röhrig - Medizinische Fakultät der RWTH Aachen, Aachen, Germany
  • Matthias Meinecke - FH Aachen, Aachen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 2

doi: 10.3205/22gmds026, urn:nbn:de:0183-22gmds0267

Published: August 19, 2022

© 2022 Thewes et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.



Introduction: The subject of this work is the creation of a synthetic German clinical corpus that does not contain any information usable to re-identify an individual. Clinical documents include sensitive patient information. Removing the 18 protected health information (PHI) identifiers described in HIPAA [3] (cf. [1], [2]) is a legally acceptable de-identification method in the USA but not in the EU [4]. Re-identification remains a threat, e.g., by cross-referencing non-PHI patient information [5], [6].

Concept: We present a different approach to developing corpora void of personal information: replacing all potentially linkable information with fitting, pseudo-randomly chosen substitute information. We propose duplicating an arbitrary number of annotated documents hundreds of times and, in each copy, replacing all patient features with substitute information. This covers both the 18 PHI identifiers described in HIPAA and non-PHI patient features such as diagnoses.

Implementation: We obtained two autopsy reports in which the 18 PHI identifiers described in HIPAA had been replaced by placeholders (e.g., "DD.MM.YYYY" replaced a date, "XX" replaced a name). After minimal preprocessing, a data scientist and a pathologist annotated all placeholder PHI and other patient features following an annotation schema for the DeRegCOVID autopsy register. The schema contained the labels "Date of Death", "Date of Autopsy", the Causes of Death "IA", "IB", "IC", "ID", "II", and "Diagnosis". In the next step, each of the two documents was duplicated 125 times. In each copy, pseudo-randomly chosen diagnoses from ICD-10-GM Version 2020 replace "Diagnosis" and the Causes of Death ("IA", "IB", "IC", "ID", "II"). "Date of Death" is a pseudo-random date between 2000 and 2019, and "Date of Autopsy" is a pseudo-random date 1-7 days after the Date of Death.
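The substitution step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tiny `ICD10_SAMPLE` list (a stand-in for the full ICD-10-GM 2020 catalogue), the example template, and the character-offset span format are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical stand-in for ICD-10-GM Version 2020; the real catalogue is far larger.
ICD10_SAMPLE = [
    "Pneumonie, nicht näher bezeichnet",
    "Akuter Myokardinfarkt, nicht näher bezeichnet",
    "Lungenembolie ohne Angabe eines akuten Cor pulmonale",
    "COVID-19, Virus nachgewiesen",
]

def random_date(rng, start, end):
    """Return a pseudo-random date in the closed interval [start, end]."""
    return start + timedelta(days=rng.randint(0, (end - start).days))

def synthesize(text, spans, rng):
    """Replace every annotated span and return (new_text, new_spans).

    spans: list of (start, end, label) character offsets into text (assumed format).
    """
    death = random_date(rng, date(2000, 1, 1), date(2019, 12, 31))
    autopsy = death + timedelta(days=rng.randint(1, 7))  # 1-7 days after death
    parts, new_spans, cursor, length = [], [], 0, 0
    for start, end, label in sorted(spans):
        if label == "Date of Death":
            repl = death.strftime("%d.%m.%Y")
        elif label == "Date of Autopsy":
            repl = autopsy.strftime("%d.%m.%Y")
        else:  # "Diagnosis" and the causes of death IA, IB, IC, ID, II
            repl = rng.choice(ICD10_SAMPLE)
        unchanged = text[cursor:start]
        parts.append(unchanged)
        length += len(unchanged)
        # Record the annotation at its shifted position in the new text.
        new_spans.append((length, length + len(repl), label))
        parts.append(repl)
        length += len(repl)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), new_spans

# Illustrative template; not taken from the real autopsy reports.
TEMPLATE = "Verstorben am DD.MM.YYYY. Obduktion am DD.MM.YYYY. Todesursache: Lungenembolie."
first = TEMPLATE.index("DD.MM.YYYY")
second = TEMPLATE.index("DD.MM.YYYY", first + 1)
cause = TEMPLATE.index("Lungenembolie")
SPANS = [
    (first, first + 10, "Date of Death"),
    (second, second + 10, "Date of Autopsy"),
    (cause, cause + len("Lungenembolie"), "IA"),
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
corpus = [synthesize(TEMPLATE, SPANS, rng) for _ in range(125)]
```

Building the new text left to right lets the annotations be carried over with corrected offsets, so each synthetic copy stays usable as labeled training data.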

Evaluation methods: The corpus is quantitatively evaluated by training and evaluating a spaCy Named Entity Recognizer (NER) model with the default German de_core_news_lg config. The model was trained and evaluated in a 200-50-2 train-dev-test split, where the test set consists of the two real documents, which are not contained in the corpus. In addition, two synthetic documents were compared to the two original documents using spaCy's displaCy visualizer for the qualitative evaluation.
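The 200-50-2 split can be sketched as follows. This is only an illustration under assumptions: the abstract does not state how synthetic documents were assigned to the train and dev sets, so the seeded shuffle, the function name `split_corpus`, and the document identifiers are hypothetical.

```python
import random

def split_corpus(synthetic_docs, real_docs, n_train=200, n_dev=50, seed=0):
    """Split synthetic documents into train/dev and keep real documents as test."""
    if len(synthetic_docs) != n_train + n_dev:
        raise ValueError("expected exactly n_train + n_dev synthetic documents")
    shuffled = list(synthetic_docs)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:],
        "test": list(real_docs),  # the two real reports, never seen in training
    }

# Placeholder identifiers standing in for the actual documents.
synthetic = [f"synthetic_{i:03d}" for i in range(250)]
real = ["real_1", "real_2"]
splits = split_corpus(synthetic, real)
```

Keeping the real documents exclusively in the test set means the reported score measures how well a model trained only on synthetic text transfers to genuine reports.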

Results: The corpus contains 250 documents, 237k tokens, and 32k entities, 95% of which are diagnoses. Out of 3193 tokens in the original documents, 1433 (49%) were replaced. The evaluation of the NER model yields a micro-average F1 score of 31.71%. The diagnoses and causes of death in the corpus are more clinical and less descriptive than the findings in autopsy reports. No patient-linkable information was found.
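For readers unfamiliar with the metric, the following sketch shows how a micro-average F1 score is computed from entity predictions: true positives, false positives, and false negatives are pooled over all documents and labels before precision and recall are taken. The exact-match criterion on (start, end, label) triples and the toy data are assumptions for illustration; they are not the abstract's evaluation data.

```python
def micro_f1(gold_docs, pred_docs):
    """Micro-average F1 over documents; entities are (start, end, label) triples."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # exact span and label match
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one correct entity, one wrong label, one wrong span.
gold = [{(0, 10, "Date of Death"), (20, 30, "Diagnosis"), (40, 50, "Diagnosis")}]
pred = [{(0, 10, "Date of Death"), (20, 30, "IA"), (60, 70, "Diagnosis")}]
score = micro_f1(gold, pred)  # TP=1, FP=2, FN=2
```

Because counts are pooled, frequent labels dominate the score; in a corpus where 95% of entities are diagnoses, the micro-average largely reflects performance on that label.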

Discussion: The evaluation reveals a tradeoff between quality and privacy. With 51% of tokens unchanged and qualitatively worse substitute data, the corpus lacks the linguistic heterogeneity of genuine corpora. More research is required to improve these results, e.g., by utilizing a German version of SNOMED CT.

Funding: This work was supported by the German Registry of COVID-19 Autopsies, funded by the Federal Ministry of Health (ZMVI1-2520COR201), and by the Federal Ministry of Education and Research within the framework of the network of university medicine (DEFEAT PANDEMICs, 01KX2021).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


[1] Boag W, Naumann T, Szolovits P. Towards the Creation of a Large Corpus of Synthetically-Identified Clinical Notes [Preprint]. ArXiv. 2018 Mar 7. arXiv:1803.02728v1. DOI: 10.48550/arXiv.1803.02728
[2] Lohr C, Eder E, Hahn U. Pseudonymization of PHI Items in German Clinical Reports. Stud Health Technol Inform. 2021;281:273–7.
[3] US Department of Health & Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. 1.4 The De-identification Standard.
[4] Kittner M, Lamping M, Rieke DT, Götze J, Bajwa B, Jelas I, et al. Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA Open. 2021;4(2):ooab025.
[5] Sweeney L. Computational disclosure control: a primer on data privacy protection [Thesis]. Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001.
[6] Sweeney L. Weaving technology and policy together to maintain confidentiality. J Law Med Ethics. 1997;25(2-3):98-110. DOI: 10.1111/j.1748-720X.1997.tb01885.x