64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

08. - 11.09.2019, Dortmund

Towards an Annotated Data Set for De-Identification of German Discharge Letters

Meeting Abstract

  • Jonathan Krebs - University of Würzburg, Würzburg, Germany
  • Georg Fette - University Hospital of Würzburg, Würzburg, Germany
  • Frank Puppe - University of Würzburg, Würzburg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 238

doi: 10.3205/19gmds105, urn:nbn:de:0183-19gmds1055

Published: September 6, 2019

© 2019 Krebs et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at



Introduction: To use unstructured medical texts in clinical research, all personal patient data (patient health information, PHI) must first be removed. Making the quality of anonymization tools comparable requires a standardized, publicly available data set. To the best of the authors' knowledge, no such data set currently exists for German-language documents. This work reports progress on preparing such a data set, following the procedure used for English-language documents in [1].

Label Set: Published results indicate that a multi-class label set is not necessary for de-identifying clinical records [2]; the task can instead be treated as binary classification (PHI versus non-PHI). A purely binary classification would suffice for producing the de-identified texts, but a finer label set allows a more accurate error analysis, since one can determine which PHI classes cause the most problems. The authors therefore use a label set that is smaller than the one in [1] but still covers all important PHI classes. For example, "first name" and "last name" were merged into "name". Likewise, address data are uniformly labelled "address" and not subdivided into "street" or "city". The resulting classes are ADDRESS, PHONE, ORGANISATION, NAME, DOCTOR, JOB, DATE, TIME, EMAIL, WEB, ID.
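The relationship between fine-grained, coarse, and binary labels can be illustrated with a minimal Python sketch. The coarse classes are those listed above; the fine-grained label names (FIRST_NAME, LAST_NAME, STREET, CITY), the "O" marker for non-PHI tokens, and all function names are assumptions made for illustration and do not describe the authors' tooling.

```python
from collections import Counter

# Coarse label set as listed in the abstract.
COARSE_LABELS = {
    "ADDRESS", "PHONE", "ORGANISATION", "NAME", "DOCTOR",
    "JOB", "DATE", "TIME", "EMAIL", "WEB", "ID",
}

# Hypothetical fine-grained labels; only the merges named in the text.
FINE_TO_COARSE = {
    "FIRST_NAME": "NAME",
    "LAST_NAME": "NAME",
    "STREET": "ADDRESS",
    "CITY": "ADDRESS",
}


def to_coarse(label):
    """Map a fine-grained label to the coarse set; "O" marks non-PHI tokens."""
    if label == "O":
        return "O"
    coarse = FINE_TO_COARSE.get(label, label)
    if coarse not in COARSE_LABELS:
        raise ValueError(f"unknown label: {label}")
    return coarse


def to_binary(label):
    """Collapse any PHI label to the single class "PHI"."""
    return "O" if label == "O" else "PHI"


def missed_per_class(gold, predicted):
    """Count PHI tokens that a binary prediction missed, per coarse gold class."""
    missed = Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if to_binary(gold_label) == "PHI" and to_binary(predicted_label) == "O":
            missed[to_coarse(gold_label)] += 1
    return missed
```

Even when a system only predicts PHI versus non-PHI, keeping the finer gold labels makes it possible to see which class the missed tokens belong to, which is the error analysis the label set is meant to support.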

Corpus: The corpus consists of randomly sampled documents from all departments of the University Hospital of Würzburg, all created between 2000 and 2018. So far, 400 documents have been labelled. To make the data set comparable to the English one, we plan to label another 600 documents. It should then be possible to determine whether state-of-the-art methods for English texts [3] also perform well on German texts.

Future Work: Several critical steps remain before the final data set can be released. First, all labelled PHIs must be replaced by artificial names, addresses, etc., so that no real patient can be identified from the data set. This must be done carefully to keep replacements consistent within each document. Specific properties of the original PHIs must also be preserved in the replacements so that the de-identification task stays as hard as on the original documents; it is therefore not possible to simply replace all names with default values. For example, if an original name contains a typo, the replacement should contain one too, so that misspelled names remain as hard to identify as in the original.

Finally, the copyrights to the documents must be clarified with the University Hospital and the Ethics Council must give a positive vote.
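The replacement step described under Future Work can be sketched as follows. This is only an illustration under stated assumptions: the surrogate pools, the `canonical` argument (the correctly spelled form of a misspelled PHI, which an annotator would have to supply), and the simple character-swap typo are all hypothetical, and the abstract does not specify how the authors implement this step.

```python
import random

# Hypothetical surrogate pools; a real release would need far larger lists.
SURROGATES = {
    "NAME": ["Müller", "Schmidt", "Weber"],
    "ADDRESS": ["Musterstraße 1", "Beispielweg 7"],
}


def swap_typo(text):
    """Inject a simple typo by swapping the second and third characters.

    Strings shorter than three characters are returned unchanged.
    """
    if len(text) < 3:
        return text
    return text[0] + text[2] + text[1] + text[3:]


class DocumentReplacer:
    """Replace PHI within one document, keeping replacements consistent."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._mapping = {}  # (label, canonical form) -> surrogate

    def replace(self, label, surface, canonical=None):
        # Without a supplied canonical form, the surface is assumed correct.
        canonical = canonical or surface
        key = (label, canonical)
        if key not in self._mapping:
            used = set(self._mapping.values())
            pool = SURROGATES[label]
            # Prefer unused surrogates so distinct originals stay distinct.
            candidates = [s for s in pool if s not in used] or pool
            self._mapping[key] = self._rng.choice(candidates)
        surrogate = self._mapping[key]
        # A misspelled original yields a misspelled surrogate.
        if surface != canonical:
            surrogate = swap_typo(surrogate)
        return surrogate
```

One replacer instance would be created per document, so that the same person always receives the same artificial name within that document while different documents are mapped independently.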

Conclusion: The authors are working on a German data set for the de-identification of discharge letters and plan to publish it, so that the German medical research community can catch up with its English-language counterpart in this area.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


[1] Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics. 2015;58 Suppl:S20-9.
[2] Bui DDA, Redden DT, Cimino JJ. Is Multiclass Automatic Text De-Identification Worth the Effort? Methods of Information in Medicine. 2018;57(04):177-84.
[3] Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics. 2015;58 Suppl:S11-9.