Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

De-Identifying GRASCCO – a Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus

Meeting Abstract

  • Christina Lohr - IMISE, Universität Leipzig, Leipzig, Germany
  • Franz Matthies - IMISE, Universität Leipzig, Leipzig, Germany
  • Jakob Faller - Medical Center for Information and Communication Technology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  • Luise Modersohn - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
  • Andrea Riedel - Medical Center for Information and Communication Technology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  • Udo Hahn - IMISE, Universität Leipzig, Leipzig, Germany
  • Rebekka Kiser - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
  • Martin Boeker - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
  • Frank A. Meineke - IMISE, Universität Leipzig, Leipzig, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 925

doi: 10.3205/24gmds080, urn:nbn:de:0183-24gmds0803

Published: September 6, 2024

© 2024 Lohr et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. Here, we introduce the architecture of the GeMTeX de-identification pipeline.

Methods: The pipeline comprises five steps: the export of raw clinical documents from the local hospital information system, their import into the annotation platform INCEpTION, fully automatic pre-tagging of protected health information (PHI) items by the Averbis Health Discovery pipeline, manual curation of these pre-annotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators at each of the Data Integration Centers of the University Hospitals Leipzig and Erlangen.
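The final pipeline step, replacing PHI items with type-conformant substitutes, can be illustrated with a minimal Python sketch. It is not the GeMTeX implementation: the `PhiSpan` class, the `replace_phi` function, the PHI type names, and the surrogate table are all hypothetical and chosen only for illustration. The sketch assumes that curated PHI items are available as non-overlapping character spans with a type label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiSpan:
    """A curated PHI annotation (hypothetical structure): character offsets plus type."""
    start: int
    end: int
    phi_type: str


# Hypothetical surrogate pools; a real system would draw on much larger resources.
SURROGATES = {
    "NAME": ["Erika Mustermann", "Max Beispiel"],
    "DATE": ["01.01.2000"],
    "LOCATION": ["Musterstadt"],
}


def replace_phi(text, spans, surrogates=SURROGATES):
    """Replace each PHI span with a substitute of the same PHI type.

    The same original string of the same type always maps to the same
    substitute within one document, so repeated mentions stay consistent.
    Types without a surrogate pool fall back to a bracketed placeholder.
    """
    pieces = []      # output fragments, assembled left to right
    cursor = 0       # position in the original text copied so far
    assigned = {}    # (phi_type, original string) -> chosen substitute
    counters = {}    # phi_type -> number of substitutes handed out so far
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping PHI spans are not supported")
        original = text[span.start:span.end]
        key = (span.phi_type, original)
        if key not in assigned:
            pool = surrogates.get(span.phi_type)
            if pool:
                index = counters.get(span.phi_type, 0)
                assigned[key] = pool[index % len(pool)]
                counters[span.phi_type] = index + 1
            else:
                assigned[key] = f"[{span.phi_type}]"
        pieces.append(text[cursor:span.start])
        pieces.append(assigned[key])
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Assembling the output left to right avoids having to shift offsets after each replacement, and the per-document mapping keeps repeated mentions of the same item coherent. A production system would additionally need to handle aspects such as date shifting or gender-consistent names, which this sketch leaves out.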

Results: As a proof of concept, the publicly available Graz Synthetic Clinical text Corpus (GraSCCo) was enriched with PHI annotations in an annotation campaign, for which a high inter-annotator agreement of Krippendorff's α ≈ 0.97 can be reported.
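For readers unfamiliar with the agreement measure, the following Python sketch computes Krippendorff's α for nominal labels from a coincidence matrix. The abstract does not state the unit of comparison (tokens, spans, or documents) or the tooling used, so the function name `krippendorff_alpha_nominal` and the input layout (one tuple of annotator labels per unit, with `None` for a missing label) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is an iterable of label tuples, one tuple per annotated unit and
    one entry per annotator; `None` marks a missing label. Units with fewer
    than two labels carry no agreement information and are skipped.
    """
    # Coincidence matrix: each ordered pair of labels from two different
    # annotators within a unit contributes 1 / (m - 1), where m is the
    # number of labels given to that unit.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Sums over mismatching category pairs: observed coincidences and the
    # product of marginals (the expected disagreement up to a common factor).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")

    # alpha = 1 - D_o / D_e, with the common factors of n cancelled out.
    return 1.0 - (n - 1) * observed / expected
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement.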

Conclusion: These curated 1.4K PHI annotations are released as open-source data and constitute the first publicly available German-language clinical text corpus with PHI metadata.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.