gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

8-13 September 2024, Dresden

Overview of German Clinical Text Corpora for Large Language Models – Scoping Review

Meeting Abstract

  • Simone Melnik - Institute of Medical Informatics, University of Münster, Münster, Germany; Clinic for Otorhinolaryngology, Münster University Hospital, Münster, Germany
  • Tobias Brix - Institute of Medical Informatics, University of Münster, Münster, Germany
  • Michael Storck - Institute of Medical Informatics, University of Münster, Münster, Germany
  • Sarah Riepenhausen - Institute of Medical Informatics, University of Münster, Münster, Germany
  • Julian Varghese - Institute of Medical Informatics, University of Münster, Münster, Germany
  • Claudia Rudack - Clinic for Otorhinolaryngology, Münster University Hospital, Münster, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 743

doi: 10.3205/24gmds079, urn:nbn:de:0183-24gmds0798

Published: 6 September 2024

© 2024 Melnik et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: With growing demand for domain-specific large language models (LLMs), the need for suitable datasets to train such models is also increasing. Medical LLMs offer a wide range of applications that can support patient-centered work and improve the efficiency of routine clinical practice. In a literature review following the PRISMA guidelines [1], we searched for publicly available German-language medical text datasets and compared them in terms of content. This work is intended to provide an overview of the current status of publicly available datasets.

Methods: The literature search for the datasets was conducted in mid-April 2024 according to the PRISMA guidelines [1] and covers all articles published up to that time. On Google Scholar (https://scholar.google.com/) and PubMed (https://pubmed.ncbi.nlm.nih.gov/), we searched for the terms “german clinical corpus”, “german medical corpus”, “german clinical dataset” and “german medical dataset” in the paper title. The arXiv (https://www.arxiv.org/) search was performed without title restrictions. After removing duplicates, we checked the inclusion criteria: only German-language clinical/medical corpora that contain text files and are available for sharing were included. Based on the search results, a forward and backward search was then performed.
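The search and de-duplication steps described above could be sketched roughly as follows. This is only an illustration, not the authors' actual procedure: the abstract does not state the exact query syntax, so the `allintitle:` operator for Google Scholar and the `[Title]` field tag for PubMed are assumptions about how a title restriction might be expressed, and the helper names `build_queries`, `normalize_title` and `remove_duplicates` are hypothetical.

```python
import re

# The four search terms named in the Methods paragraph.
SEARCH_TERMS = [
    "german clinical corpus",
    "german medical corpus",
    "german clinical dataset",
    "german medical dataset",
]


def build_queries(terms):
    """Return one query string per source for each search term."""
    queries = []
    for term in terms:
        queries.append({
            "term": term,
            # Assumption: title restriction via Google Scholar's allintitle operator.
            "google_scholar": f'allintitle: "{term}"',
            # Assumption: title restriction via PubMed's [Title] field tag.
            "pubmed": f'"{term}"[Title]',
            # arXiv was searched without a title restriction.
            "arxiv": f'"{term}"',
        })
    return queries


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", title.lower()).strip()


def remove_duplicates(records):
    """Keep the first record seen for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Matching on normalized titles is a deliberately simple choice; a real screening workflow would more likely compare DOIs or use a reference manager, and the inclusion criteria would still have to be checked manually.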

Results: The initial search yielded 59 articles. After removing 20 duplicates, a further 33 articles were excluded because they failed at least one of the inclusion criteria. The PRISMA workflow [1] thus yielded six corpora, and reviewing the references added one more. The results were grouped into corpora with real, de-identified data and corpora with artificially generated synthetic data. Real data can be found in two main domains; cardiological data, for example, are available in the CARDIO:DE corpus from Heidelberg [2]. Furthermore, with GERNERMED [3] and GPTNERMED [4], Frei and Kramer have provided a real translated dataset and a synthesized dataset created by LLMs.
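The counts reported above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `prisma_flow` and the dictionary keys are hypothetical and chosen only for illustration, while the numbers are those given in the Results.

```python
def prisma_flow(identified, duplicates, excluded, added_from_references):
    """Compute the record counts at each stage of a simple PRISMA-style flow."""
    screened = identified - duplicates          # records left after de-duplication
    included_from_search = screened - excluded  # records meeting all inclusion criteria
    total_included = included_from_search + added_from_references
    return {
        "screened": screened,
        "included_from_search": included_from_search,
        "total_included": total_included,
    }


# Numbers from the Results: 59 identified, 20 duplicates, 33 excluded,
# plus 1 corpus found by reviewing references.
flow = prisma_flow(identified=59, duplicates=20, excluded=33, added_from_references=1)
print(flow)
# {'screened': 39, 'included_from_search': 6, 'total_included': 7}
```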

Discussion: Altering real data to produce synthetic data can cause problems during analysis; for example, standardized values may no longer be available. Corpora based on real German-language clinical data provide authentic data, but require more work because of de-identification [2]. The yet-to-be-published GeMTeX corpus [5], announced as the largest German-language text corpus with real annotated clinical data, would increase the diversity of the available domain data.

Methodological limitations of this study include the potential oversight of relevant datasets, both because the search terms were specific and because the Google Scholar and PubMed searches required the terms to appear in the title. Accordingly, one corpus was found only by back-referencing from another paper.

Conclusion: The literature review identified seven available German-language clinical text corpora. They can be divided into synthetic and real datasets. Synthetic datasets have the advantage that they are often easier to access, but have the disadvantage that they may not always accurately reflect clinical reality and that re-identification cannot be ruled out. Corpora with real clinical data require more effort to create, but reflect the complex reality of clinical documents and are therefore better suited to training LLMs for use in clinical practice.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.
2. Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Allers MM, et al. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters. Sci Data. 2023;10(1):207.
3. Frei J, Kramer F. GERNERMED - An Open German Medical NER Model [Preprint]. arXiv. 2021 Sep 24. DOI: 10.48550/arXiv.2109.12104
4. Frei J, Kramer F. Annotated dataset creation through large language models for non-english medical NLP. J Biomed Inform. 2023;145:104478.
5. Meineke F, Modersohn L, Loeffler M, Boeker M. Announcement of the German Medical Text Corpus Project (GeMTeX). Stud Health Technol Inform. 2023;302:835–6.