Overview of German Clinical Text Corpora for Large Language Models – Scoping Review
Published: 6 September 2024
Introduction: With the growing demand for domain-specific large language models (LLMs), the need for suitable datasets to train such models is also increasing. Medical LLMs offer a wide range of applications to enhance patient-centered work and improve the efficiency of routine clinical practice. In a literature review following the PRISMA guidelines [1], we searched for publicly available German-language medical text datasets and compared them in terms of content. This work is intended to provide an overview of the current status of publicly available datasets.
Methods: The literature search for the datasets, conducted in mid-April 2024 according to PRISMA guidelines [1], includes all articles published up to that time. The web search was carried out using the terms “german clinical corpus”, “german medical corpus”, “german clinical dataset” and “german medical dataset” in the paper title on Google Scholar (https://scholar.google.com/) and PubMed (https://pubmed.ncbi.nlm.nih.gov/). The arXiv (https://www.arxiv.org/) search was performed without title restrictions. After removing duplicates, the inclusion criteria were checked. We only included German-language clinical/medical corpora that contained text files and were available for sharing. Based on the search results, a forward and backward search was added.
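The screening steps described in the Methods (duplicate removal followed by the inclusion check) can be sketched in Python. This is only an illustration under stated assumptions: the `Record` fields, the title-based duplicate key, and the sample records are hypothetical and are not taken from the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One search hit. All fields are hypothetical, for illustration only."""
    title: str
    source: str           # e.g. "PubMed", "Google Scholar", "arXiv"
    is_german: bool       # German-language corpus
    is_clinical: bool     # clinical/medical content
    has_text_files: bool  # corpus contains text files
    is_shareable: bool    # corpus is available for sharing


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def deduplicate(records):
    """Keep the first record for each normalized title (assumed duplicate key)."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record.title)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def meets_inclusion_criteria(record: Record) -> bool:
    """A record is included only if every criterion holds."""
    return (record.is_german and record.is_clinical
            and record.has_text_files and record.is_shareable)


def screen(records):
    """Remove duplicates first, then apply the inclusion criteria."""
    return [r for r in deduplicate(records) if meets_inclusion_criteria(r)]


# Made-up sample hits, not results from the review.
sample = [
    Record("Example German Clinical Corpus A", "PubMed", True, True, True, True),
    Record("example german clinical corpus a ", "Google Scholar", True, True, True, True),
    Record("Example German Medical Dataset B", "arXiv", True, True, True, False),
]
included = screen(sample)
print([r.title for r in included])
```

In this made-up sample, the second record is dropped as a duplicate of the first and the third fails the sharing criterion, so one record remains.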
Results: The initial search yielded 59 articles. After removing 20 duplicates, a further 33 articles were excluded because they failed to meet at least one of the inclusion criteria. Thus, the PRISMA workflow [1] led to a total of six different corpora, and one additional corpus was identified by reviewing the references. The results were grouped into corpora with real, de-identified data and corpora with artificially generated synthetic data. Real data are available in two main domains; cardiological data, for example, can be found in the CARDIO:DE corpus from Heidelberg [2]. Furthermore, with GERNERMED [3] and GPTNERMED [4], Frei and Kramer have provided a translated real dataset and a synthetic dataset created by LLMs.
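The counts reported in the Results can be checked with a short calculation. The variable names are illustrative; the numbers are those stated in the text.

```python
# PRISMA flow counts as reported in the Results
identified = 59          # articles from the initial search
duplicates_removed = 20  # duplicates removed before screening
excluded = 33            # articles failing at least one inclusion criterion
from_references = 1      # corpus found by reviewing the references

screened = identified - duplicates_removed             # articles screened
included_via_search = screened - excluded              # corpora from the search
total_corpora = included_via_search + from_references  # all corpora

print(screened, included_via_search, total_corpora)
```

The result agrees with the six corpora from the search and the seven corpora in total reported in the review.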
Discussion: Altering real data to produce synthetic data can cause problems during analysis; for example, standardized values are no longer available. Corpora based on real German-language clinical data provide authentic data, but require more work due to de-identification [2]. The yet-to-be-published GeMTeX corpus [5], announced as the largest German-language text corpus with real annotated clinical data, would increase the diversity of the available domain data.
Methodological limitations of this study include the potential oversight of relevant datasets due to the specificity of the search terms and the restriction of Google Scholar and PubMed results to titles containing the search terms. Accordingly, one corpus was found only by back-referencing another paper.
Conclusion: The literature review identified seven available German-language clinical text corpora. They can be divided into synthetic and real datasets. Synthetic datasets are often easier to access, but they may not always accurately reflect clinical reality, and re-identification cannot be ruled out. Corpora with real clinical data require more effort to create, but they reflect the complex reality of clinical documents and are therefore better suited to training LLMs for use in practice.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.
2. Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Allers MM, et al. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters. Sci Data. 2023;10(1):207.
3. Frei J, Kramer F. GERNERMED - An Open German Medical NER Model [Preprint]. arXiv. 2021 Sep 24. DOI: 10.48550/arXiv.2109.12104
4. Frei J, Kramer F. Annotated dataset creation through large language models for non-english medical NLP. J Biomed Inform. 2023;145:104478.
5. Meineke F, Modersohn L, Loeffler M, Boeker M. Announcement of the German Medical Text Corpus Project (GeMTeX). Stud Health Technol Inform. 2023;302:835–6.