gms | German Medical Science

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH)

08.09. - 13.09.2024, Dresden

DE-NERmed: A Named Entity Recognition Model for the Detection of German Medical Entities

Meeting Abstract

  • Martin Wiesner - Hochschule Heilbronn, Heilbronn, Germany

Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 979

doi: 10.3205/24gmds182, urn:nbn:de:0183-24gmds1824

Published: September 6, 2024

© 2024 Wiesner.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: When processing clinical text, often captured via semi- or unstructured free text fields, Named Entity Recognition (NER, [1]) is essential for the realization of Information Extraction (IE) or Information Retrieval (IR) systems [2]. Nouns and named entities (NE) often act as the most relevant terms for matching, for instance in a database or against an index.

Frei and Kramer presented GERNERMED [3], a model for detecting specific medical NEs such as (prescribed) drugs, strength, route, form, dosage, frequency, and duration. However, to the best of the author’s knowledge, no broad, non-proprietary NER model has been published that can also detect anatomical (body) structures, symptoms, and diseases as covered by the UMLS [4].

This study demonstrates the feasibility of training an open, medical NER model and evaluates it against real-world, clinical text material.

Methods: For the training phase, a synthetic text corpus was compiled, comprising close to 88,000 health-related Wikipedia articles and their full texts, dated July 2023. The corpus contained ~2.4 million sentences. Next, nouns and noun phrases in the training corpus were automatically tagged against a UMLS concept list, which included ~56,600 medical NEs translated to German. Technically, the training was conducted with the open-source framework Apache OpenNLP [5], version 2.3.3. For training a Maximum Entropy NER model, the parameters were chosen as follows:

training.algorithm=maxent;training.iterations=150;training.cutoff=3;training.threads=8;language=de;use.token.end=false

The resulting model file was then persisted for re-use in NLP applications to detect named (medical) entities.
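The automatic tagging step described above can be illustrated with a minimal sketch. It assumes a greedy longest-match lookup of token sequences in a concept dictionary and emits one line in the `<START:type> ... <END>` markup that the OpenNLP name finder reads as training data. The entity labels (`disease`, `symptom`, `anatomy`), the three-entry concept list, and the function name `tag_sentence` are hypothetical; the abstract does not state which labels or matching strategy were used, and the restriction to nouns and noun phrases (which would need a part-of-speech step) is left out here.

```python
from typing import Dict, List, Tuple


def tag_sentence(tokens: List[str],
                 concepts: Dict[Tuple[str, ...], str],
                 max_len: int = 4) -> str:
    """Tag a tokenized sentence by greedy longest-match dictionary lookup.

    Returns one line in the OpenNLP name finder training markup.
    Matching is case-insensitive (an assumption of this sketch).
    """
    out: List[str] = []
    i = 0
    while i < len(tokens):
        match_len, label = 0, None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in concepts:
                match_len, label = n, concepts[key]
                break
        if match_len:
            phrase = " ".join(tokens[i:i + match_len])
            out.append(f"<START:{label}> {phrase} <END>")
            i += match_len
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical mini concept list; the real list held ~56,600 entries.
CONCEPTS = {
    ("herzinsuffizienz",): "disease",
    ("linker", "ventrikel"): "anatomy",
    ("dyspnoe",): "symptom",
}

tokens = "Der Patient zeigt Dyspnoe bei bekannter Herzinsuffizienz .".split()
print(tag_sentence(tokens, CONCEPTS))
# Der Patient zeigt <START:symptom> Dyspnoe <END> bei bekannter <START:disease> Herzinsuffizienz <END> .
```

Longest-match-first avoids tagging a single word inside a longer known phrase, at the cost of missing inflected forms that are not in the list.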

For the performance evaluation of the DE-NERmed model, n=101 text fragments were randomly selected from discharge letters originally created in the Chest Pain Unit at Heidelberg University Hospital. For inclusion, a text fragment had to consist of at least 20 tokens. After preparation of the evaluation corpus, both the F1 score and Accuracy were computed.
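The inclusion criterion and random selection could be sketched as follows. Whitespace tokenization, the fixed seed, and the names `eligible` and `draw_sample` are assumptions made for illustration; the abstract does not describe the tokenizer or the sampling procedure in detail.

```python
import random
from typing import List

MIN_TOKENS = 20    # inclusion criterion stated in the abstract
SAMPLE_SIZE = 101  # number of evaluated fragments stated in the abstract


def eligible(fragment: str, min_tokens: int = MIN_TOKENS) -> bool:
    """A fragment qualifies if it has at least `min_tokens` tokens.

    Tokens are split on whitespace here, which is an assumption.
    """
    return len(fragment.split()) >= min_tokens


def draw_sample(fragments: List[str],
                k: int = SAMPLE_SIZE,
                seed: int = 42) -> List[str]:
    """Randomly draw k eligible fragments without replacement."""
    pool = [f for f in fragments if eligible(f)]
    if len(pool) < k:
        raise ValueError(f"only {len(pool)} eligible fragments, need {k}")
    # A seeded generator makes the selection reproducible.
    return random.Random(seed).sample(pool, k)


# Placeholder fragments of 1 to 199 tokens stand in for clinical text.
fragments = [" ".join(["tok"] * n) for n in range(1, 200)]
sample = draw_sample(fragments)
print(len(sample))  # 101
```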

Results: Training of the 'DE-NERmed-Wiki_2023-maxent' model was carried out on June 25, 2024, in a cluster environment at the Faculty of Computer Science at Heilbronn University. The binary model file requires ~5 GB of RAM at runtime.

The model achieved an F1 score of 0.8761 and an Accuracy of 0.8922 (TP=905; TN=1214; FP=65; FN=191). It detected most of the relevant medical NEs associated with cardiology and the general medical domain. Misclassifications occurred primarily for NEs that are common to both general and medical language.
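Both reported scores follow directly from the confusion counts given above. The short sketch below recomputes them using the standard definitions of F1 (harmonic mean of precision and recall) and Accuracy; the function names are illustrative and not taken from the published demo code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correct decisions among all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Confusion counts as reported in the abstract.
TP, TN, FP, FN = 905, 1214, 65, 191

print(round(f1_score(TP, FP, FN), 4))      # 0.8761
print(round(accuracy(TP, TN, FP, FN), 4))  # 0.8922
```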

The DE-NERmed model, demo code with text examples, and accompanying data are available at: https://github.com/mawiesne/DE-NERmed

Discussion & conclusion: Because of the volume of the training corpus, the topical variety, the linguistic properties and annotations, and the recency of the Wikipedia material could not be controlled in this study.

The evaluation sample was restricted to 101 text fragments. Nonetheless, these were sampled from unstructured, real-world clinical text. The construction of a more extensive NER evaluation corpus offers potential for future research.

The technical feasibility of training and practical use of an NER model for the German (bio)medical domain was demonstrated. The pre-annotated training corpus and the binary DE-NERmed model file are contributed for scientific comparisons, or for practical use in NLP software pipelines.

Competing interests: Martin Wiesner is committer and member of the Project Management Committee (PMC) in the OpenNLP project of the Apache Software Foundation.

The authors declare that an ethics committee vote is not required.


References

1.
Li J, Sun A, Han J, Li C. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering. 2022;34(1):50–70. DOI: 10.1109/TKDE.2020.2981314
2.
Bay M, Bruneß D, Herold M, Schulze C, Guckert M, Minor M. Term Extraction from Medical Documents Using Word Embeddings. In: Proceedings of the 6th IEEE Congress on Information Science and Technology. 2020. p. 328–333.
3.
Frei J, Kramer F. German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation. JMIR Form Res. 2023;7:e39077. DOI: 10.2196/39077
4.
Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267-70. DOI: 10.1093/nar/gkh061
5.
Apache Software Foundation. Apache OpenNLP [Internet]. 2024 [cited 2024 Apr 25]. Available from: https://opennlp.apache.org