gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

Assisting expert semantic coding of intensive care metadata with state-of-the-art transformer based large language models

Meeting Abstract

  • Niko Möller-Grell - Institut für Medizinische Informatik, Universität Heidelberg, Heidelberg, Germany
  • Matthias Ganzinger - Institut für Medizinische Informatik, Universität Heidelberg, Heidelberg, Germany
  • Martin Dugas - Institut für Medizinische Informatik, Universität Heidelberg, Heidelberg, Germany
  • Christian Niklas - Institut für Medizinische Informatik, Universitätsklinikum Heidelberg, Heidelberg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 186

doi: 10.3205/23gmds123, urn:nbn:de:0183-23gmds1233

Published: 15 September 2023

© 2023 Möller-Grell et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Structured semantic annotation of electronic health records (EHR) improves interoperability in patient care and research. The Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM) XML standard allows syntactic structuring and semantic annotation of metadata. Whereas other standards such as Fast Healthcare Interoperability Resources (FHIR) play an important role in clinical care, CDISC ODM is widely used in clinical research [1]. EHR data from intensive care unit (ICU) settings typically include large parameter sets to capture the characteristics of complex diseases. Manual annotation with controlled vocabularies is difficult and time-consuming. Transformer-based natural language processing (NLP) models are promising for automating semantic annotation because they represent contextual medical knowledge within the model.

State of the art: Current approaches to EHR annotation include manual, semi-automated or rule-based methods, and transformer-based models for free-text annotation. A semi-automated approach for the re-use of annotations [2] was described in the context of a pragmatic metadata repository (MDR). The pragmatic MDR divides ODMs into atomic elements, aggregates similar types of elements, and allows frequency-based search for Unified Medical Language System (UMLS) concepts. The transformer model Medical Concept Annotation Toolkit (MedCAT) provides an automated annotation approach trained on large amounts of annotated medical free text [3].
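The frequency-based re-use of annotations can be illustrated with a minimal sketch. It is not the pragmatic MDR's actual algorithm: the function names, the normalisation by lower-casing, and the placeholder concept codes (CUI-A, CUI-B) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def build_index(annotated_items):
    """Count how often each concept code was assigned to each (normalised) item name."""
    index = defaultdict(Counter)
    for name, concept in annotated_items:
        index[name.strip().lower()][concept] += 1
    return index

def suggest(index, name, top_n=3):
    """Return the most frequently used concept codes for an item name, most frequent first."""
    counts = index.get(name.strip().lower(), Counter())
    return counts.most_common(top_n)

# Hypothetical, previously annotated items: (item name, placeholder concept code)
history = [
    ("Heart rate", "CUI-A"),
    ("heart rate", "CUI-A"),
    ("Heart Rate", "CUI-B"),
]
index = build_index(history)
suggestions = suggest(index, "HEART RATE")
```

Ranking candidates by how often they were chosen before keeps the expert in the loop: the tool only proposes codes, and the annotator confirms or rejects them.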

Concept: We will convert unstructured parameter sets from the Amsterdam University Medical Center Database, Heidelberg University Hospital, and the HiRID dataset of Bern University Hospital and the Swiss Federal Institute of Technology (ETH) [4], as well as the ICU parameters of MIMIC-III, from comma-separated values format into structured CDISC ODM, and harmonise languages into English through a transformer-based approach.

Semantic annotation will be performed using the rule-based annotation of the pragmatic MDR [2], large transformer models re-trained on named entity recognition and linking (NER+L), and a subset annotated by medical practitioners as a gold standard.

Re-training of the transformers will be evaluated in a standardised manner using precision, recall and F1, metrics commonly used in the evaluation of NLP tasks.
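As a minimal sketch of these metrics, the following compares predicted annotations against a gold standard, treating each (item, concept) pair as one decision. The item names and placeholder concept codes are hypothetical, and the micro-averaged set comparison is one possible evaluation setup, not necessarily the one used in the study.

```python
def precision_recall_f1(predicted, gold):
    """Micro-averaged precision, recall and F1 over sets of (item_id, concept_code) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold standard and model output with placeholder concept codes
gold = {("heart_rate", "CUI-A"), ("body_temp", "CUI-B"), ("resp_rate", "CUI-C")}
predicted = {("heart_rate", "CUI-A"), ("body_temp", "CUI-B"), ("resp_rate", "CUI-D")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions match the gold standard, so precision, recall and F1 all equal 2/3.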

To compare entity linking (EL) performance, Krippendorff's alpha will be used as an agreement measure that takes semantic distances between the linked concepts into account.
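Krippendorff's alpha is defined as 1 - D_o / D_e, the ratio of observed to expected disagreement, and accepts a pluggable distance function. The sketch below implements the coincidence-matrix formulation with a nominal distance as default. How a semantic distance between concepts would be defined (for example, from the terminology hierarchy) is not stated in the abstract, so the `distance` parameter is an assumption left open here.

```python
from collections import Counter
from itertools import permutations

def nominal_distance(a, b):
    """Squared difference for nominal data: 0 if the codes match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def krippendorff_alpha(units, distance=nominal_distance):
    """Compute Krippendorff's alpha.

    units: one list per annotated item, holding the codes assigned by the
    annotators (missing annotations are simply left out).
    distance: function returning the squared difference between two codes.
    """
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # items with fewer than two annotations cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w * distance(a, b) for (a, b), w in coincidences.items())
    expected = sum(totals[a] * totals[b] * distance(a, b)
                   for a in totals for b in totals)
    if expected == 0:
        return float("nan")  # undefined when only one code is ever used
    return 1.0 - (n - 1) * observed / expected

# Two annotators, placeholder codes: full agreement and full disagreement
alpha_agree = krippendorff_alpha([["A", "A"], ["B", "B"]])
alpha_disagree = krippendorff_alpha([["A", "B"], ["B", "A"]])
```

Passing a graded distance instead of `nominal_distance` would penalise a near miss less than a link to an unrelated concept.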

Implementation: We developed a Java-based conversion tool that generates documents conforming to the syntactic schema of the CDISC ODM standard from semi-structured data in list form. The transformer models MedCAT [3] and GPT Neo-XT base 20B will be re-trained on data from a large, fully annotated MDR and data from the Medical Information Mart for Intensive Care (MIMIC)-III database [5]. To ensure a meaningful comparison between the different sites and annotation approaches, core data elements for a clinical use case will be created. A subset of this core dataset will be manually annotated by medical practitioners to create a benchmark. After re-training, independent medical practitioners will evaluate annotation performance in a structured manner. Preliminary results of a pilot study of this evaluation will be presented at the conference.
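The conversion step can be sketched as follows. This is not the authors' Java tool: it is a simplified Python illustration that emits an ODM-like skeleton, not a schema-valid file. The column names (name, unit, datatype, concept), all OIDs, the Alias context string "UMLS CUI" and the placeholder concept code are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang

def parameters_to_odm(csv_text):
    """Build a minimal ODM-like metadata tree from 'name,unit,datatype,concept' rows."""
    odm = ET.Element("ODM", {
        "xmlns": "http://www.cdisc.org/ns/odm/v1.3",
        "FileType": "Snapshot",
        "FileOID": "F.ICU.DEMO",  # hypothetical identifier
        "ODMVersion": "1.3.2",
    })
    study = ET.SubElement(odm, "Study", {"OID": "S.ICU.DEMO"})
    mdv = ET.SubElement(study, "MetaDataVersion",
                        {"OID": "MDV.1", "Name": "ICU parameters"})
    group = ET.SubElement(mdv, "ItemGroupDef",
                          {"OID": "IG.1", "Name": "Parameters", "Repeating": "No"})

    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Item references first, then the item definitions they point to
    for index, _row in enumerate(rows, start=1):
        ET.SubElement(group, "ItemRef", {"ItemOID": f"I.{index}", "Mandatory": "No"})
    for index, row in enumerate(rows, start=1):
        item = ET.SubElement(mdv, "ItemDef", {
            "OID": f"I.{index}", "Name": row["name"], "DataType": row["datatype"],
        })
        question = ET.SubElement(item, "Question")
        text = ET.SubElement(question, "TranslatedText", {XML_LANG: "en"})
        text.text = f"{row['name']} [{row['unit']}]"
        if row.get("concept"):
            # Semantic annotation attached as an Alias element (context name assumed)
            ET.SubElement(item, "Alias", {"Context": "UMLS CUI", "Name": row["concept"]})
    return odm

sample = "name,unit,datatype,concept\nHeart rate,1/min,float,CUI-A\nTemperature,Cel,float,\n"
root = parameters_to_odm(sample)
xml_bytes = ET.tostring(root, encoding="utf-8")
```

A real converter would also add the mandatory study-level elements and validate the output against the official ODM schema.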

Lessons learned: Manual annotation is time-consuming, which limits the volume of metadata that can be annotated in a given time. Using the rule-based approach significantly reduces the required time on task. Re-training the transformer models requires computing resources that exceed conventional capabilities, but promises to improve coding rate and accuracy. High-performance computing infrastructures will therefore be employed.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Hume S, Aerts J, Sarnikar S, Huser V. Current applications and future directions for the CDISC Operational Data Model standard: A methodological review. J Biomed Inform. 2016 Apr 1;60:352–62. DOI: 10.1016/j.jbi.2016.02.016
2.
Hegselmann S, Storck M, Gessner S, Neuhaus P, Varghese J, Bruland P, Meidt A, Mertens C, Riepenhausen S, Baier S, Stöcker B, Henke J, Schmidt CO, Dugas M. Pragmatic MDR: a metadata repository with bottom-up standardization of medical metadata through reuse. BMC Med Inform Decis Mak. 2021 May 17;21(1):160. DOI: 10.1186/s12911-021-01524-8
3.
Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, Mascio A, Zhu L, Folarin AA, Roberts A, Bendayan R, Richardson MP, Stewart R, Shah AD, Wong WK, Ibrahim Z, Teo JT, Dobson RJB. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artif Intell Med. 2021 Jul;117:102083. DOI: 10.1016/j.artmed.2021.102083
4.
Faltys M, Zimmermann M, Lyu X, Hüser M, Hyland S, Rätsch G, et al. HiRID, a high time-resolution ICU dataset (version 1.1.1). PhysioNet. 2021.
5.
Johnson A, Pollard T, Mark R. MIMIC-III Clinical Database v1.4. PhysioNet. 2016 [cited 2023 Jun 7]. Available from: https://physionet.org/content/mimiciii/1.4/