Information extraction from German clinical documents in neurodegenerative diseases
Published: 29 August 2017
Background: Although hospitals run hospital information systems, a large amount of valuable patient information is documented in full text only. From a data mining point of view, this information cannot be exploited until it is transformed into standardized, machine-readable, structured information. We adapted a generic text mining environment, implemented as virtual machines, so that it can easily be set up as part of a data integration architecture within the University Hospital Bonn (UKB). As an application, we focus on structuring data in the area of neurodegenerative diseases (NDD).
Aim of the study: The IDSN consortium (Integrative Datensemantik für die Neurodegenerative Forschung) aims to integrate NDD-related data from diverse sources, including clinical routine documents processed by text mining.
Here, we present two information extraction pipelines that retrieve defined items from German discharge letters and neuropsychological test reports (NPTs). The end point of the extraction process is a mapping to the DZNE-DESCRIBE database schema, which is designed to collect anamnestic data via structured interviews.
Proposed methods: TM Infrastructure: The text mining service is deployed as a set of virtual machines (VMs) in the clinic infrastructure. In short, a Broker VM queues messages, which are then forwarded to the appropriate Worker VMs for processing. In the NDD use case, two Worker VMs have been set up. In the first VM, tables with cognitive test battery information are extracted from the NPTs and directly transformed into the DESCRIBE model. The second VM extracts five classes of common disturbances in dementia, together with temporal information. It contains generic NLP processes involving segmentation (tokens, word decomposition, sentences, paragraphs), stemming, and assertion recognition such as negation. For named entity recognition, a terminology was developed based on the training corpus and expert knowledge. For information extraction, rules are written in RUTA syntax. Document readers and an ODM writer connect the workflow to the hospital IT systems.
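The broker and worker arrangement described above can be illustrated with a minimal Python sketch. It is not the deployed system: the document type labels, the negation cue list, the two-entry terminology and all function names are hypothetical, sentence-level negation is a deliberate simplification, and the actual pipelines use RUTA rules rather than the regular expressions shown here.

```python
import queue
import re
from dataclasses import dataclass


@dataclass
class Message:
    doc_type: str  # hypothetical labels: "npt" or "discharge_letter"
    text: str


def npt_worker(text: str) -> dict:
    """Toy stand-in for the first Worker VM: read 'test name: score' lines."""
    scores = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)\s*$", line)
        if match:
            scores[match.group(1).strip()] = int(match.group(2))
    return scores


# Hypothetical German negation cues; a real system needs a curated list.
NEGATION_CUES = ("kein", "keine", "keinen", "nicht", "ohne")
# Hypothetical mini-terminology mapping surface forms to classes.
TERMS = {"gedächtnisstörung": "memory", "sprachstörung": "language"}


def letter_worker(text: str) -> list:
    """Toy stand-in for the second Worker VM: term lookup plus negation flag."""
    findings = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"\w+", sentence.lower())
        # Simplification: a cue anywhere in the sentence negates every term in it.
        negated = any(tok in NEGATION_CUES for tok in tokens)
        for tok in tokens:
            if tok in TERMS:
                findings.append({"class": TERMS[tok], "negated": negated})
    return findings


WORKERS = {"npt": npt_worker, "discharge_letter": letter_worker}


def run_broker(messages) -> list:
    """Queue all messages, then forward each one to the matching worker."""
    inbox = queue.Queue()
    for message in messages:
        inbox.put(message)
    results = []
    while not inbox.empty():
        message = inbox.get()
        results.append(WORKERS[message.doc_type](message.text))
    return results
```

Calling `run_broker` with one `Message` per document returns one result per message in queue order: a dictionary of scores for an NPT and a list of findings for a discharge letter. Keeping the routing table (`WORKERS`) separate from the workers mirrors the idea that further Worker VMs can be added without changing the broker.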
Training corpus and annotation: Because the NPTs have a highly structured format, only 10 records from different creation dates were used as the initial corpus. To generate training data from discharge letters, the documents are anonymised and presented to the user in a web-based annotation interface (BRAT). In BRAT, the user annotates the classes to be extracted. Furthermore, the anonymisation results can be inspected and corrected.
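Annotations made in BRAT are stored in its standoff format, in which each text-bound annotation is a tab-separated line holding an ID, a type with character offsets, and the covered text. The following Python sketch shows one way such a file could be read and checked against the anonymised document before it is used as training data. The function names are hypothetical, and only contiguous text-bound annotations are handled.

```python
from dataclasses import dataclass


@dataclass
class Span:
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_ann(ann_content: str) -> list:
    """Parse contiguous text-bound annotations (T lines) from .ann content."""
    spans = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and other line types
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # malformed line
        ann_id, type_and_offsets, text = parts
        if ";" in type_and_offsets:
            continue  # discontinuous spans are out of scope for this sketch
        label, start, end = type_and_offsets.split(" ")
        spans.append(Span(ann_id, label, int(start), int(end), text))
    return spans


def check_offsets(document: str, spans: list) -> list:
    """Return the IDs of annotations whose offsets do not match the document."""
    return [s.ann_id for s in spans if document[s.start:s.end] != s.text]
```

An empty list from `check_offsets` indicates that every annotation still lines up with the document text, which is a useful sanity check after anonymisation has been corrected by hand.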
Points for discussion: The effort required for information extraction, and its degree of success, are highly use case-dependent. For information that is already laid out in a structured way in text documents (such as our NPTs), the investment in setting up extraction pipelines is very low, while the result is high-quality structured information (e.g. longitudinal patient data from cognitive test batteries). In other cases, where anamnesis information and temporal data have to be extracted from discharge letters, setting up a corresponding workflow takes more effort. The main investments are the annotation of training data, terminology enrichment, and the adaptation and training of the extraction methods.
Taken together, our semantic information integration approach narrows the gap between unstructured resources and their automated use, ultimately providing longitudinal data on dementia patients.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.