
GMDS 2015: 60. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

06.09. - 09.09.2015, Krefeld

SEMCARE – Semantic Data Platform for Healthcare

Meeting Abstract

  • Thomas Faßbender - Averbis GmbH, Freiburg, Germany
  • Claudia Riede - Averbis GmbH, Freiburg, Germany
  • Philipp Daumke - Averbis GmbH, Freiburg, Germany
  • Angel Honrado - Synapse Research Management Partners S.L., Barcelona, Spain
  • Markus Kreuzthaler - Institut für Medizinische Informatik, Statistik und Dokumentation, Medizinische Universität Graz, Graz, Austria
  • Pablo Lopez-Garcia - Institut für Medizinische Informatik, Statistik und Dokumentation, Medizinische Universität Graz, Graz, Austria
  • Stefan Schulz - Medizinische Universität Graz, Graz, Austria
  • Erik van Mulligen - Erasmus Medisch Centrum, Rotterdam, The Netherlands
  • Jan Kors - Erasmus Medisch Centrum, Rotterdam, The Netherlands
  • Herman van Haagen - Erasmus Medisch Centrum, Rotterdam, The Netherlands
  • Hanney Gonna - St. George's University of London, London, Great Britain
  • Xinkai Wang - St. George's University of London, London, Great Britain
  • Elijah Behr - St. George's University of London, London, Great Britain

GMDS 2015. 60. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Krefeld, 06.-09.09.2015. Düsseldorf: German Medical Science GMS Publishing House; 2015. DocAbstr. 118

doi: 10.3205/15gmds024, urn:nbn:de:0183-15gmds0243

Published: 27 August 2015

© 2015 Faßbender et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.



Text

Introduction: The need to exploit medical data for secondary use has grown tremendously in recent years. Aggregated patient-level data can support the identification of disease mechanisms and new discovery areas, improve drug safety surveillance, and shorten patient recruitment cycles for clinical trials. Exploiting patient-level data can optimize clinical studies in several ways, e.g., by enabling the definition of an appropriate study design or by ensuring that inclusion/exclusion criteria map to an existing patient population. As large parts of the patient-level data in electronic health records (EHRs) are available only as free text, language technologies are an indispensable prerequisite for this process.

SEMCARE is an EU-funded project that is building a semantic data platform able to define patient cohorts based on clinical criteria scattered across heterogeneous resources. Three hospitals from the Netherlands, the UK, and Austria are serving as pilot sites for SEMCARE, focusing on cardiology use cases. However, SEMCARE's long-term objective is to build a flexible information extraction and semantic indexing platform that can be adapted to a broad range of languages and clinical contexts.

Methods: SEMCARE uses and extends the biomedical terminological resources available in the Unified Medical Language System (UMLS) Metathesaurus, which integrates more than 150 biomedical and health-related terminologies, e.g., MeSH, SNOMED CT, RxNorm, ICD-9, ICD-10, and LOINC. Although the UMLS contains a few terminologies that have been (partially) translated into German and Dutch, the vast majority of terms in the UMLS are available only in English. We used UMLS version 2014AA and converted the data to the Open Biomedical Ontologies (OBO) format for easy integration into the terminology management system that is part of the SEMCARE platform. The terminology management system [1] allows the user to import terminologies, to search and browse the terminology hierarchy, and to add or modify concepts and terms. Based on the UMLS and with a focus on the SEMCARE use case, we extended and partly translated several terminologies related to disorders (SNOMED CT, ICD-10) and to laboratory tests and procedures (LOINC). Dedicated German and Dutch drug lists were compiled and integrated. We used a semi-automatic approach to translate terms: machine translation with Google Translate, which has been shown to perform increasingly well in translating biomedical terminology [2], followed by manual inspection and correction.
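
As a rough illustration of the conversion step, the following Python sketch renders one concept with its translated synonyms as an OBO [Term] stanza. It is not the conversion code used in SEMCARE: the function name to_obo_stanza, the identifiers EX:0000 and EX:0001, and the in-memory input are assumptions made for this example, and a real conversion would read the concepts from the UMLS release files.

```python
def to_obo_stanza(concept_id, preferred_name, synonyms, parents=()):
    """Render one concept as an OBO [Term] stanza (hypothetical helper)."""
    lines = ["[Term]", f"id: {concept_id}", f"name: {preferred_name}"]
    for syn in synonyms:
        # Backslashes and double quotes must be escaped inside quoted OBO strings.
        escaped = syn.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'synonym: "{escaped}" EXACT []')
    for parent in parents:
        lines.append(f"is_a: {parent}")
    return "\n".join(lines)


# Illustrative identifiers only; they are not real UMLS or SNOMED CT codes.
stanza = to_obo_stanza(
    "EX:0001",
    "Heart failure",
    ["Herzinsuffizienz", "Hartfalen"],  # German and Dutch synonyms
    parents=["EX:0000"],
)
print(stanza)
```

Keeping translated terms as synonyms of a single concept, rather than as separate concepts, is one way to let the same concept identifier be matched in English, German, and Dutch text.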

SEMCARE's text mining pipeline [3] is based on the UIMA text mining framework. The pipeline consists of various taggers executed in sequence to extract information from clinical narratives. It works for various languages, including English, German, and Dutch. The pipeline accepts several input formats, such as Word, PDF, XML, or HL7. An OCR module allows the interpretation of scanned documents. The most important components of our current pipeline are:

  • Sentence Detection and Tokenizing: both rule-based and statistical approaches based on OpenNLP, trained on biomedical corpora, are applied
  • Stemming: the stemming algorithm is based on Snowball
  • Decompounding: we apply a lexicon-based decompounding approach [4], especially for non-English languages
  • Drug annotation, including dosage and regimen
  • Lab Value Detection, including numbers and units
  • Negation Detection with a customized NegEx algorithm
  • Date and Time recognition
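
The idea of taggers executed in sequence can be illustrated with a minimal Python sketch. It is not the SEMCARE or UIMA implementation: the Document and Annotation classes, the regular expressions, the two-entry finding list, and the trigger words are all assumptions made for illustration, and the negation step is a much simplified stand-in for a NegEx-style algorithm.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str
    start: int
    end: int
    text: str
    attrs: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


# Assumed toy patterns; a real pipeline would use terminologies and unit lists.
LAB_PATTERN = re.compile(
    r"\b(?P<name>EF|QRS)\b\D{0,20}?(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>%|ms)"
)
FINDINGS = ("heart failure", "LBBB")
NEG_TRIGGERS = re.compile(r"\b(?:no|not|without)\b", re.IGNORECASE)


def tag_lab_values(doc):
    """Annotate lab names followed by a number and a unit."""
    for m in LAB_PATTERN.finditer(doc.text):
        attrs = {"name": m.group("name"),
                 "value": float(m.group("value")),
                 "unit": m.group("unit")}
        doc.annotations.append(
            Annotation("lab", m.start(), m.end(), m.group(0), attrs))


def tag_findings(doc):
    """Annotate dictionary terms by case-insensitive string matching."""
    for term in FINDINGS:
        for m in re.finditer(re.escape(term), doc.text, re.IGNORECASE):
            doc.annotations.append(
                Annotation("finding", m.start(), m.end(), m.group(0)))


def tag_negation(doc, window=40):
    """Mark a finding as negated if a trigger precedes it in the same sentence."""
    for ann in doc.annotations:
        if ann.kind != "finding":
            continue
        # Naive sentence boundary: the last period before the finding.
        sentence_start = doc.text.rfind(".", 0, ann.start) + 1
        left = doc.text[max(sentence_start, ann.start - window):ann.start]
        ann.attrs["negated"] = bool(NEG_TRIGGERS.search(left))


def run_pipeline(text):
    doc = Document(text)
    # Order matters: negation needs the finding annotations to exist first.
    for tagger in (tag_lab_values, tag_findings, tag_negation):
        tagger(doc)
    return doc


doc = run_pipeline("EF 30% with heart failure symptoms. QRS duration 130 ms, no LBBB.")
for ann in doc.annotations:
    print(ann.kind, repr(ann.text), ann.attrs)
```

Each tagger reads the shared document and appends or enriches annotations, so later steps such as negation detection can build on the output of earlier ones.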

The search engine provides hybrid semantic and full-text search based on Apache Solr. This means that free-text queries and concept-based queries can be combined to query clinical data. Concept-based search has the advantage that the system automatically retrieves synonyms and hyponyms, thus yielding a higher recall. The advantage of free-text search is that it saves resources, because not all medical information needs to be predefined in terminologies and extracted via text mining.
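
A hybrid query of this kind might be assembled as in the following Python sketch, which combines a free-text phrase with a concept filter expanded to its hyponyms. The index field names text and concept_ids, the concept identifiers, and the toy hierarchy are assumptions for illustration and do not describe the actual SEMCARE index; only the Solr parameters q and fq and the field:value query syntax are standard.

```python
from urllib.parse import urlencode

# Hypothetical concept hierarchy: concept id -> direct hyponyms.
HYPONYMS = {
    "C_HEART_FAILURE": ["C_SYSTOLIC_HF", "C_DIASTOLIC_HF"],
    "C_SYSTOLIC_HF": [],
    "C_DIASTOLIC_HF": [],
}


def expand_concept(concept_id, hierarchy):
    """Return the concept together with all of its transitive hyponyms."""
    seen, stack = [], [concept_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(hierarchy.get(current, []))
    return seen


def build_hybrid_query(free_text, concept_id, hierarchy):
    """Build Solr parameters: free-text phrase in q, concept filter in fq."""
    ids = sorted(expand_concept(concept_id, hierarchy))
    # Escape characters that would end the quoted phrase prematurely.
    phrase = free_text.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "q": f'text:"{phrase}"',
        "fq": "concept_ids:(" + " OR ".join(ids) + ")",
    }


params = build_hybrid_query("shortness of breath", "C_HEART_FAILURE", HYPONYMS)
print(params["q"])
print(params["fq"])
print(urlencode(params))  # query string for a Solr /select request
```

The concept part of the query matches documents annotated with the concept or any of its hyponyms, which is where the gain in recall comes from, while the free-text part covers wording that no terminology has captured yet.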

The faceted search interface is designed for clinicians and researchers to identify high-risk patient cohorts based on patient-level criteria. The facets include an age selector, a diagnosis facet, a lab value facet, a drug or medication facet, and a dedicated facet for SEMCARE-specific concepts. Facets can be combined with AND, OR, and NOT operators. For numerical values such as lab values, range searches can be applied to detect abnormal values. A typical query could be as follows:

EF ≤ 35% AND Class I-III Heart Failure Symptoms AND QRS duration 120–149 ms AND NOT LBBB

Patients matching these criteria may be suitable for receiving an Implantable Cardioverter Defibrillator (ICD).
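
The logic of the example query can be written out as a simple predicate over structured patient records, as in the Python sketch below. The Patient class and its field names are hypothetical and only serve to make the combination of range searches and Boolean operators explicit; the platform itself evaluates such criteria through its search index.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    ef_percent: float   # ejection fraction in percent
    hf_class: int       # heart failure symptom class, 1 to 4
    qrs_ms: float       # QRS duration in milliseconds
    has_lbbb: bool      # left bundle branch block present


def matches_example_query(p):
    """EF <= 35% AND class I-III symptoms AND QRS 120-149 ms AND NOT LBBB."""
    return (
        p.ef_percent <= 35
        and 1 <= p.hf_class <= 3
        and 120 <= p.qrs_ms <= 149
        and not p.has_lbbb
    )


# Made-up records for illustration only.
patients = [
    Patient("p1", 30, 2, 130, False),  # satisfies every criterion
    Patient("p2", 30, 2, 130, True),   # excluded by NOT LBBB
    Patient("p3", 45, 2, 130, False),  # EF above the threshold
    Patient("p4", 35, 3, 150, False),  # QRS outside the range
]
cohort = [p.patient_id for p in patients if matches_example_query(p)]
print(cohort)
```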

Results and Discussion: In the session, we will present the semantic data platform developed in SEMCARE, based on anonymized clinical data. We will show how this platform is able to identify patient cohorts based on patient-level criteria, many of which are found in weakly structured text. In addition, we will present the clinical evaluation of the platform, which is currently being tested by three major European hospitals in a cardiology use case. Finally, we will give an outlook on SEMCARE's long-term objective, which is to build a flexible information extraction and semantic indexing platform that can be adapted to a broad range of languages and clinical contexts.


References

1. termbrowser.com. http://www.termbrowser.com
2. Schulz S, et al. Machine vs. human translation of SNOMED CT terms. Stud Health Technol Inform. 2013;192:581-584.
3. Enders F, Simon K, Tomanek K, Markó K, Daumke P. Die Averbis Extraction Platform - Sekundärnutzung klinischer Rohdaten - Technologien, Tools und Anwendungsszenarien. In: Mainz//2011. 56. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 6. Jahrestagung der Deutschen Gesellschaft für Epidemiologie (DGEpi). Mainz, 26.-29.09.2011. Düsseldorf: German Medical Science GMS Publishing House; 2011. Doc11gmds396. DOI: 10.3205/11gmds396
4. Markó K, Schulz S, Hahn U. MorphoSaurus - Design and Evaluation of an Interlingua-based, Cross-language Document Retrieval Engine for the Medical Domain. Methods of Information in Medicine. 2005;44(4):537-545.