gms | German Medical Science

GMDS 2013: 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

01. - 05.09.2013, Lübeck

Biomedical Term Acquisition Based on Aligned Parallel Corpora

Meeting Abstract

  • Johannes Hellrich - Friedrich-Schiller Universität Jena, Jena, DE
  • Udo Hahn - Friedrich-Schiller Universität Jena, Jena, DE

GMDS 2013. 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Lübeck, 01.-05.09.2013. Düsseldorf: German Medical Science GMS Publishing House; 2013. DocAbstr.198

doi: 10.3205/13gmds098, urn:nbn:de:0183-13gmds0981

Published: August 27, 2013

© 2013 Hellrich et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.


Introduction and Objectives: Creating and maintaining biomedical terminologies is a resource-intensive task, carried out mostly by human experts. A wide variety of broad-coverage terminological resources exists for English. However, support for many other European languages, such as French, Spanish, or German, is lacking. For example, the non-English counterparts of the Unified Medical Language System (UMLS) [1] have only 30–60% of the coverage of the English UMLS. In this paper, we report on efforts to partially close this gap by combining off-the-shelf solutions for statistical machine translation (SMT) and biomedical named entity recognition (NER). Our solution produced 7992 potential new German entries for the UMLS, of which about 75% can be assumed to be correct. Other groups have done similar research in the past, yet their systems relied on customized components [2].

Material and Methods: Our approach is language-independent and exploits phrase alignments from parallel corpora. Our German-English parallel corpus consists of 719k bilingual Medline titles and 364k sentences from the EMEA corpus [3], all annotated for entities already contained in the UMLS. We used 10% of the corpus to train the JCoRe NER system [4], which recognized biomedical entities with an F-score of 0.78 in ten-fold cross-validation on these texts. The remaining 90% of the corpus was used to train a phrase-based SMT model with the GIZA++ tool. Restricting the SMT model to English biomedical terms yields German translation candidates for them. We filtered the candidates with a maximum entropy model, using as features the NER system's judgment on their entity status and the probability values from the SMT model.
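The candidate filter can be illustrated with a minimal sketch. It assumes a binary maximum entropy model (equivalent to logistic regression) trained by stochastic gradient ascent; the feature layout (one NER flag plus two translation probabilities), the training rows, and all function names are hypothetical and chosen for illustration only, not taken from the system described here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_maxent(samples, labels, epochs=500, lr=0.5):
    """Fit a binary maximum entropy model by stochastic gradient ascent
    on the log-likelihood. Returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the score
            bias += lr * err
            weights = [w + lr * err * v for w, v in zip(weights, x)]
    return weights, bias


def keep_probability(model, features):
    """Probability that a translation candidate should be kept."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features)))


# Hypothetical training rows (assumption, not data from the abstract):
# [NER judges candidate an entity (0/1), P(german|english), P(english|german)]
TRAIN_X = [
    [1.0, 0.80, 0.70],
    [1.0, 0.60, 0.90],
    [1.0, 0.70, 0.50],
    [0.0, 0.10, 0.20],
    [0.0, 0.30, 0.10],
    [1.0, 0.05, 0.10],
    [0.0, 0.20, 0.05],
]
TRAIN_Y = [1, 1, 1, 0, 0, 0, 0]  # 1 = correct translation, 0 = reject

model = train_maxent(TRAIN_X, TRAIN_Y)
good = keep_probability(model, [1.0, 0.75, 0.80])
bad = keep_probability(model, [0.0, 0.10, 0.10])
```

Candidates whose keep probability exceeds a chosen threshold (0.5 would be a natural default) are retained. In practice the labels for training such a filter could come from translations already present in the UMLS, though the abstract does not state how the model was trained.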

Results: We evaluated both the system's ability to reconstruct parts of the UMLS and the quality of the new translations it provides. We measured the former by selecting those UMLS concepts contained in our parallel corpus and comparing the translations provided by our system with the canonical ones, achieving an F-score of 0.72. We measured the latter by providing a biomedical expert with 100 translations that were proposed by our system and not yet contained in the UMLS. According to her judgment, 75% of these are desirable additions to the German UMLS.
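The reconstruction score can be sketched as follows. This is a minimal illustration that assumes the F-score is the balanced F1 over sets of (concept, German term) pairs; the abstract does not spell out the exact matching procedure, and the concept identifiers and terms below are invented placeholders.

```python
def precision_recall_f1(proposed, gold):
    """Compare proposed (concept, term) pairs against gold pairs."""
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder data (assumption): concept IDs are not real UMLS identifiers.
GOLD = {
    ("concept-1", "Tuberkulose"),
    ("concept-2", "Polyvinylalkohol"),
    ("concept-3", "schwangere Frauen"),
    ("concept-4", "Grapefruit"),
}
PROPOSED = {
    ("concept-1", "Tuberkulose"),
    ("concept-2", "Polyvinylalkohol"),
    ("concept-3", "Frauen"),  # segmentation error: counted as wrong
    ("concept-4", "Grapefruit"),
}

precision, recall, f1 = precision_recall_f1(PROPOSED, GOLD)
```

Exact matching on pairs is a strict choice: a translation that differs from the canonical one only by inflection counts as an error under this scheme.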

Discussion: A qualitative analysis of the sampled translations reveals several error classes. We primarily observed segmentation errors (8% of all inspected translations, e.g. "Frauen" as a translation for "pregnant women"), followed by translations that are too narrow or too broad (7%, e.g. "Polyvinylalkohol" for "Polyvinyl"), inexplicable errors (6%, e.g. "schwanger" for "grapefruit"), and undesired inflected forms (3%, e.g. "extrapulmonalen Tuberkulose"; the UMLS contains only nominative forms). To weed out these erroneous translations, we plan to explore stemming to remove inflected forms, and semantic clues based on the hierarchical information in the UMLS to detect translations that are too narrow or too broad. Broader coverage could be achieved by incorporating texts from other sources, such as web pages or patents.
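The stemming idea mentioned above can be sketched in a few lines. This is a deliberately crude suffix stripper written for illustration, not an established German stemmer and not the method the authors eventually adopted; the suffix list, the minimum stem length, and the function names are assumptions. It only shows how inflected variants of one term could be grouped under a shared key so that a single form is kept.

```python
from collections import defaultdict

# Assumed list of common German inflectional endings, longest first.
SUFFIXES = ("en", "em", "er", "es", "e")
MIN_STEM_LENGTH = 4  # assumption: avoid stripping very short words


def crude_stem(word):
    """Lowercase a word and strip at most one inflectional suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def term_key(term):
    """Build a grouping key for a multi-word term from its stemmed words."""
    return " ".join(crude_stem(word) for word in term.split())


def group_variants(candidates):
    """Group translation candidates that share the same stemmed key."""
    groups = defaultdict(list)
    for candidate in candidates:
        groups[term_key(candidate)].append(candidate)
    return dict(groups)


CANDIDATES = [
    "extrapulmonalen Tuberkulose",
    "extrapulmonale Tuberkulose",
    "Polyvinylalkohol",
]
GROUPS = group_variants(CANDIDATES)
```

Within each group, a separate step would still have to pick the nominative form, for example by preferring a variant that already occurs as a UMLS entry. A stripper this simple will also conflate unrelated words in some cases, so it is best read as a starting point.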


References

1. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004;32(Database issue):D267–D270.
2. Déjean H, Gaussier E, Renders JM, Sadat F. Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval. Artificial Intelligence in Medicine. 2005;33(2):111–124.
3. Tiedemann J. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In: Recent Advances in Natural Language Processing. vol. V; 2009. p. 237–248.
4. Hahn U, Buyko E, Landefeld R, Mühlhausen M, Poprat M, Tomanek K, et al. An overview of JCoRe, the JULIE Lab UIMA component repository. In: LREC'08 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP'. 2008. p. 1–7.