gms | German Medical Science

49. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds)
19. Jahrestagung der Schweizerischen Gesellschaft für Medizinische Informatik (SGMI)
Jahrestagung 2004 des Arbeitskreises Medizinische Informatik (ÖAKMI)

26 to 30 September 2004, Innsbruck/Tyrol

Cross-Language Document Retrieval with MorphoSaurus

Meeting Abstract (gmds2004)

  • Kornel Marko (corresponding author, presenting) - Institut für Med. Biometrie und Med. Informatik, Universitätsklinikum Freiburg, Freiburg, Germany
  • Stefan Schulz - Institut für Med. Biometrie und Med. Informatik, Universitätsklinikum Freiburg, Freiburg, Germany
  • Joachim Wermter - Institut für Med. Biometrie und Med. Informatik, Universitätsklinikum Freiburg, Freiburg, Germany
  • Michael Poprat - Arbeitsgruppe Computerlinguistik, Universität Freiburg, Freiburg, Germany
  • Udo Hahn - Arbeitsgruppe Computerlinguistik, Universität Freiburg, Freiburg, Germany

Kooperative Versorgung - Vernetzte Forschung - Ubiquitäre Information. 49. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 19. Jahrestagung der Schweizerischen Gesellschaft für Medizinische Informatik (SGMI) und Jahrestagung 2004 des Arbeitskreises Medizinische Informatik (ÖAKMI) der Österreichischen Computer Gesellschaft (OCG) und der Österreichischen Gesellschaft für Biomedizinische Technik (ÖGBMT). Innsbruck, 26.-30.09.2004. Düsseldorf, Köln: German Medical Science; 2004. Doc04gmds065

The electronic version of this article is the complete one and can be found online at:

Published: September 14, 2004

© 2004 Marko et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License. You are free to share (to copy, distribute and transmit) the work, provided the original author and source are credited.




Medical information retrieval (IR) presents a unique combination of challenges for the design and implementation of retrieval engines [1]. First, clinical document collections and medical databases are usually very large and dynamic. Second, medical document collections are truly multilingual. While clinical documents are typically written in the physicians' native language, searches in major bibliographic databases such as MEDLINE require expert-level knowledge of English medical terminology, which most non-English-speaking physicians do not have. Third, medical terminology is morphologically extremely productive, characterized by a typical mix of Latin and Greek roots with the corresponding host language, e.g., in words such as pseudohypoparathyroidism or Gastrointestinaltrakt. Dealing with such phenomena is therefore crucial for any medical IR system.

We respond to these challenges with the MORPHOSAURUS system (an acronym for MORPHeme TheSAURUS). At its core lies a special type of dictionary in which the entries are equivalence classes of subwords, i.e., semantically minimal units [2]. These equivalence classes capture intralingual as well as interlingual synonymy. We contrast two retrieval approaches: one preprocesses and transforms medical documents (and queries) with the MorphoSaurus system, while the other relies on the direct translation of non-English (here, German and Portuguese) queries into English ones for subsequent processing on an English document collection. We evaluate these two fundamentally different approaches on a large medical document collection (the Ohsumed corpus [3]).

Morpho-semantic Normalization

Figure 1 [Fig. 1] depicts how source documents are converted into a morpho-semantically normalized, interlingual representation by a three-step procedure. The first step deals with orthographic normalization. A preprocessor converts all upper-case characters in the input documents to lower case and, additionally, performs language-specific character substitutions (e.g., the replacement of German umlauts). The next step in the pipeline is concerned with morphological segmentation. The system segments the input stream into a sequence of semantically plausible sublexical items, corresponding to subwords as found in the lexicon. Currently, the subword lexicon contains about 57,000 entries: 21,000 each for German and English, and 15,000 for Portuguese. In the final step, semantic normalization, each content-bearing subword recognized is replaced by the identifier of its equivalence class (called MorphoSaurus identifier, MID). After that step, all synonyms within a language and all translations of semantically equivalent subwords from different languages are represented by the same code in the target representation.
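The three steps described above can be illustrated with a minimal sketch. It is not the MorphoSaurus implementation: the toy lexicon, the MID labels (e.g. `#stomach`), the function names, the specific umlaut substitutions and the longest-match segmentation strategy are all assumptions made for illustration only.

```python
import re

# Hypothetical toy subword lexicons: subword -> MID (equivalence-class id).
# None marks subwords that carry no content (e.g. affixes, linking vowels).
# All entries and MID labels are invented for this illustration.
LEXICON = {
    "en": {"gastr": "#stomach", "intestin": "#intestine", "tract": "#tract",
           "itis": "#inflamm", "o": None, "al": None},
    "de": {"gastr": "#stomach", "magen": "#stomach", "intestin": "#intestine",
           "darm": "#intestine", "trakt": "#tract", "entzuend": "#inflamm",
           "itis": "#inflamm", "o": None, "al": None, "ung": None},
}

# Assumed substitution table for German; the source only says umlauts are replaced.
GERMAN_SUBSTITUTIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def orthographic_normalize(text, lang):
    """Step 1: lower-casing plus language-specific character substitutions."""
    text = text.lower()
    if lang == "de":
        for char, replacement in GERMAN_SUBSTITUTIONS.items():
            text = text.replace(char, replacement)
    return text


def segment(word, lexicon):
    """Step 2: split a word into lexicon subwords, trying longer prefixes first.

    Returns a list of subwords, or None if no complete segmentation exists.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def normalize(text, lang):
    """Steps 1-3: map a text to its sequence of MIDs."""
    lexicon = LEXICON[lang]
    mids = []
    for token in re.findall(r"[a-z]+", orthographic_normalize(text, lang)):
        # Fallback (an assumption): keep unsegmentable tokens unchanged.
        subwords = segment(token, lexicon) or [token]
        for subword in subwords:
            mid = lexicon.get(subword, subword)
            if mid is not None:  # Step 3: drop non-content subwords
                mids.append(mid)
    return mids


print(normalize("Gastrointestinaltrakt", "de"))
print(normalize("gastrointestinal tract", "en"))
print(normalize("Magenentzündung", "de"), normalize("Gastritis", "en"))
```

In this toy setting, the German compound and its English counterpart end up with the same MID sequence, and so do the intralingual variants built from "magen" and "gastr", which is the property the interlingual index relies on.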

Experimental Settings

Our experiments were run on the Ohsumed corpus [3], which constitutes one of the standard IR testbeds for the medical domain. Ohsumed is a subset of the MEDLINE database. Considering only those documents which contain abstracts (some do not), we obtained a document collection comprising 233,445 texts with a total of 41 million tokens. Since the Ohsumed corpus was created specifically for IR studies, 106 queries are available, including associated relevance judgments. The following is a typical query: "Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy?" Since the Ohsumed corpus contains only English-language documents, the question arises how this collection (or MEDLINE in general) can be accessed from other languages as well.

Query translation (QTR) can be regarded as a standard, and often preferred, experimental procedure in the cross-language retrieval community [4]. In our experiments, the original English queries were first translated into Portuguese and German by medical experts (native speakers of Portuguese or German with a very good command of both general and medical English). In a second step, the manually translated queries were re-translated into English using the Google Translator. Additionally, to cover the medical sublanguage, we used a bilingual lexeme dictionary derived from the UMLS Metathesaurus [5] with about 26,000 German-English entries and 14,200 Portuguese-English entries. As an alternative to QTR, we probed the MorphoSaurus indexing approach (MSI). Unlike QTR, the normalization of documents and queries yields a language-independent, semantically normalized index format. As the baseline for our experiments, we provide a retrieval system operating with a word stemmer and a stopword list, running on (original) English documents with (original) English queries. For an unbiased evaluation, we used a simple Boolean search approach incorporating adjacency metrics.


It is not surprising that the English-English baseline performs best, with an 11pt average (a standard metric in IR) of 0.14 (cf. Table 1 [Tab. 1]). The German-English MSI result is almost on a par with the baseline (0.13, i.e., 0.01 lower), whereas the German-English QTR result is more than 0.05 points worse (0.09). This means that the MSI approach achieved 93% of the baseline performance (quite a high score by cross-language IR standards), whereas the QTR approach scored far lower (62%). The difference is less dramatic, but still noticeable, when comparing the Portuguese-English MSI and QTR results with the baseline (68% for MSI and 54% for QTR, a difference of 14 percentage points). Both the MSI and the QTR 11pt averages are much lower for the Portuguese-English retrieval case. In any case, it is worth noting that at no recall point were the QTR values higher than the MSI values. Hence, MSI consistently outperformed QTR for both languages. From a realistic retrieval perspective, the average precision at the top two recall points is particularly interesting. Here (cf. Table 1 [Tab. 1]), the Portuguese-English MSI condition achieves a precision of 0.26 (72% of the baseline), and the German-English MSI condition yields a precision of 0.32 (90% of the baseline).
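For readers unfamiliar with the measure, the following sketch shows how an 11pt average can be computed for a single query and how a result is expressed relative to a baseline. It assumes the common definition (interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0); the abstract does not state the exact evaluation procedure, and the function names and the example ranking are hypothetical.

```python
def eleven_point_average(ranking, relevant):
    """11-point interpolated average precision for one query.

    ranking:  document ids in the order returned by the retrieval system
    relevant: ids judged relevant for the query
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0

    # (recall, precision) observed each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolated precision at a recall level = best precision observed
    # at that recall or higher (0.0 if the level is never reached).
    interpolated = []
    for step in range(11):
        level = step / 10
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable, default=0.0))
    return sum(interpolated) / 11


def relative_to_baseline(score, baseline):
    """Express a score as a percentage of the baseline score."""
    return 100.0 * score / baseline


# Invented example: two relevant documents, found at ranks 1 and 3.
print(eleven_point_average(["d1", "d2", "d3", "d4"], {"d1", "d3"}))

# Rounded 11pt averages from the text: German-English MSI vs. baseline.
print(round(relative_to_baseline(0.13, 0.14)))
```

With the rounded values reported above, 0.13 / 0.14 gives roughly 93%. The 62% reported for German-English QTR does not follow exactly from the rounded values 0.09 and 0.14, presumably because the percentages were derived from unrounded averages.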


The success of dictionary-based cross-language IR largely depends on the coverage of the underlying lexicons. We optimize lexical coverage by limiting the lexicon to semantically relevant subwords. Based on this architecture, we presented an interlingua approach to cross-language information retrieval on a medical document collection. Compared to state-of-the-art direct translation techniques, we achieved a remarkable benefit, at least for German, where MSI reached 93% of the English baseline.


This work was partly funded by Deutsche Forschungsgemeinschaft (DFG), grant Klar 640/5-1.


1. Hersh WR. Information Retrieval. A Health and Biomedical Perspective. New York: Springer; 2002.
2. Schulz S, Honeck H, Hahn U. Indexing Medical WWW Documents by Morphemes. Proc. of 10th World Congress on Med. Informatics - MEDINFO 2001. Amsterdam: IOS Press; 2001: 266-270.
3. Hersh WR, Buckley C, Leone TJ, Hickam DH. OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. Proceedings 17th Intl. ACM SIGIR Conference; 1994: 192-201.
4. Eichmann D, Ruiz ME, Srinivasan P. Cross-language Information Retrieval with the UMLS Metathesaurus. Proc. 21st Intl. ACM SIGIR Conference on Research and Development in Information Retrieval; 1998: 72-80.
5. UMLS. Unified Medical Language System. Bethesda, MD: National Library of Medicine; 2003.