gms | German Medical Science

49. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds)
19. Jahrestagung der Schweizerischen Gesellschaft für Medizinische Informatik (SGMI)
Jahrestagung 2004 des Arbeitskreises Medizinische Informatik (ÖAKMI)

26 to 30 September 2004, Innsbruck/Tyrol

An Annotated German-Language Medical Text Corpus

Meeting Abstract (gmds2004)

  • Joachim Wermter (corresponding author, presenting speaker) - Universitätsklinikum, Freiburg, Germany
  • Udo Hahn - Universität, Freiburg, Germany

Kooperative Versorgung - Vernetzte Forschung - Ubiquitäre Information. 49. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 19. Jahrestagung der Schweizerischen Gesellschaft für Medizinische Informatik (SGMI) und Jahrestagung 2004 des Arbeitskreises Medizinische Informatik (ÖAKMI) der Österreichischen Computer Gesellschaft (OCG) und der Österreichischen Gesellschaft für Biomedizinische Technik (ÖGBMT). Innsbruck, 26.-30.09.2004. Düsseldorf, Köln: German Medical Science; 2004. Doc04gmds168

The electronic version of this article is the complete one and is available at:

Published: 14 September 2004

© 2004 Wermter et al.
This is an Open Access article distributed under the terms of the Creative Commons license. It may be reproduced, distributed and made publicly available, provided that the author and source are credited.




In our lab, research efforts are directed at implementing information extraction systems for the medical field [1]. To meet clinical requirements, robust medical language analysis is a key issue in making such systems fit for routine use. To adapt our medical language processing (MLP) facilities to medical jargon in a more fault-tolerant manner, we created a training and test environment built around a major medical language resource: a large part-of-speech (POS) annotated German-language medical text corpus. To evaluate the benefit of such a resource, we trained and tested a POS tagger on it.

Corpus Description

FraMed, the FReiburg Annotated MEDical text corpus, combines a wide variety of German-language medical text genres, with a focus on clinical reports, and comprises about 100,150 tokens. The clinical genres, taken from the Universitätsklinikum Freiburg, cover discharge summaries as well as pathology, histology and surgery reports. The non-clinical genres consist of medical expert texts (taken from a medical textbook [2]) and health care consumer texts taken from a health-centered Web portal.

Table 1 [Tab. 1] gives a quantitative account of the corpus: the number of sentences, text tokens and types for each genre, the average sentence lengths (ASL) and the normalized token/type ratios (TTR). For comparison, we also include these measures for a random sample of approximately 100,000 tokens taken from NEGRA, a 355,000-token annotated German newspaper corpus [3].

A look at the individual clinical text genres reveals a great deal of variability in average sentence length, as witnessed by the enormous standard deviations (given in parentheses in the table). Except for surgery reports, the standard deviation always amounts to at least two thirds of the average sentence length. This may well be related to the way we defined the notion of a sentence, viz. as a consecutive sequence of tokens delimited by a period, a question mark or an exclamation mark. The non-clinical text genres show a less dramatic deviation. This does not come as a surprise, because these genres are more standardized and more carefully produced, both linguistically and stylistically. Textbook texts in particular seem to be in line with the newspaper material, both in terms of the average sentence length and the less pronounced standard deviation.

The token/type ratio is a measure from corpus statistics that usually indicates the variety of vocabulary in a text. It is well known that this measure becomes less reliable when texts of different sizes are compared. Therefore, we normalized the medical texts and the newspaper material by taking a random sample of 7138 tokens (the size of the discharge summaries) from each genre. Of the medical texts, histology reports exhibit the lowest vocabulary diversity (4.8 tokens per type), whereas the textbook material shows the highest (3.3 tokens per type). Of all clinical genres, discharge summaries show the most variation (3.4 tokens per type). This can be attributed to the fact that they are the most articulate and prosaic of all clinical document genres, both in terms of their linguistic form and their contents. Interestingly, though, when the TTR values of the FraMed document types are compared with that of the newspaper sample, their sublanguage character becomes evident: the newspaper material shows substantially more variation in vocabulary than any of the medical text genres or FraMed as a whole.
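The two statistics discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names `sentence_lengths`, `asl_and_sd` and `normalized_ttr` are hypothetical, the input is assumed to be an already tokenized list, the delimiter is counted as part of its sentence, the sample standard deviation is used, and the random sample is drawn token by token without replacement (a contiguous random text span would be another reading of the abstract).

```python
import random
import statistics

SENTENCE_END = {".", "?", "!"}  # the three delimiters named in the text

def sentence_lengths(tokens):
    """Split a token list at '.', '?' or '!' and return sentence lengths in tokens."""
    lengths, current = [], 0
    for tok in tokens:
        current += 1  # assumption: the delimiter counts towards the sentence
        if tok in SENTENCE_END:
            lengths.append(current)
            current = 0
    if current:  # trailing tokens without a closing delimiter
        lengths.append(current)
    return lengths

def asl_and_sd(tokens):
    """Average sentence length and its sample standard deviation."""
    lengths = sentence_lengths(tokens)
    sd = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return statistics.mean(lengths), sd

def normalized_ttr(tokens, sample_size=7138, seed=0):
    """Tokens per type on a fixed-size random sample of a token list."""
    if len(tokens) < sample_size:
        raise ValueError("corpus is smaller than the requested sample size")
    # assumption: tokens are sampled individually, without replacement
    sample = random.Random(seed).sample(tokens, sample_size)
    return sample_size / len(set(sample))
```

Because every genre is reduced to the same number of tokens before types are counted, the resulting ratios can be compared across genres of very different sizes; a higher value means fewer distinct word forms, i.e. lower vocabulary diversity.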

Annotating the FraMed Text Corpus

The manual linguistic annotation of text corpora is a prerequisite for the development of standard NLP tools, such as POS taggers, phrase chunkers, syntactic parsers, and grammar and lexicon learners. So far, the creation of such resources has focused almost exclusively on general-language newspaper and newswire genres. The POS annotation of FraMed is meant to fill this gap for a particular sublanguage domain, viz. German medical language. For our annotation purposes, we adopted STTS [4], a standard general-language tagset for German comprising 54 tags.

An inspection of the variety of clinical and non-clinical texts indicates that medical language has some unique properties. Among them are the use of Latin and Greek terminology (sometimes mixed with the host language, here German), various ad hoc forms of abbreviations and acronyms, a variety of (sometimes idiosyncratically used) measurement units, enumerations, and some others. The question arises whether these are characteristic or merely marginal sublanguage properties. Thus, for our tagging purposes, we enhanced the standard STTS tagset with three novel tags intended to capture some of the ubiquitous properties of medical texts not covered by a general-purpose tagset, yielding the extended STTS-MED tagset. POS tags from both the standard STTS and the extended STTS-MED tagset are shown in Table 2 [Tab. 2].

Training and Testing a Part-of-Speech Tagger

We evaluated the MLP benefit of our annotated medical corpus by means of a statistical POS tagger (TNT [5]). So far, the top accuracy of taggers for various languages, mostly trained on newspaper corpora such as NEGRA, has ranged from 96% to slightly over 97%. We compared TNT's tagging performance on the general-purpose newspaper language domain with its performance on the medical sublanguage domain. For this purpose, the tagger was trained on the 100,000-token random sample of the NEGRA corpus with the standard STTS tagset, and on the FraMed medical corpus with the extended STTS-MED tagset. The tests were performed on partitions of the corpora that use 90% as the training set and 10% as the test set, so that the test data was guaranteed to be unseen during training. This process was repeated ten times (ten-fold cross-validation), each time using a different 10% as the test set, and the individual results were then averaged. For the FraMed run, we achieved a tagging accuracy of 98%, whereas for the NEGRA run, the result was only 95.7%. Hence, the tagger trained on sublanguage domain data performed substantially better than the one trained on general-purpose language data. Specialized sublanguage resources like FraMed thus seem to be a valuable asset for re-training and newly developing robust and effective language tools for important MLP applications such as routine information extraction from clinical documents or medical question answering services.
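The evaluation protocol described above (ten disjoint 10% test sets, training on the remaining 90%, averaging the ten accuracies) can be sketched as follows. This is an illustration only: the tagger here is a simple most-frequent-tag baseline standing in for TNT, and the names `train_baseline`, `accuracy` and `ten_fold_accuracy` are hypothetical. Splitting at the sentence level after a random shuffle is also an assumption; the abstract does not state how the partitions were drawn.

```python
import random
from collections import Counter, defaultdict

def train_baseline(sentences):
    """Stand-in tagger (NOT TNT): map each word to its most frequent training tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback tag for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word, default)

def accuracy(tagger, sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(w, t) for sent in sentences for w, t in sent]
    return sum(tagger(w) == t for w, t in pairs) / len(pairs)

def ten_fold_accuracy(sentences, k=10, seed=0):
    """Average accuracy over k disjoint test folds, training on the other k-1 folds."""
    data = list(sentences)
    if len(data) < k:
        raise ValueError("need at least k sentences")
    random.Random(seed).shuffle(data)  # assumption: random sentence-level split
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(train_baseline(train), folds[i]))
    return sum(scores) / k
```

Each sentence is a list of `(word, tag)` pairs, for example `[("der", "ART"), ("Tumor", "NN")]`. Since every sentence lands in exactly one fold, each token is used for testing exactly once and is never seen by the model that tags it.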


References

[1] Hahn U, Romacker M, Schulz S. MedSyndicate: A natural language system for the extraction of medical information from findings reports. Int J of Med Inf 2002; 67(1/3): 63-74.
[2] MSD: Manual der Diagnostik und Therapie [CD-ROM], 5th ed. München: Urban & Schwarzenberg; 1993.
[3] Brants T, Skut W, Uszkoreit H. Syntactic annotation of a newspaper corpus. In: Abeillé A (ed.) Treebanks: Building and Using Parsed Corpora. Kluwer; 2003: 73-87.
[4] Thielen C, Schiller A. Ein kleines und erweitertes Tagset fürs Deutsche. In: Feldweg H, Hinrichs H (eds.) Lexikon und Text. Wiederverwendbare Methoden und Ressourcen zur linguistischen Erschließung des Deutschen. Niemeyer; 1996: 193-204.
[5] Brants T. TNT: A statistical part-of-speech tagger. Proc of the 6th Conf on Applied Natural Language Processing 2000: 224-231.