### Article

## Extracting Multi-Word Terminology from Biomedical Text

Published: September 8, 2005

### Text

#### Introduction

With proliferating volumes of medical and biological text available, the need to extract and manage domain-specific terminologies has become increasingly relevant in recent years. Most available terminological dictionaries, however, are still far from complete and, what is worse, a constant stream of new terms enters via the ever-growing biomedical literature. Thus, there have been many studies examining various methods for automatic term recognition (ATR) from biomedical literature. Typically, such approaches make use of various degrees of linguistic filtering (e.g., part-of-speech tagging, phrase chunking, etc.), through which candidates of various linguistic patterns are identified (e.g., *noun-noun*, *adjective-noun-noun* combinations, etc.). These candidates are then submitted to frequency-based or statistical evidence measures (e.g., C-value [Ref. 1]), which compute weights indicating to what degree a candidate qualifies as a terminological unit. The purpose of our study is to present a novel term recognition measure which directly incorporates a vital linguistic property of terms, namely their limited *paradigmatic modifiability*. Evaluating it against some of the standard procedures, we show that it substantially outperforms them on the task of trigram term extraction from biomedical text.

#### Materials and Methods

**The Training Set.** We collected a biomedical training corpus of approximately 490,000 MEDLINE abstracts. We then annotated this 114-million-word corpus with the GENIA part-of-speech tagger and identified noun phrases (NPs) with the YAMCHA-Chunker. In this study, we restricted ourselves to NP recognition because the vast majority of biomedical terminology (and terms in general) is contained within noun phrases. In order to obtain our term candidate sets, we counted the frequency of occurrence of noun phrases in our training corpus and categorized them according to their length. For this study, we collected 34,165 noun phrase candidate types of length 3, considering only candidates with frequency *f > 7*.
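The candidate collection step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the chunker output has already been converted to one token list per noun phrase, and the function and parameter names are hypothetical.

```python
from collections import Counter


def collect_trigram_candidates(noun_phrases, min_exclusive=7):
    """Count noun-phrase types of length 3 and keep those with frequency f > 7.

    `noun_phrases` is an iterable of token lists, one per chunked NP
    (an assumed input format; the real chunker output may differ).
    """
    counts = Counter(
        tuple(tokens) for tokens in noun_phrases if len(tokens) == 3
    )
    # Keep only candidate types whose frequency exceeds the threshold.
    return {ngram: f for ngram, f in counts.items() if f > min_exclusive}
```

Noun phrases of other lengths are simply ignored here, mirroring the restriction to trigram candidates described above.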

**Biomedical Terminology.** Terms are usually referred to as the linguistic surface manifestation of concepts. For our purposes of evaluating the quality of different measures in recognizing biomedical multi-word terminology from the biomedical literature, we take every trigram candidate to be a biomedical term (i.e., a true positive) if it was found in the 2004 UMLS METATHESAURUS [Ref. 2]. For example, the word trigram “*phorbol myristate acetate*” is listed as a term in one of the UMLS vocabularies, *viz*. MESH [Ref. 3], whereas “*breast cancer patients*” is not. Thus, among the 34,165 word trigram candidate types, 3,482 (10.2%) were identified as terms.
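The labeling step and the resulting proportion of terms can be illustrated with a short sketch. It assumes the reference vocabulary is available as a plain set of token tuples (a hypothetical representation); the final lines only re-derive the percentage from the two counts reported above.

```python
def label_candidates(candidates, reference_terms):
    """Mark each candidate trigram as a term (True) if it is in the reference set."""
    return {c: (c in reference_terms) for c in candidates}


def proportion_of_terms(labels):
    """Share of true positives in the whole candidate set."""
    return sum(labels.values()) / len(labels)


# Counts reported in the text: 3,482 terms among 34,165 candidate types.
reported_proportion = 3482 / 34165
print(f"{reported_proportion:.1%}")  # prints 10.2%
```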

**Paradigmatic Modifiability of Terms.** The linguistic property around which we built our measure of termhood is the *limited paradigmatic modifiability* of multi-word terminological units. For example, a trigram multi-word expression such as “*phorbol myristate acetate*” contains three word/token slots in which slot 1 is filled by “*phorbol*”, slot 2 by “*myristate*” and slot 3 by “*acetate*”. The *limited paradigmatic modifiability*, *P-Mod*, of such a trigram is now defined by the probability with which one or more such slots *cannot* be filled by other tokens, i.e., the tendency not to let other words appear in various slot combinations. To arrive at the various combinatory possibilities that fill these slots, the standard combinatory formula without repetitions, *n* over *k* (an unordered selection), can be used where *1 < k <= n*. Then, for each combinatory possibility for a particular *k* (here, *k* is actually a placeholder for any possible word/token and its frequency), the *k-modifiability*, *mod-k*, is computed. Then, the paradigmatic modifiability, *P-Mod*, of an n-gram is the product of all its k-modifiabilities.
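One possible reading of this definition is sketched below. The text does not spell out how *mod-k* is computed, so the sketch assumes that, for each selection of *k* slots, the n-gram frequency is divided by the frequency of the pattern in which those slots may be filled by any token, and that *mod-k* is the product of these ratios over all selections. The wildcard marker, the function names, and the choice to count pattern frequencies over whatever n-gram counts are passed in are all assumptions. The range of *k* is passed in explicitly rather than fixed.

```python
from collections import Counter
from itertools import combinations
from math import prod

WILDCARD = "*"  # hypothetical marker for a slot that any token may fill


def wildcard(ngram, slots):
    """Return the n-gram with the given slot positions replaced by the wildcard."""
    return tuple(WILDCARD if i in slots else tok for i, tok in enumerate(ngram))


def pattern_frequencies(ngram_counts):
    """Sum n-gram frequencies for every wildcarded pattern they match."""
    freqs = Counter()
    for ngram, f in ngram_counts.items():
        n = len(ngram)
        for k in range(1, n + 1):
            for slots in combinations(range(n), k):
                freqs[wildcard(ngram, slots)] += f
    return freqs


def mod_k(ngram, k, ngram_counts, pattern_freqs):
    """Assumed k-modifiability: product of f(ngram) / f(pattern) over all k-slot selections."""
    f = ngram_counts[ngram]
    return prod(
        f / pattern_freqs[wildcard(ngram, slots)]
        for slots in combinations(range(len(ngram)), k)
    )


def p_mod(ngram, ks, ngram_counts, pattern_freqs):
    """P-Mod as the product of the k-modifiabilities for every k in `ks`."""
    return prod(mod_k(ngram, k, ngram_counts, pattern_freqs) for k in ks)
```

Under this reading, a trigram whose slots are rarely filled by other tokens gets a score close to 1, while a freely modifiable one gets a score close to 0, so candidates would be ranked by descending score.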

**Methods of Evaluation.** We examine various *m*-highest ranked portions of the candidates returned by a particular measure, which allows for the plotting of standard precision and recall graphs for the whole candidate set. We evaluate our *P-Mod* measure against the widely used C-value and also against the t-test measure, which yields good results in general-language collocation extraction studies. Our baseline is defined by the proportion of true positives (i.e., the proportion of terms) in our trigram candidate set, which is equivalent to the likelihood of finding one by blindly picking from it.
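For orientation, the two comparison measures can be sketched in their commonly cited textbook forms. This is not necessarily the exact configuration used in the study: extending the t-test to trigrams by multiplying three unigram probabilities is an assumption, as are the function and parameter names.

```python
from math import log2, sqrt


def t_score(ngram, ngram_freq, word_freq, total):
    """t-test score against the null hypothesis that the words occur independently.

    `word_freq` maps each word to its corpus frequency and `total` is the
    corpus size used to turn frequencies into probabilities (assumed inputs).
    """
    observed = ngram_freq / total
    expected = 1.0
    for word in ngram:
        expected *= word_freq[word] / total
    # The sample variance is approximated by the sample mean (rare events).
    return (observed - expected) / sqrt(observed / total)


def c_value(candidate, freq, longer_freqs):
    """C-value of a candidate.

    `longer_freqs` lists the frequencies of longer candidate terms that
    contain `candidate`; it is empty if the candidate is not nested.
    """
    weight = log2(len(candidate))
    if not longer_freqs:
        return weight * freq
    return weight * (freq - sum(longer_freqs) / len(longer_freqs))
```

Both functions return a score per candidate; sorting the candidate set by descending score yields the ranked list that the evaluation examines.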

#### Results

For the word trigram candidate set, we incrementally examined portions of the ranked candidate list returned by each of the three measures we considered.

The precision values for the various portions were computed such that, at each percentage point of the list, the number of true positives (i.e., terms contained in the UMLS Metathesaurus [Ref. 2]) was divided by the overall number of candidate items returned up to that point. This yields the (descending) precision curves in Figure 1 [Fig. 1].
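The precision computation described here can be sketched as follows. The sketch assumes the ranked list is given as a sequence of booleans (True for a term), best-ranked first, and that each cut-off is rounded up to a whole number of candidates; both are assumptions, and the names are hypothetical.

```python
from math import ceil


def precision_curve(ranked_labels, step_percent=1):
    """Precision at every `step_percent` cut-off of a ranked list of booleans."""
    total = len(ranked_labels)
    curve = []
    for pct in range(step_percent, 101, step_percent):
        m = ceil(total * pct / 100)      # number of top-ranked candidates examined
        hits = sum(ranked_labels[:m])    # true positives among them
        curve.append((pct, hits / m))
    return curve
```

With 34,165 candidates, the first cut-off under this rounding covers 342 items, and the last point of the curve equals the proportion of terms in the whole candidate set.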

First, we observe that all measures outperform the baseline by far, and, thus, all are potentially useful measures of termhood. As can be clearly seen, however, our *P-Mod* measure substantially outperforms all other measures at all points examined. If we look at the corresponding values and consider 1% of the trigram list (i.e., the first 342 candidates), the precision value for *P-Mod* is 0.56 and thus 0.12 points higher than for t-test (0.44) and C-value (0.44). With increasing portions of the ranked lists considered, the precision curves start to converge toward the baseline, but *P-Mod* maintains a steady advantage.

The (ascending) recall curves in the figure above indicate which *proportion of all true positives* is identified by a particular measure at a certain point of the ranked list. In this sense, recall is an even better indicator of a particular measure’s performance. Again, our linguistically motivated terminology extraction algorithm outperforms all others, and its gain is even more pronounced than for precision. To reach 50% recall, *P-Mod* only needs to winnow 19% of the list, whereas the other two measures need to scan 10 additional percentage points. To reach 70% recall, *P-Mod* only needs to analyze 36% of the ranked list, whereas the second-placed t-test needs 51%. For 90% recall, this relation is 65% (*P-Mod*) to 80% (t-test).
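The question of how much of the ranked list must be examined to reach a given recall can be sketched in the same style. As before, the ranked list is assumed to be a sequence of booleans (True for a term), best-ranked first, and all names are hypothetical.

```python
from math import ceil


def recall_curve(ranked_labels, step_percent=1):
    """Recall at every `step_percent` cut-off of a ranked list of booleans."""
    total = len(ranked_labels)
    total_terms = sum(ranked_labels)
    curve = []
    for pct in range(step_percent, 101, step_percent):
        m = ceil(total * pct / 100)
        curve.append((pct, sum(ranked_labels[:m]) / total_terms))
    return curve


def portion_for_recall(ranked_labels, target, step_percent=1):
    """Smallest examined percentage of the list at which recall reaches `target`."""
    for pct, recall in recall_curve(ranked_labels, step_percent):
        if recall >= target:
            return pct
    return None
```

Applied to the rankings of two measures, a smaller returned percentage for the same target recall means fewer candidates have to be inspected to find the same share of terms.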

#### Discussion and Conclusions

We proposed a new terminology identification algorithm and showed that it substantially outperforms some of the standard measures in distinguishing terms from non-terms in the biomedical literature. While mining biomedical text for new terms and assembling these in controlled vocabularies is an overall complex task involving several components, one essential building block is a measure indicating the *degree of termhood* of a candidate. In this respect, our study has shown that an algorithm incorporating a vital linguistic property of terms, *viz*. their limited *paradigmatic modifiability*, can be a much more powerful part of a terminology extraction system (such as the one proposed by [Ref. 4]) than the standard measures typically employed.

In general, a high-performing biomedical term identification system is not only valuable for collecting new terms per se but is also essential in updating already existing terminology resources. As a concrete example, the term “*cell cycle*” is contained in MESH and the term “*cell cycle arrest protein BUB2*” in the MESH supplementary concept records, which include many proteins with a GenBank identifier. The word trigram “*cell cycle arrest*”, however, is not included in MESH although it is ranked in the top 5% of *P-Mod*. Utilizing this prominent ranking, the missing semantic link can be established between these two terms, both by including the trigram in the MESH hierarchy and by linking it via UMLS to the GENE ONTOLOGY (GO), in which it is listed as a stand-alone term.
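Spotting such update candidates can be sketched as a simple filter over the ranked list: take the top-ranked portion and keep what the existing vocabulary does not yet contain. The 5% default mirrors the example above; the function name and the representation of the vocabulary as a set of token tuples are assumptions.

```python
from math import ceil


def update_candidates(ranked_candidates, known_terms, top_percent=5):
    """Top-ranked candidates that are missing from an existing vocabulary."""
    cutoff = ceil(len(ranked_candidates) * top_percent / 100)
    return [c for c in ranked_candidates[:cutoff] if c not in known_terms]
```

The returned items are only suggestions; whether a highly ranked trigram belongs in a controlled vocabulary would still need to be decided by a curator.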

### References

1. Frantzi K, Ananiadou S, Mima H. Automatic Term Recognition of Multi-Word Terms: the C/NC value method. Journal of Digital Libraries 2000; 3(2): 115-130.
2. UMLS. Unified Medical Language System. Bethesda, MD: National Library of Medicine; 2004.
3. MESH. Medical Subject Headings. Bethesda, MD: National Library of Medicine; 2004.
4. Nenadic G, Spasic I, Ananiadou S. Terminology-Driven Mining of Biomedical Literature. Journal of Biomedical Informatics 2003; 13: 1-6.