GMS | GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS) | Extraction of TNM Codification from German Pathology Reports

GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

7-10 September 2014, Göttingen

Extraction of TNM Codification from German Pathology Reports

Meeting Abstract

J. Fluck - Fraunhofer Institut SCAI, Sankt Augustin
P. Senger - Fraunhofer Institut SCAI, Sankt Augustin
L. Griebel - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen
I. Leb - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen

GMDS 2014. 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Göttingen, 07.-10.09.2014. Düsseldorf: German Medical Science GMS Publishing House; 2014. DocAbstr. 369

doi: 10.3205/14gmds064, urn:nbn:de:0183-14gmds0649

Published: 4 September 2014

© 2014 Fluck et al.
This article is an open access article distributed under the Creative Commons license terms (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed and made publicly available, provided that the author and source are credited.

Text

Introduction and goals: Human tissue samples with high-quality annotations are central to basic science, clinical research and translational studies. The internationally standardized TNM classification of malignant tumours, developed and maintained by the Union for International Cancer Control (UICC) (http://www.uicc.org/), is a cancer staging system used for all solid tumours. The T component describes the size of the (primary) tumour and whether it has invaded nearby tissue, the N component describes the (regional) lymph nodes that are involved, and the M component describes distant metastasis [1]. These tumour classifications are key annotations for finding suitable samples in the tumour collections of pathology departments and tumour banks. Modern routine documentation systems allow structured documentation of pathology samples, including tumour classifications. With structured tumour classifications, searching biobanks or deriving overviews of their content is straightforward.

Unfortunately, in many current pathology documentation systems the tumour classification and other important clinical information are available only as free text in pathology reports. At the Universitätsklinikum Erlangen, for example, a new pathology documentation system is already integrated into the routine documentation system, but over 500,000 sample documentations are still available only as textual pathology reports. This makes it necessary to include a text mining step in the ETL (Extract, Transform, Load) processes that extracts the tumour classification information and converts it into a structured, machine-readable representation.

This paper describes a solution that extracts the TNM codification and converts it into structured information. The solution uses a system of regular expressions that detect the different spelling variants of TNM codifications in the text. The text mining approach is integrated into a cloud service developed in the Trusted Cloud project cloud4health (http://www.cloud4health.de/). Further details of the overall extraction process and the architecture can be found in [2].

Materials and methods: A. Corpus of electronic pathology reports: The anonymized German pathology reports were provided by the University of Erlangen-Nürnberg (http://www.uni-erlangen.de/einrichtungen/fakultaeten/med/) and the RHÖN-KLINIKUM AG (http://www.rhoen-klinikum-ag.com). The corpus contains 4,000 reports in total, one for each case. The textual layout differs from report to report and between departments and hospitals. A manual gold standard containing the correct (sub-)codifications of 96 pathology reports was provided by the University of Erlangen-Nürnberg in order to measure the performance of the extraction. Further annotations made by students are available for comparison and error analysis. In addition, a guideline was developed for the correct extraction of relevant codes, e.g. to pick the last-mentioned TNM classification if a pathology report contains more than one.
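Scoring extracted codes against a gold standard of this kind could look like the following minimal sketch. It computes precision, recall and the F-score (the harmonic mean of the two). The function name `f_score`, the dictionary layout and the exact-string match criterion are assumptions made for illustration; the abstract does not specify how matches were counted.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F-score) for extracted codes.

    predicted, gold: dicts mapping a report id to the code string
    (or None if no code was extracted / annotated).
    Assumption: a prediction counts as correct only on exact string match.
    """
    tp = sum(1 for rid, code in predicted.items()
             if code is not None and gold.get(rid) == code)
    fp = sum(1 for rid, code in predicted.items()
             if code is not None and gold.get(rid) != code)
    fn = sum(1 for rid, code in gold.items()
             if code is not None and predicted.get(rid) != code)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three annotated reports, one of them missed.
gold = {"r1": "pT2", "r2": "pT1", "r3": "pT3", "r4": None}
predicted = {"r1": "pT2", "r2": "pT1", "r3": None, "r4": None}
precision, recall, f1 = f_score(predicted, gold)
```

In this toy case no false code is extracted but one annotated code is missed, so precision is 1.0 while recall drops to 2/3.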

B. Identification and Extraction of TNM Codifications: A system of nine different regular expressions was developed within the UIMA (Unstructured Information Management Architecture) framework in order to identify text passages containing the TNM codification in a scalable environment. The system is flexible enough to identify full and partial TNM codifications, which is necessary to handle the many spelling variants found across hospitals. In a second step, each identified codification is analyzed and split into its parts by further regular expressions. The algorithm extracts all relevant prefixes and suffixes of the T, N, M and optional parts of the code and interprets them. The last step identifies the last codification in the text with a simple sorting algorithm that uses the offset of each hit in the report. The identified codification is returned from the system in a standardized ODM (Operational Data Model) format.
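The detection and "last hit" steps can be illustrated with a small Python sketch. It is not the authors' implementation: the original system consists of nine regular expressions running in UIMA, which are not given in the abstract. The single pattern below, the names `TNM_PATTERN`, `find_tnm` and `last_tnm`, and the German sample sentence are all assumptions chosen for illustration.

```python
import re

# Hypothetical, simplified pattern (the nine original expressions are not published here).
PREFIX = r"[a-z]{0,3}"                 # lower-case prefix modifiers such as y, p
INFO = r"(?:\([^)]*\))?"               # optional additional information in parentheses
MAIN = (PREFIX
        + r"(?:T(?:is|[0-4Xx][a-d]?)|N[0-3Xx][a-c]?|M[01Xx][a-c]?)"
        + INFO)
EXTRA = r"(?:L|V|Pn|R|G)[0-9Xx][a-c]?"  # optional parts following the main classes

TNM_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])"                 # do not start inside a word
    + MAIN
    + r"(?:[\s,]+(?:" + MAIN + "|" + EXTRA + r"))*"
    + r"(?![A-Za-z0-9])"                # do not end inside a word
)


def find_tnm(text):
    """Return (offset, expression) for every full or partial TNM hit."""
    return [(m.start(), m.group(0)) for m in TNM_PATTERN.finditer(text)]


def last_tnm(text):
    """Return the hit with the largest offset, or None if there is none."""
    hits = sorted(find_tnm(text), key=lambda hit: hit[0])
    return hits[-1][1] if hits else None


report = "Vorbefund: pT1 pN0. Aktuell: ypT2 ypN1(2/4) pM1(HEP) L1 V0 Pn0."
print(last_tnm(report))  # prints: ypT2 ypN1(2/4) pM1(HEP) L1 V0 Pn0
```

Because the pattern requires only one main component, a partial codification such as `pN1` on its own is also found. A production system would need further variants, for example for codes split across lines.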

Results: In a first step, the approach described above identifies the whole TNM expression. An example of such a classification is ‘ypT2 ypN1(2/4) pM1(HEP) L1 V0 Pn0’.

The lower-case letters (y, p) are prefix modifiers that describe, in this example, the diagnostic method (p = pathologic examination) or the state of the tumour (y = status after chemotherapy). The upper-case TNM letters are followed by the class and additional information in parentheses. The correct interpretation of the example above would be T=2 (tumour size = 2 cm), N=1 (2/4) (tumour has spread to the closest or a small number of regional lymph nodes, 2 out of 4) and M=1 (HEP) (distant metastases, liver). In addition to the main classes, further information is optional. In the example, information on invasion into lymphatic vessels (L), invasion into veins (V) and perineural invasion (Pn) is given. Other information such as tumour grading or tumour scores (e.g. the Gleason score for prostate cancer) may also be given. With the text mining approach used here, the overall TNM codification term (as shown in the example) is identified with an F-score (harmonic mean of precision and recall) of 94%. Almost no false codes are extracted, and the main recall error is the partial recognition of TNM terms. The main TNM classes were interpreted correctly with an F-score of 82%. We are currently conducting an error analysis of the whole training corpus to identify additional spelling variants of TNM classifications and further optimize performance. In parallel, a machine learning (ML) system is being trained in order to compare rule-based and ML-based interpretation.
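The interpretation step for the example expression can be sketched as follows. This is an illustrative assumption, not the authors' code: the names `MAIN_RE`, `EXTRA_RE`, `PREFIX_MEANINGS` and `interpret_tnm` are hypothetical, only the two prefix meanings stated in the text are included, and the sketch assumes that the parenthesised information contains no whitespace.

```python
import re

# Prefix meanings as given in the text; other prefixes are left uninterpreted.
PREFIX_MEANINGS = {
    "p": "pathologic examination",
    "y": "status after chemotherapy",
}

# prefixes, main letter, class, optional information in parentheses
MAIN_RE = re.compile(r"([a-z]*)([TNM])(is|[0-4Xx][a-d]?)(?:\(([^)]*)\))?")
# optional parts: lymphatic vessels (L), veins (V), perineural invasion (Pn), ...
EXTRA_RE = re.compile(r"(Pn|L|V|R|G)([0-9Xx][a-c]?)")


def interpret_tnm(expression):
    """Split a TNM expression into a dictionary of its components."""
    result = {}
    # Assumption: components are separated by whitespace or commas.
    for token in re.split(r"[\s,]+", expression.strip()):
        main = MAIN_RE.fullmatch(token)
        if main:
            prefixes, letter, cls, info = main.groups()
            result[letter] = {
                "prefixes": list(prefixes),
                "class": cls,
                "info": info,
            }
            continue
        extra = EXTRA_RE.fullmatch(token)
        if extra:
            result[extra.group(1)] = extra.group(2)
    return result


parsed = interpret_tnm("ypT2 ypN1(2/4) pM1(HEP) L1 V0 Pn0")
for prefix in parsed["T"]["prefixes"]:
    print(prefix, "=", PREFIX_MEANINGS.get(prefix, "unknown"))
```

For the example, `parsed["N"]` holds the prefixes `y` and `p`, the class `1` and the additional information `2/4`, while `L`, `V` and `Pn` map to `1`, `0` and `0`. Tokens that match neither pattern are silently skipped, which mirrors the partial-recognition behaviour but would need explicit error reporting in practice.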

Discussion: The described solution enables the automatic transformation of TNM codifications in unstructured text into structured information. The identification of the whole TNM term works almost without errors, and the automatic interpretation of the TNM terms returns highly accurate values for the main classes. The approach is flexible and easy to adapt to new hospitals and departments. Currently, the automatic process is still validated manually. First comparisons of the automatic text mining approach with the annotations made by students indicate that the results are at least as good as manual annotation. In a next step, the approach will be used to transfer the pathology data (initially only the main TNM classes) from the old to the new documentation system. Processes of this kind help to automate ETL within a hospital, even when the information resides in unstructured text resources such as pathology reports.

References

1. UICC. TNM Classification. Available from: http://uicc.org/resources/tnm
2. Griebel L, Leb I, Christoph J, Laufer J, Marquardt K, Prokosch HU, Toddenroth D, Sedlmayr M. Cloud-Architektur für die datenschutzkonforme Sekundärnutzung strukturierter und freitextlicher Daten. Proceedings of the eHealth2013. 2013. p. 59-64.
