gms | German Medical Science

63rd Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS)

German Association for Medical Informatics, Biometry and Epidemiology

2-6 September 2018, Osnabrück

Development of a web-based platform for information extraction and full-text mining of medical findings of acute myeloid leukemia (AML)

Meeting Abstract

  • Hans Laser - Hannover Medical School, Centre of Information Management (ZIMt), Hannover, Germany
  • Svetlana Gerbel - Hannover Medical School, Centre of Information Management (ZIMt), Hannover, Germany
  • Felix Struckmann - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany
  • Yannik Wissner - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany
  • Iyas Hamwi - Hannover Medical School, Department of Hematology, Hemostasis, Oncology and Stem Cell Transplantation, Hannover, Germany
  • Michael Heuser - Hannover Medical School, Department of Hematology, Hemostasis, Oncology and Stem Cell Transplantation, Hannover, Germany
  • Christian Wartena - University of Applied Sciences Hannover, Faculty III – Media, Information and Design, Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 241

doi: 10.3205/18gmds159, urn:nbn:de:0183-18gmds1598

Published: 27 August 2018

© 2018 Laser et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Acute myeloid leukemia (AML) is a rare disease (ORPHA: 519) characterized by a blast concentration of ≥ 20% in bone marrow or peripheral blood. The leukemia research group at Hannover Medical School treats AML and investigates its development. Bone marrow findings recorded as free text are an important source for leukemia research, but searching them manually requires considerable effort [1]. The goal of this project, carried out by an interdisciplinary team, was to develop an intuitive platform that enables full-text search of the findings texts, extracts particularly relevant named entities (e.g. blast fraction or the ratio of the erythropoietic to the granulocytic cell series) and makes them available for further processing [2].

Methods: For the purpose of information extraction, 12,744 data sets of bone marrow findings were provided from a clinical database.

To find numerical values for quantities such as '0.3:1' and percentages such as '5%', a standard pipeline of sentence splitting, tokenization and part-of-speech (POS) tagging was used. The pipeline was implemented in Python using the Natural Language Toolkit (NLTK), with TreeTagger for lemmatization and POS tagging. Entities were identified using a list of relevant named entities, complemented by a synonym list that can be extended dynamically.
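The preprocessing steps can be illustrated with a minimal sketch. It uses only Python's standard library in place of NLTK and TreeTagger, omits POS tagging and lemmatization, and relies on a hypothetical synonym list and English sample text; none of these names or patterns come from the project itself.

```python
import re

# Hypothetical synonym list: surface form (lowercase) -> canonical entity name.
SYNONYMS = {
    "blasts": "blast fraction",
    "blast fraction": "blast fraction",
    "e:g ratio": "E:G ratio",
    "erythropoiesis to granulopoiesis": "E:G ratio",
}

# Split after sentence-final punctuation that is followed by whitespace,
# so the dot inside "0.3:1" does not end a sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Ratios (0.3:1) and percentages (5%) are kept as single tokens;
# everything else becomes a word or a single punctuation mark.
TOKEN = re.compile(
    r"\d+(?:[.,]\d+)?\s*:\s*\d+(?:[.,]\d+)?"  # ratio
    r"|\d+(?:[.,]\d+)?\s*%"                   # percentage
    r"|\w+|[^\w\s]"                           # word or punctuation
)


def split_sentences(text):
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    return TOKEN.findall(sentence)


def find_entities(sentence):
    """Return canonical names of all entities whose synonym occurs in the sentence."""
    lowered = sentence.lower()
    found = []
    for surface, canonical in SYNONYMS.items():
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered) and canonical not in found:
            found.append(canonical)
    return found


text = "Bone marrow is hypercellular. Blasts 5% of nucleated cells. E:G ratio 0.3:1."
for sentence in split_sentences(text):
    print(tokenize(sentence), find_entities(sentence))
```

Extending the synonym list then amounts to adding entries to `SYNONYMS`; no other part of the sketch needs to change.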

Regular expressions were used to extract the numeric values or rough indications (‘almost none’, ‘low percentage’, etc.) given for the entities. Because an entity and its value can be separated by the structure of a sentence, candidate values were scored and the candidate with the highest score was selected.
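A minimal sketch of this candidate selection follows. The abstract does not describe the scoring function, so the inverse character distance used here is an assumption, as are the patterns, the function names and the sample sentences.

```python
import re

# Numeric candidates: ratios such as 0.3:1 and percentages such as 5%.
NUMERIC = re.compile(r"\d+(?:[.,]\d+)?\s*:\s*\d+(?:[.,]\d+)?|\d+(?:[.,]\d+)?\s*%")
# Rough indications; only the two examples named in the text are listed.
ROUGH = re.compile(r"almost none|low percentage", re.IGNORECASE)


def candidates(sentence):
    """Return (start offset, matched text) for every value candidate."""
    found = [(m.start(), m.group()) for m in NUMERIC.finditer(sentence)]
    found += [(m.start(), m.group()) for m in ROUGH.finditer(sentence)]
    return sorted(found)


def best_value(sentence, entity):
    """Pick the candidate with the highest score for the given entity mention."""
    entity_pos = sentence.lower().find(entity.lower())
    found = candidates(sentence)
    if entity_pos < 0 or not found:
        return None

    # Assumed score: candidates closer to the entity mention score higher.
    def score(candidate):
        start, _text = candidate
        return 1.0 / (1 + abs(start - entity_pos))

    return max(found, key=score)[1]


sentence = "Blasts 5%, erythropoiesis reduced, E:G ratio 0.3:1."
print(best_value(sentence, "Blasts"))
print(best_value(sentence, "E:G ratio"))
```

A distance-based score is only one option; a score could equally take POS tags or the position of the value relative to the entity into account.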

Using Apache Solr, an API was created that allows full-text search on the pre-processed texts and passes the results to a PHP application for display as a web page.
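As an illustration of how a client might address such a search API, the sketch below builds a request URL for Solr's standard `select` handler. The host, the core name `findings` and the field name `text` are hypothetical, and Python is used here only for consistency with the preprocessing; the project's front end is a PHP application.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "findings"                         # hypothetical core name


def build_select_url(query, rows=10, field="text"):
    """Build a URL for Solr's /select handler that returns JSON results."""
    params = {
        "q": f"{field}:({query})",  # search within one (hypothetical) field
        "rows": rows,               # maximum number of hits to return
        "wt": "json",               # response format
    }
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


print(build_select_url("blasts"))
```

The sketch does not escape Solr query syntax characters in user input; a real client would need to handle that before building the query.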

To standardize and simplify the website that hosts the search engine, the Bootstrap framework was used as the design and technical foundation.

Results: The full-text search is operated via a web-based user interface. All records (100%) were loaded into the Solr index. A synonym of a named entity was identified in 43% (approx. 5,460) of the 12,744 records. Numeric values or rough indications could be made available in the search results for 80% (approx. 4,365) of these records. The value of the quantity identified by the named entity was displayed correctly wherever the defined rules covered it.
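The reported counts can be cross-checked with a few lines of arithmetic. The check suggests that the 80% refers to the records with an identified entity rather than to all 12,744 records; this is a reading of the published numbers, not a statement taken from the authors.

```python
total_records = 12_744  # records loaded into the index
matched = 5_460         # records with an identified entity (approximate)
with_value = 4_365      # records with an extracted value (approximate)

share_matched = matched / total_records
share_value_of_matched = with_value / matched
share_value_of_total = with_value / total_records

print(f"entity identified:          {share_matched:.1%} of all records")
print(f"value found (of matched):   {share_value_of_matched:.1%}")
print(f"value found (of all):       {share_value_of_total:.1%}")
```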

A semi-automated pipeline was created to run these processing steps. It loads the data sets, performs preprocessing using the methods developed, and makes the results available to the user interface for full-text queries via the Solr API.
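The load, preprocess and publish stages could be wired together roughly as follows. This is a sketch under stated assumptions: the sample records, the field names and the placeholder preprocessing are hypothetical, and the result is a JSON array of documents of the kind Solr's update handler accepts, without actually sending it anywhere.

```python
import json


def load_records():
    # Stand-in for the export from the clinical database (hypothetical sample data).
    return [
        {"id": "1", "text": "Blasts   5% of\nnucleated cells."},
        {"id": "2", "text": "E:G ratio 0.3:1."},
    ]


def preprocess(record):
    # Placeholder for entity and value extraction; only whitespace is normalized here.
    return {"id": record["id"], "text": " ".join(record["text"].split())}


def to_solr_payload(records):
    """Serialize preprocessed records as a JSON array of documents for indexing."""
    docs = [preprocess(record) for record in records]
    return json.dumps(docs, ensure_ascii=False)


payload = to_solr_payload(load_records())
print(payload)
```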

Discussion: The platform was successfully implemented for the working group. The synonym list remains extensible and can be continuously improved. Additional synonyms are expected to raise the share of records with an identified entity significantly above the 43% achieved.

Data heterogeneity continues to limit text mining. The situation could be significantly improved by establishing gold standards (templates) for primary medical documentation.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Liu K, Mitchell KJ, Chapman WW, Crowley RS. Automating tissue bank annotation from pathology reports - comparison to a gold standard expert annotation set. AMIA Annu Symp Proc. 2005:460-4.
2. Ruch P, et al. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artif Intell Med. 2003;29(1):169-84.