gms | German Medical Science

GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

7-10 September 2014, Göttingen

Summarization of EHR using information extraction, sentiment analysis and word clouds

Meeting Abstract


  • Y. Deng - Universität Leipzig, Leipzig
  • K. Denecke - Universität Leipzig, Leipzig

GMDS 2014. 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Göttingen, 07.-10.09.2014. Düsseldorf: German Medical Science GMS Publishing House; 2014. DocAbstr. 135

doi: 10.3205/14gmds067, urn:nbn:de:0183-14gmds0672

Published: 4 September 2014

© 2014 Deng et al.
This is an Open Access article distributed under the terms of the Creative Commons license (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed and made publicly available, provided that the author and source are credited.


Text

Introduction and Research Questions: When patients are monitored over a long period of time, it is crucial at each appointment to get a prompt overview of the progress and changes in the patient's status. Such information is documented in clinical narratives. As medical treatment proceeds, the volume of these narratives for each patient increases rapidly, and the large amount of patient data can easily overwhelm the physicians' processing capacity. This overflow of patient records may lead to the following practical problems. First, it is increasingly difficult for physicians to get a rapid overview of the patient status. Second, physicians can only search the patient records by keyword or Boolean query; querying semantic aspects such as symptoms, opinions, intentions, and judgments is impossible. Third, summarizing the patient status and treatment procedures is highly labor-intensive and time-consuming. In particular, writing the discharge summary of differential diagnoses still requires perusing all the previous judgments and diagnoses.

To handle the “big data” problem of clinical records and offer physicians swift access to the patient status, we have defined a novel processing pipeline based on information extraction, sentiment analysis, word clouds and summarization technologies. The feasibility of the approach is evaluated through linguistic analysis and user studies.

Materials and Methods: Clinical Information Extraction: The first step of the processing pipeline is the extraction process, which forms the basis for the further analysis. First, linguistic elements, sentiment terms and medical terminology are extracted. Linguistic elements are surface features of the text and include numbers, stop words, punctuation and parts of speech (noun, verb, adjective, adverb, pronoun, etc.). Next, the sentiment terms in clinical narratives are obtained through dictionary taggers based on the Subjectivity Lexicon [1]. The sentiment terms indicate a patient’s situation, e.g. “malignant” indicates a negative outcome, while “benign” represents a positive result of a clinical investigation. Medical terms referring to symptoms, diseases and anatomical concepts are identified and matched with concepts of standard medical terminologies. We used MetaMap [2] to map the documents to UMLS [3] concepts. The extracted information is exploited by the two analysis methods described in the following.
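The dictionary-based tagging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-entry lexicon, the simple regular-expression tokenizer and the function names are assumptions made for illustration, and the real Subjectivity Lexicon [1] is far larger and has its own file format, which is not reproduced here. The second function shows one plausible way to measure the share of tokens that match the lexicon.

```python
import re

# Hypothetical miniature lexicon, for illustration only.
TOY_LEXICON = {
    "malignant": "negative",
    "benign": "positive",
    "stable": "positive",
    "worsening": "negative",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag_sentiment_terms(text, lexicon):
    """Return (token, polarity) pairs for every token found in the lexicon."""
    return [(tok, lexicon[tok]) for tok in tokenize(text) if tok in lexicon]


def lexicon_coverage(text, lexicon):
    """Return the share of tokens that appear in the lexicon (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


report = "The lesion appears benign. No malignant cells were found."
print(tag_sentiment_terms(report, TOY_LEXICON))
print(lexicon_coverage(report, TOY_LEXICON))
```

Note that a plain dictionary lookup tags “malignant” as negative even though it is negated in the example sentence, which is why negations have to be handled separately at the analysis stage.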

Sentiment Analysis: Clinical sentiments are found in the recommendations, suggestions and judgments expressed in differential diagnoses, suspicions, operation results and treatment outcomes. The impressions of nurses on a patient’s health status, as documented in nurse letters or similar texts, are also considered. Our system uses a voting algorithm that counts the positive and negative terms as well as the negations at the document level, so that a polarity category (positive, negative or neutral) can be assigned to each document.
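A document-level vote of this kind might look like the following Python sketch. The abstract does not specify how negations enter the vote, so the rule used here is an assumption: a negation word flips the polarity of sentiment terms that follow it within a small token window, and ties are labeled neutral. The word lists, the window size and the function name are hypothetical as well.

```python
import re

# Hypothetical toy resources, for illustration only.
POLARITY = {"benign": 1, "improved": 1, "stable": 1,
            "malignant": -1, "pain": -1, "worse": -1}
NEGATIONS = {"no", "not", "without", "denies"}


def document_polarity(text, polarity=POLARITY, negations=NEGATIONS, window=3):
    """Classify a document as positive, negative or neutral by majority vote.

    Assumption: a negation word flips the polarity of sentiment terms that
    occur at most `window` tokens after it.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = negative = 0
    last_negation = None  # index of the most recent negation word
    for i, tok in enumerate(tokens):
        if tok in negations:
            last_negation = i
            continue
        if tok not in polarity:
            continue
        vote = polarity[tok]
        if last_negation is not None and i - last_negation <= window:
            vote = -vote  # negated term votes for the opposite class
        if vote > 0:
            positive += 1
        else:
            negative += 1
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # no sentiment terms, or a tie


print(document_polarity("Patient reports no pain and the wound is stable."))
print(document_polarity("The scan was performed on Monday."))
```

In the first example “pain” is negated and therefore counts as a positive vote, so the document is labeled positive; the second contains no lexicon terms and is labeled neutral.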

Word Clouds: We used the OpenCloud [4] tool to generate word clouds from clinical documents. A word cloud can be seen as a summary of the relevant aspects of a document or document set. For our study, the word tags are selected from a text either (1) by frequency alone (bag of words) or (2) based on their part of speech (POS). The first type of tags (bag of words) is generated from all the words of a document except stop words. The 50 most frequent tokens are shown in the word cloud, with their size depending on their frequency in the document, and all tags are rendered in the same color. To generate the second type of tags, the part-of-speech information from the extraction step is carried over into the word clouds. As an intuitive choice, we highlight nouns, verbs and adjectives with the three primary colors (red, yellow and blue) because these lexical categories play a decisive role in conveying meaning.
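The tag selection described above can be sketched in Python as follows. This sketch does not use the OpenCloud API; it only illustrates the selection logic. Several details are assumptions: the toy stop-word list, the linear scaling of font size with frequency, the assignment of red, yellow and blue to nouns, verbs and adjectives in that order, and the use of black for all other tokens. The part-of-speech tags are passed in as a hypothetical word-to-tag mapping that would come from the extraction step.

```python
import re
from collections import Counter

# Toy stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "with", "was"}

# Assumed color assignment; the text names the colors but not the mapping.
POS_COLORS = {"NOUN": "red", "VERB": "yellow", "ADJ": "blue"}


def build_tags(text, pos_tags=None, top_n=50, min_size=10, max_size=40):
    """Select the top_n most frequent non-stop-word tokens as cloud tags.

    Font size grows linearly with frequency (an assumed scaling). If
    pos_tags (word -> POS label) is given, nouns, verbs and adjectives
    are colored; every other tag is rendered in black.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens).most_common(top_n)
    if not counts:
        return []
    highest, lowest = counts[0][1], counts[-1][1]
    span = highest - lowest
    pos_tags = pos_tags or {}
    tags = []
    for word, freq in counts:
        scale = (freq - lowest) / span if span else 1.0
        size = round(min_size + scale * (max_size - min_size))
        color = POS_COLORS.get(pos_tags.get(word), "black")
        tags.append({"word": word, "count": freq, "size": size, "color": color})
    return tags


text = "Pain in the neck. Neck pain radiates to the arm. Severe pain."
bag_of_words = build_tags(text)  # variant (1): frequency only, one color
pos_based = build_tags(text, {"pain": "NOUN", "radiates": "VERB", "severe": "ADJ"})
for tag in pos_based:
    print(tag)
```

Calling `build_tags` without a POS mapping corresponds to the bag-of-words variant, in which every tag has the same color; passing the mapping gives the POS-based variant.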

Finally, the patient status can be presented to the physicians using word clouds enriched by semantic aspects. The word clouds provide an overview of the records, while the extraction results offer detailed information such as laboratory values, opinions, recommendations, and judgments.

Materials and Evaluation: To evaluate the feasibility of the sentiment analysis, 200 nurse letters and 200 radiological reports from PhysioNet [5] were chosen, together with 200 interview texts from a technical weblog as a benchmark. The accuracy of the approach was determined. Further, the word clouds were evaluated on the records of six patients suffering from problems of the cervical spine. For this purpose, physicians filled in questionnaires to assess the usefulness, the relevance of the words and the suitability of this kind of content representation. Three assistant physicians from a neurosurgical department took part in the evaluation.

Results: The comparison shows that clinical narratives contain a comparable amount of subjective terms: 6% of the terms in nurse letters and 4% of the words in radiology reports were part of the subjectivity lexicon, while 8% of the terms in the technical weblog could be matched with the subjectivity lexicon. On the annotated data set, the voting algorithm achieved moderate accuracy on nurse letters and radiology reports, with 0.42 and 0.44 respectively, while it reached an accuracy of 0.69 on the technical weblog.
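For reference, accuracy in this setting is the share of documents whose predicted polarity category equals the annotated one. The short Python sketch below shows the computation; the function name and the label lists are hypothetical illustration data, not the study's annotations.

```python
def accuracy(predicted, gold):
    """Return the share of documents whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one document is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for five documents, for illustration only.
gold_labels = ["positive", "negative", "neutral", "negative", "positive"]
predicted_labels = ["positive", "neutral", "neutral", "negative", "negative"]
print(accuracy(predicted_labels, gold_labels))
```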

The three physicians confirmed the usefulness of word clouds, and the emphasis of relevant words through their size was also well accepted. However, as we had assumed, word clouds alone cannot present the details of the records completely to the physician.

Discussion: According to the experiments and the feedback from the physicians, improvements should concentrate on the extraction phase. First, proximity-based extraction is required, since the most desirable information is found in specific sections such as the impression or conclusion. For both word clouds and sentiment analysis, the processing should also be based on text from the most representative parts of the records. Second, relations between the entities should be created, so that navigation between text entities as well as to multimedia entities becomes possible. Third, the colors of the word clouds should be adapted with the help of usability experts.


References

1. Subjectivity Lexicon. http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/ [Accessed on 27.03.2014]
2. MetaMap. http://metamap.nlm.nih.gov/ [Accessed on 27.03.2014]
3. UMLS. Unified Medical Language System. Bethesda, MD: National Library of Medicine; 2013.
4. OpenCloud. http://opencloud.mcavallo.org/ [Accessed on 27.03.2014]
5. PhysioNet. http://www.physionet.org/ [Accessed on 27.03.2014]