gms | German Medical Science

62. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

17.09. - 21.09.2017, Oldenburg

Content Analysis of free-text Nursing Reports using Topic Models

Meeting Abstract

  • Lukas Huber - UMIT - Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik GmbH, Hall in Tirol, Austria
  • Alexander Hörbst - Private Universität für Gesundheitswissenschaften, Medizinische Informatik und Technik, Hall in Tirol, Austria
  • Franz Rauchegger - tirol kliniken GmbH, Innsbruck, Austria
  • Werner Hackl - UMIT, Hall in Tirol, Austria

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 62. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Oldenburg, 17.-21.09.2017. Düsseldorf: German Medical Science GMS Publishing House; 2017. DocAbstr. 167

doi: 10.3205/17gmds193, urn:nbn:de:0183-17gmds1937

Published: August 29, 2017

© 2017 Huber et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at



Introduction: Nursing documentation is captured in different formats and contains valuable information on the progression and status of patients. For secondary use, mainly data elements from structured documentation are used. However, the content of free-text documentation can also be a valuable source of information, although data preparation and analysis are more complex for free-text data. The goal of this work is to demonstrate one possible way of analyzing free-text reports and to present the insights gained from the analysis of nursing reports in an understandable and interactive way using LDA topic models.

Methods: A systematic analytics guideline [1] was used for the analysis. Its phases are: *Define* the problem, here the search for topics contained in nursing narratives; *Import* the data from the data source; and *Tidy*, which covered data cleaning and general preprocessing. These are followed by the *Explore* phase, which comprises three steps: *Transform* the data into a format suitable for a feasible *Model*, and *Visualize* the resulting model. In the subsequent phases, *Use* and *Communicate*, the results are fed back to the potential users.

A total of 24574 nursing reports from 1114 stays on three university hospital wards were extracted and manually anonymized. After cleaning, the data were transferred to a document-term matrix. Common German stop words, punctuation, and numbers were removed, all text was converted to lower case, and all words were stemmed.
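The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny stand-in for a full German list, `stem` is a crude hypothetical suffix stripper rather than the stemmer used in the study, and the two example reports are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a full German list.
STOP_WORDS = {"der", "die", "das", "und", "ist", "im", "in", "mit", "hat"}

# Crude suffix stripper standing in for a real German stemmer (assumption).
SUFFIXES = ("en", "er", "es", "e", "n", "s")

def stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(report):
    """Lowercase, drop digits and punctuation, remove stop words, then stem."""
    text = report.lower()
    text = re.sub(r"[^a-zäöüß\s]", " ", text)
    return [stem(w) for w in text.split() if w not in STOP_WORDS]

def document_term_matrix(reports):
    """Return the sorted vocabulary and one row of term counts per report."""
    counts = [Counter(tokenize(r)) for r in reports]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix

# Made-up example reports, not taken from the study data.
reports = [
    "Patient schläft gut, keine Schmerzen.",
    "Patient gibt Schmerzen an, 2 Tabletten erhalten.",
]
vocabulary, dtm = document_term_matrix(reports)
```

Each row of `dtm` corresponds to one report and each column to one stemmed term, which is the input shape a topic model expects.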

For the modeling step, Latent Dirichlet Allocation (LDA) models [2] were used because they group text systematically by producing topics, which are groups of "coherent" terms following a specific distribution. The models were fitted using Gibbs sampling with 2000 iterations, for 10 to 210 topics in steps of 20. For visualization and for creating an interactive model representation, the R package LDAvis [3] was used.
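To make the fitting procedure concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python, together with the grid of topic counts named in the text. It is not the implementation used in the study: the function name `fit_lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are assumptions chosen for illustration, and the demo runs fewer iterations than the 2000 reported.

```python
import random

def fit_lda_gibbs(docs, n_topics, n_iter=2000, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_words = 1 + max(w for doc in docs for w in doc)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized full conditional for each candidate topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Candidate topic counts as described in the text: 10, 30, ..., 210.
topic_grid = list(range(10, 211, 20))

# Toy corpus of word ids (assumption); small settings keep the demo fast.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
doc_topic, topic_word = fit_lda_gibbs(toy_docs, n_topics=2, n_iter=200)
```

In a full run, one model would be fitted per entry in `topic_grid` and the candidates compared; the abstract does not state which selection criterion was applied.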

Results: The final model comprises 90 topics and 243 terms; limiting the sparsity retained only these 243 of 13380 terms in total. This shows the large variety of language used and the small common subset of terms. The most frequent terms were "patient", "gives", "bed", and "good". The model groups terms that occur together in documents, but word order is neglected. The result of this step was an interactive visualization in which the topics are scaled in two dimensions and topic similarity determines distance. The analysis revealed interesting insights into documentation habits and also uncovered redundant or superfluous documentation.
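The reduction from the full vocabulary to a small common subset can be illustrated with a simple sparsity filter. The sketch below is an assumption about how such a filter might work, not the routine used in the study: the function name `remove_sparse_terms`, the threshold of 0.5, and the toy matrix (including the hypothetical surname "mueller") are invented for illustration.

```python
def remove_sparse_terms(vocabulary, dtm, max_sparsity):
    """Keep terms whose share of documents lacking them is at most max_sparsity."""
    n_docs = len(dtm)
    keep = []
    for j, _term in enumerate(vocabulary):
        doc_freq = sum(1 for row in dtm if row[j] > 0)
        sparsity = 1 - doc_freq / n_docs
        if sparsity <= max_sparsity:
            keep.append(j)
    kept_vocab = [vocabulary[j] for j in keep]
    kept_dtm = [[row[j] for j in keep] for row in dtm]
    return kept_vocab, kept_dtm

# Toy document-term matrix (assumption): rows are reports, columns are terms.
vocabulary = ["patient", "bett", "mueller", "gut"]
dtm = [
    [2, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [3, 0, 0, 0],
]
kept_vocab, kept_dtm = remove_sparse_terms(vocabulary, dtm, max_sparsity=0.5)
```

In this toy example the term that appears in only one of four reports is dropped, while terms present in at least half of the reports are kept.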

Discussion: Widely used textual data can be exploited for analysis and learning using text mining techniques. The resulting model provides a well-suited representation of the nursing activities described in the reports. The visualized topics are logically structured and understandable to nursing professionals. Such models can be used for documentation training, for revealing superfluous documentation, or for detecting billable services not contained in structured documentation. This was a first attempt, and the manual anonymization was tedious. For follow-up projects, we are investigating whether this step can be omitted because of sparsity filtering. A future step could be to extend the approach to hierarchical topic models to gain further insights into topic correlations.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


[1] Huber LM. Predictive models for multivariate data sets in the medical domain. University Innsbruck; 2017.
[2] Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003;3:993–1022. DOI: 10.1162/jmlr.2003.3.4-5.993
[3] Sievert C, Shirley K. LDAvis: Interactive Visualization of Topic Models. (n.d.).