Quantifying readability and vocabulary metrics of the Austrian National Health Portal

  • Richard Zowalla - Hochschule Heilbronn - Medizinische Informatik, Heilbronn, Germany
  • Martin Wiesner - Hochschule Heilbronn - Medizinische Informatik, Heilbronn, Germany

Published: August 27, 2018

Introduction: Even for lay people with a higher level of health literacy, the vocabulary and concepts of diagnosis and treatment may not be easy to understand [1]. Low patient health literacy is a major reason for a worse prognosis, fewer preventive actions, or reduced therapy adherence [2]. In this context, the German Federal Ministry of Health outlined the concept for a “National Health Portal” in 2018 [3]. In their analysis of existing work, experts of the IQWiG reviewed several existing portals from other official sources, e.g., the ‘Public Health Portal of Austria’ (P-HPA,

Yet, to the best of the authors' knowledge, no full-scale assessment of the health-related articles published on the P-HPA is publicly available. This study aims to fill this gap. We present the results of a computer-based readability and vocabulary analysis of the P-HPA content as published in 2018.

Methods: Readability describes the properties of a written text that determine how easily it can be read, and it can be assessed with different instruments [4]. Several metrics have been adapted for the German language, e.g., the Flesch Reading Ease (FRE) or the fourth Vienna formula (4th WSTF, [5]). Beyond readability, modern methods from the field of machine learning can be leveraged [6] to compute an expert level (L={1,…,10}, [7]), which relates to the vocabulary used.
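The abstract does not state which formula variants were implemented. As a minimal sketch, assuming the commonly cited German adaptation of the FRE by Amstad (FRE = 180 − ASL − 58.5 × ASW) and the fourth Vienna formula (WSTF4 = 0.2656 × SL + 0.2744 × MS − 1.693), the two scores could be computed from simple text counts as follows. The class and method names are hypothetical and are not taken from the authors' software.

```java
/**
 * Sketch of two German readability formulas (assumed variants, see text above).
 * All inputs are plain counts obtained from a prior tokenization step.
 */
final class ReadabilityFormulas {

    private ReadabilityFormulas() {
    }

    /**
     * Assumed Amstad adaptation of the Flesch Reading Ease:
     * 180 - ASL - 58.5 * ASW, with ASL = words per sentence
     * and ASW = syllables per word. Higher values mean easier text.
     */
    static double fleschAmstad(int words, int sentences, int syllables) {
        requirePositive(words, sentences);
        double asl = (double) words / sentences;
        double asw = (double) syllables / words;
        return 180.0 - asl - 58.5 * asw;
    }

    /**
     * Assumed fourth Vienna formula (4th WSTF):
     * 0.2656 * SL + 0.2744 * MS - 1.693, with SL = words per sentence
     * and MS = percentage of words with three or more syllables.
     * The result is read as an approximate school grade level.
     */
    static double wstf4(int words, int sentences, int wordsWithThreeOrMoreSyllables) {
        requirePositive(words, sentences);
        double sl = (double) words / sentences;
        double ms = 100.0 * wordsWithThreeOrMoreSyllables / words;
        return 0.2656 * sl + 0.2744 * ms - 1.693;
    }

    // Guards against division by zero for empty texts.
    private static void requirePositive(int words, int sentences) {
        if (words <= 0 || sentences <= 0) {
            throw new IllegalArgumentException("words and sentences must be positive");
        }
    }
}
```

For a hypothetical text with 100 words, 10 sentences and 200 syllables, `ReadabilityFormulas.fleschAmstad(100, 10, 200)` evaluates 180 − 10 − 58.5 × 2 = 53 under this FRE variant.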

All articles were downloaded with the crawler4j framework; the text content was extracted and HTML-sanitized with JSoup and cleaned of remaining artifacts. For every article, the readability metrics were computed with analysis software written in Java. For sentence detection, we relied on OpenNLP and its sentence models for the German language.
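The counting step behind such metrics can be sketched as follows. This is not the authors' pipeline: to stay free of external dependencies, the sketch uses `java.text.BreakIterator` from the Java standard library as a stand-in for OpenNLP sentence detection, and a simple vowel-group heuristic as an assumed approximation of German syllable counting. It expects plain text, i.e. input that has already been extracted from HTML. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: sentence, word and syllable counts for already extracted plain text. */
final class TextStatistics {

    // A word is a run of letters; hyphenated compounds are split into parts.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");
    // Heuristic: one syllable per group of adjacent vowels (incl. German umlauts).
    private static final Pattern VOWEL_GROUP =
            Pattern.compile("[aeiouy\u00e4\u00f6\u00fc]+");

    private TextStatistics() {
    }

    /** Splits text into sentences with the JDK's locale-aware BreakIterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.GERMAN);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Extracts all letter-only tokens from the text. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    /** Approximates the syllable count of a word; returns at least 1. */
    static int countSyllables(String word) {
        Matcher m = VOWEL_GROUP.matcher(word.toLowerCase(Locale.GERMAN));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return Math.max(count, 1);
    }
}
```

The resulting counts (sentences, words, syllables per word) are the typical inputs of readability formulas such as FRE or WSTF. A trained sentence model, as used in the study, is likely to handle abbreviations and medical notation more reliably than this rule-based stand-in.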

Results: The analysis included n=2931 articles found in the sub-categories (i) diseases (n=914), (ii) laboratory and diagnosis (n=993), and (iii) life (n=1024), the latter mainly on prevention and aging. The data acquisition was carried out on April 5, 2018.

For the WSTF, the analysis yields a mean of 12.42, corresponding to 13 years of schooling. The mean FRE score is 22.74, and the mean expert level is L=5.93, which corresponds to a moderate level of vocabulary difficulty.

Articles in the sub-category ‘life’ are written with easier vocabulary (L=3.31) and require 11 years of schooling (WSTF=11.55; FRE=31.21). Articles on ‘diseases’ score a moderate expert level (L=6.18), and readers need 13 years of schooling (WSTF=12.32; FRE=22.77). By contrast, ‘laboratory’ articles use complex sentence structures (WSTF=13.35; FRE=14.44) and difficult vocabulary (L=8.26).

Note: The full results of the analysis are published as supplemental material.

Discussion: As our analysis shows, the expert level, i.e., the vocabulary knowledge required of readers, is moderate on average (L≤6) but varies widely by topic. Article authors should therefore check written material carefully and reduce expert-centric terms. Automated tooling, as used in this study, can support the text production process.

Yet, readability scores alone do not reflect the individual knowledge or motivation of patients or family members. Moreover, aspects such as illustration and typesetting were not assessed in this study.

