gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Modularised Programming to Reduce Complexity and Enhance Reusability in Data Quality Assessments

Meeting Abstract

  • Stephan Struckmann - Universitätsmedizin Greifswald, Greifswald, Germany
  • Jörg Henke - Universitätsmedizin Greifswald, Greifswald, Germany
  • Carsten Oliver Schmidt - Universitätsmedizin Greifswald, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 159

doi: 10.3205/22gmds017, urn:nbn:de:0183-22gmds0173

Published: August 19, 2022

© 2022 Struckmann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.



Introduction: For assessing data quality (DQ; i.e., the degree to which data fit requirements), data dictionaries are a valuable source of information at the variable level. Their metadata specify data requirements (e.g., measurement limits, expected probability distributions, or the assignment of process variables, such as examiners, to outcome variables) and thereby enable automated DQ reporting [1], [2].

However, data dictionaries come in different machine-readable formats, including spreadsheet formats with pre-specified columns [1], [2], [3], XML formats [4], [5], generic databases (PostgreSQL, MySQL) [2], and study-specific databases such as Opal [3]. This work shows how a modular architecture of an R analysis pipeline facilitates using metadata and data from different sources to enable DQ assessments (DQA).
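To illustrate how variable-level metadata can drive an automated check, the following minimal sketch applies measurement limits from a data dictionary to study data. It is written in Python purely for illustration and is not the R implementation described in this abstract; the `VariableMetadata` class, the `count_limit_violations` function, the variable names, and the limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariableMetadata:
    """One hypothetical data-dictionary row describing a study variable."""
    name: str
    lower_limit: Optional[float] = None
    upper_limit: Optional[float] = None


def count_limit_violations(records, dictionary):
    """Count, per variable, the observed values that fall outside the limits."""
    violations = {}
    for meta in dictionary:
        n = 0
        for record in records:
            value = record.get(meta.name)
            if value is None:
                continue  # missing values belong to a different DQ check
            if meta.lower_limit is not None and value < meta.lower_limit:
                n += 1
            elif meta.upper_limit is not None and value > meta.upper_limit:
                n += 1
        violations[meta.name] = n
    return violations


# Illustrative metadata and data (assumed values, not taken from any study).
dictionary = [
    VariableMetadata("sbp", lower_limit=60, upper_limit=260),
    VariableMetadata("age", lower_limit=18, upper_limit=110),
]
records = [
    {"sbp": 120, "age": 45},
    {"sbp": 300, "age": 17},
    {"sbp": None, "age": 52},
]
result = count_limit_violations(records, dictionary)
```

Because the check reads its thresholds from the dictionary rather than from the code, the same function can be reused for any variable for which limits have been specified.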

Methods: R packages have been implemented as part of the Square2 DQA web application [2], [6] to create configurable DQ reports from various data sources, such as Opal servers, spreadsheet files, and database management systems. One module has been developed for each supported data source and output format. Each module implements interface methods for all the applicable steps, i.e., reading, evaluating, and writing, of each input (data, metadata, report configuration) and output (analysis results).
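The module design described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Square2 code, which consists of R packages: the `SourceModule` interface, the `REGISTRY`, the `register` and `run_pipeline` functions, and the simple limit check used as the evaluation step are hypothetical stand-ins. The example covers only the reading and evaluating steps for two input formats.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class SourceModule(ABC):
    """Hypothetical interface that every data-source module implements."""

    @abstractmethod
    def read(self, text):
        """Parse raw text into a list of dict records."""


class CsvModule(SourceModule):
    def read(self, text):
        return list(csv.DictReader(io.StringIO(text)))


class JsonModule(SourceModule):
    def read(self, text):
        return json.loads(text)


# Modules are looked up by name, so new sources can be added without
# touching the pipeline itself.
REGISTRY = {}


def register(name, module):
    REGISTRY[name] = module


register("csv", CsvModule())
register("json", JsonModule())


def run_pipeline(data_source, data_text, metadata_source, metadata_text):
    """Read data and metadata via their modules, then evaluate limits."""
    data = REGISTRY[data_source].read(data_text)
    metadata = REGISTRY[metadata_source].read(metadata_text)
    report = {}
    for var in metadata:
        name = var["name"]
        lower, upper = float(var["lower"]), float(var["upper"])
        values = [float(r[name]) for r in data if r.get(name) not in (None, "")]
        report[name] = sum(1 for v in values if v < lower or v > upper)
    return report


# The same study data in two formats (assumed example values).
csv_data = "id,sbp\n1,120\n2,300\n"
json_data = '[{"id": 1, "sbp": 120}, {"id": 2, "sbp": 300}]'
json_meta = '[{"name": "sbp", "lower": 60, "upper": 260}]'

report_from_csv = run_pipeline("csv", csv_data, "json", json_meta)
report_from_json = run_pipeline("json", json_data, "json", json_meta)
```

Both calls run the same evaluation step and differ only in the module that reads the data, which is the property that makes reports from different sources comparable.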

Results and discussion: The DQ pipeline can be controlled from the graphical user interface in Square2, where users can further configure data and metadata sources and the output (e.g., variable selection, comments, and layout). Afterwards, a template-based report is generated using RMarkdown [7], flexdashboard [8], and dataquieR [1].

Based on experiences with DQA in SHIP [9] and NAKO [10], and on a review of similar pipelines from other groups, the presented software targets typical steps, data sources, and output formats in DQA workflows.

A major advantage of the modular approach is that it yields comparable DQ reports for data from different sources without requiring additional preprocessing by the user. Furthermore, the modular architecture facilitates extensions: new data sources and data formats for input and output can be installed as additional modules. The architecture also allows the modules to be reused in other DQA pipelines, for example when creating RMarkdown reports manually.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


[1] Richter A, Schmidt CO, Krüger M, Struckmann S. dataquieR: assessment of data quality in epidemiological research. Journal of Open Source Software. 2021;6(61):3093.
[2] Schmidt CO, Krabbe C, Schössow J, Albers M, Radke D, Henke J. Square2 - A Web Application for Data Monitoring in Epidemiological and Clinical Studies. Studies in Health Technology and Informatics. 2017;235:549-53.
[3] Doiron D, Marcon Y, Fortier I, Burton P, Ferretti V. Software Application Profile: Opal and Mica: open-source software solutions for epidemiological data management, harmonization and dissemination. International Journal of Epidemiology. 2017;46(5):1372-8.
[4] Clinical Data Interchange Standards Consortium. Data Exchange. 2020.
[5] HL7 FHIR. Documentation Index. 2019.
[6] Institut für Community Medicine SoHiPS. Square2webapp. 2022.
[7] Allaire JJ, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, et al. rmarkdown: Dynamic Documents for R. R package version 1.9. 2018.
[8] Iannone R, Allaire J, Borges B. flexdashboard: R Markdown Format for Flexible Dashboards. 2020.
[9] Völzke H, Alte D, Schmidt CO, Radke D, Lorbeer R, Friedrich N, et al. Cohort profile: the study of health in Pomerania. International Journal of Epidemiology. 2010;40(2):294-307.
[10] Bamberg F, Kauczor HU, Weckbach S, Schlett CL, Forsting M, Ladd SC, et al. Whole-body MR imaging in the German National Cohort: rationale, design, and technical background. Radiology. 2015;277(1):206-20.