gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

dataquieR 2.0 — Improved Functionality for Data Quality Reporting

Meeting Abstract

  • Stephan Struckmann - Universitätsmedizin Greifswald, Greifswald, Germany
  • Joany Mariño - Universitätsmedizin Greifswald, Greifswald, Germany
  • Elisa Kasbohm - Universitätsmedizin Greifswald, Greifswald, Germany
  • Carsten Oliver Schmidt - Universitätsmedizin Greifswald, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 153

doi: 10.3205/23gmds082, urn:nbn:de:0183-23gmds0821

Published: September 15, 2023

© 2023 Struckmann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Data quality assessments (DQA) are crucial for finding and handling data errors and for making data comparable and FAIR. To be useful, however, DQA results must be generated and communicated efficiently. Several frameworks for classifying data quality aspects exist, e.g., [1], [2], [3], as do R software implementations [4]. This abstract focuses on the recently updated dataquieR package.

State of the art: Among the most prominent R tools for DQA that are based on a data quality (DQ) framework are DQAstats by Kapsner et al. and dataquieR. In version 1.0.13, dataquieR covered 18 of 34 indicators (52.9%) of its underlying concept, which is one of the highest coverages but still far from 100% [4].

DQA tools typically create reports (e.g., PDF or HTML files). Usually, these are not interactive, so inspecting individual outputs can be unsatisfying: zooming may be insufficient, and finding the most relevant results in a linear document can be cumbersome. On the other hand, such documents are easy to share because no web server is needed. dataquieR previously used rmarkdown by Xie to create HTML documents because of its flexibility. However, rmarkdown is a generic tool, which slowed down report generation and left few options for improving speed. Furthermore, dataquieR used parallel computing only for selected indicators, resulting in a complicated process.

Concept: dataquieR now extends the previous HTML reporting approach with improved interactive dashboards. Re-implementing the report rendering made it possible to include more features of the DT package by Xie, e.g., exporting tables to Excel™. Functions for reproducibility have also been added. Graphical results are stored as ggplot2 objects; to present them in a report, dataquieR now employs plotly by Sievert. This yields highly interactive figures that support zooming, panning, hiding specific parts of the figure, and exporting single graphics in several formats, including vector graphics.

With the new release, dataquieR’s concept coverage increased from 52.9% to 70.6% without an increase in computation time. Most prominently, each report output now includes the respective indicator function call as R code, so that the output can be regenerated outside the reporting pipeline. This is a major advantage for user-specific assessments.

Implementation: In the report pipeline, the previous dataquieR version called some functions with more than one outcome variable at once, which generated complicated, nested result lists. The calling mechanism has been rewritten so that dataquieR now calls each data quality indicator function separately for each outcome variable. As a consequence, the result is a simple list structure (with one entry per indicator function and outcome variable), and all function calls can run in parallel. This modification allowed for more generic rendering functions featuring, e.g., plotly support. In addition, re-implementing the report rendering with htmltools by Cheng, instead of rmarkdown, resulted in a leaner reporting engine.
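
The calling scheme described above can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than R and does not use any dataquieR code: the indicator functions `missing_fraction` and `observed_mean`, the `run_report` helper, and the example data are all hypothetical stand-ins. The sketch only shows the idea of one call per indicator function and outcome variable, collected in a flat structure and dispatched to a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


# Hypothetical stand-ins for data quality indicator functions. They are
# NOT dataquieR functions; each one simply takes a single variable.
def missing_fraction(values):
    """Share of missing (None) entries in one variable."""
    return sum(v is None for v in values) / len(values)


def observed_mean(values):
    """Mean of the non-missing entries in one variable."""
    return mean(v for v in values if v is not None)


INDICATORS = {
    "missing_fraction": missing_fraction,
    "observed_mean": observed_mean,
}


def run_report(data, variables, max_workers=4):
    """Call every indicator once per outcome variable.

    Returns a flat dict keyed by (indicator name, variable name), i.e. one
    entry per call, instead of a nested result per multi-variable call.
    """
    tasks = [(name, var) for name in INDICATORS for var in variables]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each task is independent, so all of them can be submitted at once.
        futures = {
            (name, var): pool.submit(INDICATORS[name], data[var])
            for name, var in tasks
        }
        return {key: future.result() for key, future in futures.items()}


# Made-up example data: two outcome variables with some missing values.
study_data = {"sbp": [120, None, 135, 128], "age": [54, 61, None, None]}
results = run_report(study_data, ["sbp", "age"])
```

Because every entry in `results` comes from exactly one function call on exactly one variable, a generic renderer can loop over the entries without knowing how each indicator structures its output. A thread pool is used here only to keep the sketch short; which parallel backend the package itself uses is not stated in this abstract.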

Lessons learned: Improving the visualization of data quality indicators and adding new functionality to reuse the code enhances the usability, reproducibility, and speed of dataquieR’s pipeline.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4(1):1244. DOI: 10.13063/2327-9214.1244
2. Nonnemacher M, Nasseh D, Stausberg J. Datenqualität in der medizinischen Forschung. Vol. 4. Medizinisch Wissenschaftliche Verlagsgesellschaft; 2014.
3. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology. 2021;21(1):1–15.
4. Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Applied Sciences. 2022;12(9):4238.