gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Assessment of data quality in observational studies: a concept-driven approach using R

Meeting Abstract

  • Adrian Richter - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Stephan Struckmann - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Achim Reineke - Leibniz-Institut für Präventionsforschung und Epidemiologie – BIPS, Bremen, Germany
  • Martin Junge - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany
  • Carsten Oliver Schmidt - Institut für Community Medicine, Universitätsmedizin Greifswald, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 229

doi: 10.3205/19gmds012, urn:nbn:de:0183-19gmds0129

Published: September 6, 2019

© 2019 Richter et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Data quality assessments should be tailored to the type of data source, such as primary or secondary data collections [1]. The programming language R offers versatile options to address these needs [2], [3].

Background: Observational studies with primary data collections are designed for research. Because the data-generating process is controlled by the responsible scientists, a wide range of measures can be applied to assess and monitor data quality. However, to date, no set of R functions or R package based on a conceptual data quality framework for primary data collections has been released.

Concept: The targeted data quality concept focuses on intrinsic data quality, i.e., quality that can be assessed without contextual information about a specific research question [4]. For example, the completeness and correctness of data from primary collections can be examined without considering a specific context. Correctness is further divided into consistency (to identify definitely incorrect or inadmissible data) and accuracy (to identify likely incorrect data). For each of these quality dimensions, a core set of R implementations has been developed; they are applied in the sequence completeness, consistency, and accuracy. Common to all R functions is the use of an augmented data dictionary (DD), which describes the expected characteristics of the collected study data. For example, missing codes, distributional type, and the plausibility limits of measurements are recorded as metadata in the DD.
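
As a rough illustration of this sequence, the following minimal R sketch runs completeness, consistency, and accuracy checks that are driven by a small data dictionary. The DD layout and column names, the function assess_variable(), the example data, and the use of Tukey fences as the accuracy check are all assumptions made for illustration; they are not the authors' implementation. Only missing codes and plausibility limits are used here; the distributional type is left out.

```r
# Hypothetical augmented data dictionary (DD): one row per study variable.
dd <- data.frame(
  variable      = c("sbp", "age"),
  missing_codes = c("99980|99990", "99980"),  # codes that denote missing values
  lower_limit   = c(60, 18),                  # plausibility limits
  upper_limit   = c(260, 100),
  stringsAsFactors = FALSE
)

# Made-up study data containing a few deliberate distortions.
study <- data.frame(
  sbp = c(118, 120, 122, 125, 128, 130, 135, 190, 400, 99980, NA),
  age = c(45, 17, 60, 99980, 52, 38, 41, 47, 55, 63, 49)
)

# Hypothetical helper: assess one variable using its DD row (meta).
assess_variable <- function(x, meta) {
  # 1. Completeness: NA or any missing code listed in the DD.
  codes <- as.numeric(strsplit(meta$missing_codes, "|", fixed = TRUE)[[1]])
  is_missing <- is.na(x) | x %in% codes
  measured <- x[!is_missing]

  # 2. Consistency: values outside the plausibility limits are inadmissible.
  inadmissible <- measured < meta$lower_limit | measured > meta$upper_limit
  admissible <- measured[!inadmissible]

  # 3. Accuracy: admissible values outside Tukey fences are likely incorrect.
  q <- quantile(admissible, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  likely_incorrect <- admissible < q[1] - fence | admissible > q[2] + fence

  data.frame(
    variable           = meta$variable,
    n_missing          = sum(is_missing),
    n_inadmissible     = sum(inadmissible),
    n_likely_incorrect = sum(likely_incorrect),
    stringsAsFactors   = FALSE
  )
}

# Apply the checks to every variable described in the DD.
report <- do.call(rbind, lapply(dd$variable, function(v) {
  assess_variable(study[[v]], dd[dd$variable == v, ])
}))
print(report)
```

In this made-up example, the sbp value 400 violates the plausibility limits and counts as inadmissible, whereas 190 is admissible but is flagged as likely incorrect. Keeping the expectations in the DD rather than in the code means the same function can be reused for any variable the DD describes.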

Implementation: A public website that introduces and illustrates the conceptual approach was generated using R Markdown [5]. It provides simulated study data with dozens of reproducible data distortions, metadata (DD) describing the study data, and extensive documentation of the R functions, which serves as a guide to their application. This documentation differs from the typical documentation of R packages in that it addresses the perspective of a user who examines data quality. Work to refactor the R functions into an R package is currently under way.

Conclusion: The current set of R functions allows for an assessment of a core set of data quality indicators in observational studies. Users are guided from examinations of completeness to examinations of the consistency and accuracy of the data. Free access to the functions is provided via a website. Currently, the approach is implemented in R, but it may be translated to other programming languages as well.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Keller S, Korkmaz G, Orr M, Schroeder A, Shipp S. The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches. Annual Review of Statistics and Its Application. 2017 Mar 7;4:85-108.
2. Ryu C. dlookr: Tools for Data Diagnosis, Exploration, Transformation. 2019:5.
3. validate: Data Validation Infrastructure. R package version 0.2.6. 2018. [Accessed 16 July 2019] Available from: https://cran.r-project.org/web/packages/validate/index.html
4. Wang RY, Strong DM. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems. 1996 Mar 1;12(4):5-33.
5. rmarkdown: Dynamic Documents for R. R package version 1.9. 2018. [Accessed 16 July 2019] Available from: https://cran.r-project.org/web/packages/rmarkdown/index.html