gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Insights from a scoping review on data quality assessments using R

Meeting Abstract

  • Elisa Kasbohm - Universitätsmedizin Greifswald, Institut für Community Medicine, Greifswald, Germany
  • Joany Mariño - Universitätsmedizin Greifswald, Institut für Community Medicine, Greifswald, Germany
  • Stephan Struckmann - Universitätsmedizin Greifswald, Institut für Community Medicine, Greifswald, Germany
  • Lorenz Kapsner - Universitätsklinikum Erlangen, Medizinisches Zentrum für Informations- und Kommunikationstechnik (MIK), Erlangen, Germany
  • Carsten Oliver Schmidt - Universitätsmedizin Greifswald, Institut für Community Medicine, Greifswald, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 204

doi: 10.3205/22gmds022, urn:nbn:de:0183-22gmds0221

Published: 19 August 2022

© 2022 Kasbohm et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: The quality of research data is crucial for any study in medical research and should be assessed efficiently and comprehensively. Packages for the programming language R [1] are particularly relevant for this purpose, but their functionalities had not previously been compared systematically. We therefore conducted a scoping review to identify R packages relevant to data quality assessments, to assess their scope against a reference data quality framework, and to detect gaps that should be addressed in future developments. We present key results of the review.

Methods: R packages related to data quality were identified by a systematic search in the Comprehensive R Archive Network (CRAN) [2], [3] and from the literature [4], [5], [6], [7], [8], [9]. Based on available documentation and test runs using example data from a cohort study, we evaluated the packages’ range of data quality indicators in reference to a data quality framework for observational health studies [10]. For this purpose, the functionalities of the packages were mapped against the four data quality dimensions (integrity, completeness, consistency, accuracy), ten domains (areas of data quality assessments that subdivide the four dimensions) and 34 indicators of the reference framework. We included active packages hosted on CRAN which cover at least three of the four data quality dimensions and four domains. Packages tailored to a specific type of data (e.g., RNA-sequencing data) were excluded from the assessment.
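The inclusion rule described above can be sketched as a small filter. This is only an illustration, not the authors' screening procedure: the `PackageProfile` class, the `is_eligible` function, and the three package names are hypothetical, the criterion is read here as "at least three dimensions and at least four domains", and the assignment of the four example domains to dimensions is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# The four data quality dimensions named in the abstract.
DIMENSIONS = ("integrity", "completeness", "consistency", "accuracy")


@dataclass
class PackageProfile:
    """Hypothetical record of what one package was found to cover."""
    name: str
    active_on_cran: bool
    data_type_specific: bool  # e.g. tailored to RNA-sequencing data
    # Maps a dimension to the set of domains covered within it.
    coverage: Dict[str, Set[str]] = field(default_factory=dict)

    def covered_dimensions(self) -> Set[str]:
        return {d for d in DIMENSIONS if self.coverage.get(d)}

    def covered_domains(self) -> Set[str]:
        domains: Set[str] = set()
        for d in DIMENSIONS:
            domains |= self.coverage.get(d, set())
        return domains


def is_eligible(pkg: PackageProfile, min_dimensions: int = 3, min_domains: int = 4) -> bool:
    """Apply the inclusion and exclusion criteria as read above (an assumption)."""
    if not pkg.active_on_cran or pkg.data_type_specific:
        return False
    return (len(pkg.covered_dimensions()) >= min_dimensions
            and len(pkg.covered_domains()) >= min_domains)


# Assumed domain-to-dimension assignment, for illustration only.
broad_coverage = {
    "integrity": {"Value format errors"},
    "completeness": {"Crude missingness"},
    "consistency": {"Range and value violations"},
    "accuracy": {"Unexpected distributions"},
}

broad = PackageProfile("hypothetical_broad", True, False, broad_coverage)
narrow = PackageProfile("hypothetical_narrow", True, False,
                        {"completeness": {"Crude missingness"}})
omics = PackageProfile("hypothetical_rnaseq", True, True, broad_coverage)

eligible = [p.name for p in (broad, narrow, omics) if is_eligible(p)]
print(eligible)  # ['hypothetical_broad']
```

Keeping the coverage as a dimension-to-domains mapping makes both thresholds a simple count, so a different reading of the criterion only changes the two default arguments.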

Results: We screened more than 140 R packages related to data quality, of which 27 were eligible for inclusion in our review. Only three packages follow a data quality concept (dataquieR: [10], DQAstats: [11], MOQA: [12]). Coverage of the framework differed strongly between packages. No package covered more than eight of the ten domains: two packages covered eight (pointblank, dataquieR) and four covered seven (DescTools, DQAstats, testdat, validate). Some domains were covered by the majority of packages, such as Crude missingness (missing values without taking the reason for missingness into account), Range and value violations, Unexpected distributions, and Value format errors. Only three packages handle missing value codes, and only one package provides checks for disagreements between repeated measurements. Packages that focus on descriptive output and data exploration often provide only a few formal checks on unmet requirements, whereas other packages are based on targeted checks using metadata that are provided either in programming code or in separate files.
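To make two of the commonly covered domains concrete, the following minimal sketch computes crude missingness and range violations for one variable and shows why handling missing value codes matters. It is written in plain Python rather than R and does not reproduce the API of any reviewed package; the variable name `sbp`, the limits, the missing value codes, and the measurements are all invented for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing, regardless of the reason."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def crude_missingness(values):
    """Share of missing entries, ignoring why they are missing."""
    if not values:
        return 0.0
    return sum(1 for v in values if is_missing(v)) / len(values)


def recode_missing_codes(values, missing_codes):
    """Replace declared missing value codes (e.g. 99999) with None."""
    return [None if (not is_missing(v) and v in missing_codes) else v
            for v in values]


def range_violations(values, lower, upper):
    """Indices of non-missing values outside the admissible range."""
    return [i for i, v in enumerate(values)
            if not is_missing(v) and not (lower <= v <= upper)]


# Hypothetical metadata for one variable, as it might be kept in a separate file.
metadata = {"sbp": {"lower": 60, "upper": 300, "missing_codes": {99998, 99999}}}

# Invented measurements: one coded missing, one truly empty, one implausible.
sbp = [120, 99999, None, 350, 135]
meta = metadata["sbp"]

# Without knowledge of the codes, 99999 looks like a range violation.
print(crude_missingness(sbp))                             # 0.2
print(range_violations(sbp, meta["lower"], meta["upper"]))  # [1, 3]

# With the codes applied, it is counted as missing instead.
cleaned = recode_missing_codes(sbp, meta["missing_codes"])
print(crude_missingness(cleaned))                             # 0.4
print(range_violations(cleaned, meta["lower"], meta["upper"]))  # [3]
```

The same data yield different results for both checks depending on whether the metadata declare the missing value codes, which is one way to see why metadata-driven checks extend what a package can detect.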

Discussion: Existing R packages take very different approaches to assessing data quality. Future developments should consider more extensive use of metadata, user-friendliness, and error handling. The functionalities of the packages fit well into the reference data quality framework at the level of dimensions and domains, yet selected indicators need to be added.

Conclusion: Many R packages are available to support data quality assessments, and their routine use is highly recommended. Above all, improved use of metadata that represent requirements on the variables would expand the scope of possible data quality checks.

The scoping review was recently published in Applied Sciences, Special Issue “Data Science for Medical Informatics” [13].

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
2.
The Comprehensive R Archive Network. [accessed 2022 May 23]. Available from: https://cran.r-project.org/
3.
Csárdi G, Salmon M. pkgsearch: Search and Query CRAN R Packages. R Package Version 3.0.3. 2020 [accessed 2022 Jan 18]. Available from: https://cran.r-project.org/package=pkgsearch
4.
Bialke M, Rau H, Schwaneberg T, Walk R, Bahls T, Hoffmann W. mosaicQA - A General Approach to Facilitate Basic Data Quality Assurance for Epidemiological Research. Methods Inf Med. 2017;56(7):e67-e73. DOI: 10.3414/ME16-01-0123
5.
Kapsner LA, Mang JM, Mate S, Seuchter SA, Vengadeswaran A, Bathelt F, et al. Linking a Consortium-Wide Data Quality Assessment Tool with the MIRACUM Metadata Repository. Appl Clin Inf. 2021;12(04):826–835. DOI: 10.1055/s-0041-1733847
6.
Petersen AH, Ekstrøm CT. dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R. J Stat Soft. 2019;90(6):1-38. DOI: 10.18637/jss.v090.i06
7.
Putatunda S, Ubrangala D, Rama K, Kondapalli R. SmartEDA: An R Package for Automated Exploratory Data Analysis. J Open Source Softw. 2019;4(41):1509. DOI: 10.21105/joss.01509
8.
Staniak M, Biecek P. The Landscape of R Packages for Automated Exploratory Data Analysis. R J. 2019;11(2):347–369. DOI: 10.32614/RJ-2019-033
9.
van der Loo MPJ, de Jonge E. Data Validation Infrastructure for R. J Stat Soft. 2021;97(10):1-31. DOI: 10.18637/jss.v097.i10
10.
Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21(1):63. DOI: 10.1186/s12874-021-01252-7
11.
Kapsner LA, Kampf MO, Seuchter SA, Kamdje-Wabo G, Gradinger T, Ganslandt T, et al. Moving Towards an EHR Data Quality Framework: The MIRACUM Approach. Ger Med Data Sci Shap Change – Creat Solut Innov Med. 2019:247–253. DOI: 10.3233/SHTI190834
12.
Nonnemacher M, Nasseh D, Stausberg J. Datenqualität in der medizinischen Forschung: Leitlinie zum adaptiven Management von Datenqualität in Kohortenstudien und Registern. 2., aktualisierte und erweiterte Auflage. Berlin: Medizinisch Wissenschaftliche Verlagsgesellschaft; 2014. DOI: 10.32745/9783954663743
13.
Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Appl Sci. 2022;12(9):4238. DOI: 10.3390/app12094238