Insights from a scoping review on data quality assessments using R
Published: 19 August 2022
Introduction: The quality of research data is crucial for any study in medical research and should be assessed efficiently and comprehensively. Packages in the programming language R [1] are particularly relevant for this purpose, but their functionalities had not previously been compared systematically. We therefore conducted a scoping review to identify R packages of relevance for data quality assessments, to assess their scope against a reference data quality framework, and to detect gaps that future developments should address. We present key results of the review.
Methods: R packages related to data quality were identified by a systematic search in the Comprehensive R Archive Network (CRAN) [2], [3] and from the literature [4], [5], [6], [7], [8], [9]. Based on available documentation and test runs using example data from a cohort study, we evaluated the packages’ range of data quality indicators in reference to a data quality framework for observational health studies [10]. For this purpose, the functionalities of the packages were mapped against the four data quality dimensions (integrity, completeness, consistency, accuracy), ten domains (areas of data quality assessments that subdivide the four dimensions) and 34 indicators of the reference framework. We included active packages hosted on CRAN which cover at least three of the four data quality dimensions and four domains. Packages tailored to a specific type of data (e.g., RNA-sequencing data) were excluded from the assessment.
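The inclusion step described above can be sketched in base R. This is a minimal illustration, not the procedure used in the review: the package names, the domain labels, and the coverage table are invented, and the thresholds are read here as "at least three dimensions and at least four domains", which is one interpretation of the criterion.

```r
# Hypothetical coverage table: one row per (package, domain) pair that a
# package was found to cover. Package and domain names are invented.
coverage <- data.frame(
  package   = c("pkg_a", "pkg_a", "pkg_a", "pkg_a",
                "pkg_b", "pkg_b",
                "pkg_c", "pkg_c", "pkg_c", "pkg_c"),
  dimension = c("integrity", "completeness", "consistency", "consistency",
                "completeness", "consistency",
                "completeness", "completeness", "consistency", "consistency"),
  domain    = c("domain_1", "domain_2", "domain_3", "domain_4",
                "domain_2", "domain_3",
                "domain_2", "domain_5", "domain_3", "domain_4"),
  stringsAsFactors = FALSE
)

# Count the distinct dimensions and domains covered by each package.
n_dimensions <- tapply(coverage$dimension, coverage$package,
                       function(x) length(unique(x)))
n_domains    <- tapply(coverage$domain, coverage$package,
                       function(x) length(unique(x)))

# Apply the thresholds (an assumed reading of the inclusion criterion).
eligible <- names(n_dimensions)[n_dimensions >= 3 & n_domains >= 4]
print(eligible)
```

In this toy table only pkg_a passes both thresholds; pkg_c covers four domains but only two dimensions.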
Results: We screened more than 140 R packages related to data quality, of which 27 were eligible for inclusion in our review. Only three packages follow a data quality concept (dataquieR: [10], DQAstats: [11], MOQA: [12]). The coverage of the framework differed strongly between packages. No single package covered more than eight of the ten domains (pointblank, dataquieR); four packages covered seven domains (DescTools, DQAstats, testdat, validate). Some domains were covered by the majority of packages, such as Crude missingness (missing values without taking the reason for missingness into account), Range and value violations, Unexpected distributions, and Value format errors. Only three packages handle missing value codes, and only one package provides checks for disagreements between repeated measurements. Packages that focus on descriptive output and data exploration often provide only a few formal checks on unmet requirements, whereas other packages are based on targeted checks using metadata that is provided either by programming code or by separate files.
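To make the idea of metadata-driven checks concrete, the following base R sketch counts crude missingness, coded missing values, and range violations per variable. It does not reproduce the interface of any reviewed package: the data, the metadata table, the missing value code 999, and the helper check_variable() are all hypothetical.

```r
# Hypothetical study data; 999 is an invented missing value code for sbp.
study <- data.frame(
  age = c(34, 51, NA, 140, 67),
  sbp = c(120, 999, 135, NA, 310)
)

# Hypothetical metadata: admissible range and missing value code per variable.
meta <- data.frame(
  variable     = c("age", "sbp"),
  lower        = c(18, 60),
  upper        = c(110, 260),
  missing_code = c(NA, 999),
  stringsAsFactors = FALSE
)

# Hypothetical helper: runs three checks on one variable.
check_variable <- function(x, lower, upper, missing_code) {
  is_na    <- is.na(x)
  is_coded <- !is.na(missing_code) & !is_na & x == missing_code
  observed <- !is_na & !is_coded
  c(
    crude_missing    = sum(is_na | is_coded),  # missing for any reason
    coded_missing    = sum(is_coded),          # missing with a declared code
    range_violations = sum(observed & (x < lower | x > upper))
  )
}

# Run the checks for every variable listed in the metadata.
results <- t(sapply(seq_len(nrow(meta)), function(i) {
  check_variable(study[[meta$variable[i]]],
                 meta$lower[i], meta$upper[i], meta$missing_code[i])
}))
rownames(results) <- meta$variable
print(results)
```

Because the code 999 is declared in the metadata, it is counted as a missing value in sbp rather than as a range violation; without that declaration it would be flagged as an inadmissible measurement.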
Discussion: Existing R packages take very different approaches to assessing data quality. Future developments should make more extensive use of metadata and improve user-friendliness and error handling. The functionalities of the packages fit well into the reference data quality framework at the level of dimensions and domains, yet selected indicators need to be added.
Conclusion: Many R packages are available to support data quality assessments, and their routine use is highly recommended. Above all, better use of metadata that represent requirements on the variables would expand the scope of possible data quality checks.
The scoping review was recently published in Applied Sciences, Special Issue “Data Science for Medical Informatics” [13].
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
2. The Comprehensive R Archive Network. [accessed 2022 May 23]. Available from: https://cran.r-project.org/
3. Csárdi G, Salmon M. pkgsearch: Search and Query CRAN R Packages. R Package Version 3.0.3. 2020 [accessed 2022 Jan 18]. Available from: https://cran.r-project.org/package=pkgsearch
4. Bialke M, Rau H, Schwaneberg T, Walk R, Bahls T, Hoffmann W. mosaicQA - A General Approach to Facilitate Basic Data Quality Assurance for Epidemiological Research. Methods Inf Med. 2017;56(7):e67-e73. DOI: 10.3414/ME16-01-0123
5. Kapsner LA, Mang JM, Mate S, Seuchter SA, Vengadeswaran A, Bathelt F, et al. Linking a Consortium-Wide Data Quality Assessment Tool with the MIRACUM Metadata Repository. Appl Clin Inf. 2021;12(04):826–835. DOI: 10.1055/s-0041-1733847
6. Petersen AH, Ekstrøm CT. dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R. J Stat Soft. 2019;90(6):1-38. DOI: 10.18637/jss.v090.i06
7. Putatunda S, Ubrangala D, Rama K, Kondapalli R. SmartEDA: An R Package for Automated Exploratory Data Analysis. J Open Source Softw. 2019;4(41):1509. DOI: 10.21105/joss.01509
8. Staniak M, Biecek P. The Landscape of R Packages for Automated Exploratory Data Analysis. R J. 2019;11(2):347–369. DOI: 10.32614/RJ-2019-033
9. van der Loo MPJ, de Jonge E. Data Validation Infrastructure for R. J Stat Soft. 2021;97(10):1-31. DOI: 10.18637/jss.v097.i10
10. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21(1):63. DOI: 10.1186/s12874-021-01252-7
11. Kapsner LA, Kampf MO, Seuchter SA, Kamdje-Wabo G, Gradinger T, Ganslandt T, et al. Moving Towards an EHR Data Quality Framework: The MIRACUM Approach. Ger Med Data Sci Shap Change – Creat Solut Innov Med. 2019:247–253. DOI: 10.3233/SHTI190834
12. Nonnemacher M, Nasseh D, Stausberg J. Datenqualität in der medizinischen Forschung: Leitlinie zum adaptiven Management von Datenqualität in Kohortenstudien und Registern. 2., aktualisierte und erweiterte Auflage. Berlin: Medizinisch Wissenschaftliche Verlagsgesellschaft; 2014. DOI: 10.32745/9783954663743
13. Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Appl Sci. 2022;12(9):4238. DOI: 10.3390/app12094238