An improved metadata schema to support data quality reporting in R
Published: September 15, 2023
Introduction: The validity of scientific research results depends heavily on data quality. Data quality analyses typically assess data quality indicators by comparing observed data properties with formalized expectations, which are often provided as metadata. Such metadata can be quite diverse, ranging from the number of expected observations for the entire data set to properties of individual variables, such as data type or inadmissible values. Here, we present an extension of the metadata schema used by the dataquieR R package [1] that eases the generation of reusable data quality reports.
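To make the idea of comparing observed values with formalized expectations concrete, the following minimal sketch checks one variable against its expected data type and admissible range. It is written in Python for illustration only and does not use the dataquieR API; the class `ItemMetadata`, its field names, the variable `sbp`, and the limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemMetadata:
    """Expectations for one variable (all field names are hypothetical)."""
    var_name: str
    data_type: type
    lower_limit: Optional[float] = None
    upper_limit: Optional[float] = None


def inadmissible_values(values, meta):
    """Return observed values that violate the type or limit expectations."""
    flagged = []
    for v in values:
        if v is None:
            continue  # missing values are not judged by this check
        if not isinstance(v, meta.data_type):
            flagged.append(v)  # unexpected data type
        elif meta.lower_limit is not None and v < meta.lower_limit:
            flagged.append(v)  # below the admissible range
        elif meta.upper_limit is not None and v > meta.upper_limit:
            flagged.append(v)  # above the admissible range
    return flagged


# Hypothetical expectations for a systolic blood pressure variable.
sbp_meta = ItemMetadata("sbp", float, lower_limit=60.0, upper_limit=300.0)
observed = [120.0, 45.0, None, 310.5, "n/a", 135.0]
flagged = inadmissible_values(observed, sbp_meta)
print(flagged)
```

Keeping the expectations in a separate metadata object, rather than hard-coding them in the check, is what allows the same check to be reused across variables and studies.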
State of the art: We based the metadata schema on a formal data quality framework for observational studies [2], which covers specific attributes usable for data quality analyses more extensively than other available data models and standards (e.g., OBiBa Opal, REDCap, FHIR). The original metadata schema consisted of a single table that provided expectations about single variables (i.e., distributional assumptions, inadmissibility limits), descriptions of the variables (e.g., variable names, variable and value labels), and information to control the report output (e.g., the order of variables). Furthermore, it was possible to specify contradiction rules from a set of predefined templates.
Concept: Our metadata extension organizes the information in a structured form across multiple tables to increase the versatility of data quality assessments. The schema retains the initial item-level metadata table (i.e., single-variable level). In addition, there are separate tables for the segment level (e.g., physical examinations, laboratory measurements) and the data frame level (e.g., datasets provided in different files) to better reflect the possible tiers of information in an epidemiological study. Moreover, a cross-item level table facilitates the joint use of two or more variables, for example, to control the calculation of multivariate outliers.
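The four levels can be pictured as separate tables that refer to each other by name. The sketch below, in Python and independent of dataquieR, models each table as a list of rows and checks that references between levels resolve. All table names, column names, and values are hypothetical and chosen only to mirror the levels described above.

```python
# Four metadata levels as plain lists of rows (all names are hypothetical).
metadata = {
    "item_level": [
        {"var_name": "sbp", "segment": "physical_exam"},
        {"var_name": "dbp", "segment": "physical_exam"},
        {"var_name": "crp", "segment": "laboratory"},
    ],
    "segment_level": [
        {"segment": "physical_exam"},
        {"segment": "laboratory"},
    ],
    "dataframe_level": [
        {"data_frame": "study_data.csv", "expected_records": 3000},
    ],
    "cross_item_level": [
        {"check": "multivariate_outliers", "variables": ["sbp", "dbp"]},
        {"check": "multivariate_outliers", "variables": ["crp", "bmi"]},
    ],
}


def unresolved_references(meta):
    """List references that point to an undefined segment or variable."""
    known_vars = {row["var_name"] for row in meta["item_level"]}
    known_segments = {row["segment"] for row in meta["segment_level"]}
    problems = []
    # Every item must belong to a segment defined at the segment level.
    for row in meta["item_level"]:
        if row["segment"] not in known_segments:
            problems.append(("item_level", row["var_name"], row["segment"]))
    # Every cross-item entry may only use variables defined at the item level.
    for row in meta["cross_item_level"]:
        for var in row["variables"]:
            if var not in known_vars:
                problems.append(("cross_item_level", row["check"], var))
    return problems


problems = unresolved_references(metadata)
print(problems)
```

In this example the second cross-item entry refers to `bmi`, which has no item-level row, so it is reported as unresolved.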
Implementation: We arranged each metadata table as a spreadsheet in a workbook, so that users can enter metadata directly in the spreadsheet or specify a source file for a specific item (e.g., another spreadsheet or a URL). We extended the item-level metadata to include references to a standardized vocabulary and to missing code tables, as well as information to control the calculation of outliers and unexpected distributions. The new metadata schema also allows users to enter reference tables for participant IDs at the different study levels. Additionally, contradiction rules can now be freely specified in a more readable way.
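As an illustration of how a missing code table can be used, the following sketch maps coded values to reason categories and derives the share of observations per category. It is a Python sketch under stated assumptions, not the dataquieR implementation: the codes, the category names, and the function `missingness_rates` are hypothetical.

```python
from collections import Counter

# Hypothetical missing code table: code -> reason category.
missing_codes = {
    99980: "non-response",
    99981: "refusal",
    99982: "drop-out",
}


def missingness_rates(values, code_table):
    """Share of all observations that fall into each missing-code category."""
    categories = sorted(set(code_table.values()))
    n = len(values)
    if n == 0:
        return {category: 0.0 for category in categories}
    # Count only values that appear in the missing code table.
    counts = Counter(code_table[v] for v in values if v in code_table)
    return {category: counts.get(category, 0) / n for category in categories}


observed = [72, 99981, 65, 99980, 99981, 80, 99982, 77, 70, 68]
rates = missingness_rates(observed, missing_codes)
print(rates)
```

Because the code table is itself metadata, the same function works unchanged for a study that uses different codes or additional reason categories.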
Lessons learned: The new metadata schema allows computing an extended range of quality indicators, including unexpected data elements (variables) and records, duplicates, ID mismatches, rates of non-response, refusal and drop-out, univariate and multivariate outliers, and unexpected location, shape, scale and proportion. The metadata schema provides the necessary information to assess 24 of the 34 data quality indicators defined in the referenced framework [2]. Handling metadata in dedicated spreadsheets empowers users to evaluate data quality and generate reports without requiring extensive programming. Furthermore, the assumptions used to conduct data quality analyses become transparent and interoperable. Hence, the FAIRness of the data quality analyses increases.
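Several of the record-related indicators listed above can be derived by comparing observed participant IDs with a reference ID table. The following Python sketch shows one possible way to do this; it is not taken from dataquieR, and the function `id_checks`, the result keys, and the example IDs are hypothetical.

```python
from collections import Counter


def id_checks(observed_ids, reference_ids):
    """Compare observed participant IDs against a reference ID table."""
    counts = Counter(observed_ids)
    observed = set(counts)
    reference = set(reference_ids)
    return {
        # IDs that occur more than once in the observed data.
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
        # IDs in the data that the reference table does not list.
        "unexpected": sorted(observed - reference),
        # IDs the reference table lists but the data do not contain.
        "missing": sorted(reference - observed),
    }


reference_ids = ["P001", "P002", "P003", "P004"]
observed_ids = ["P001", "P002", "P002", "P005"]
result = id_checks(observed_ids, reference_ids)
print(result)
```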
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Richter A, Schmidt CO, Krüger M, Struckmann S. dataquieR: assessment of data quality in epidemiological research. J Open Source Softw. 2021;6(61):3093. DOI: 10.21105/joss.03093
2. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21(1):63. DOI: 10.1186/s12874-021-01252-7