Comparing the attributes of health research data models and standards for conducting data quality assessments
Published: September 6, 2024
Introduction: Many health standards and tools are tailored to capture, describe, and manage data. Across these cases, evaluating data quality (DQ) requires comprehensive metadata describing the expectations and requirements on the data. Creating such metadata is complex and time-consuming; hence, knowing the potential of popular standards and tools for hosting metadata is important. Here, we gauge the ability of their underlying data models (DM) to provide metadata attributes and evaluate their potential for data quality assessments (DQA) according to a formal DQ framework (DQ-OBS) [1].
Methods: We focused on widely used standards and tools in healthcare and research: CDISC Define-XML, HL7® FHIR®, OHDSI OMOP Common Data Model, openEHR, OpenClinica®, and REDCap®. These were selected by experts from the National Research Data Infrastructure for Personal Health Data (NFDI4Health) and the German Medical Informatics Initiative (MII) because of their relevance in the health sciences. As a reference, we chose DQ-OBS because it was developed from consensus-based DQ indicators, organized across the Integrity, Completeness, Consistency, and Accuracy dimensions. We collaborated with clinicians and medical informaticians with expertise in the tools and standards, interoperability, eHealth, and DQ. We held two meetings with all authors, followed by several individual meetings and written exchanges to discuss DM attributes and their mapping to DQ-OBS. Attributes preventing DQ issues during data entry or capture (e.g., in electronic health records or case report forms, without differentiating between them) were included if they could be stored and reused. The match between each DQ-OBS indicator and the DM attributes was systematically checked for plausibility against the corresponding documentation.
Results: Among the DQ dimensions, attributes to compute Completeness and Consistency indicators are well represented across DMs, Integrity is mostly addressed, and Accuracy is poorly targeted. None of the standards or tools include information to handle mismatches between data sets. A few indicators are trivial to compute because no metadata is needed (e.g., crude missingness can be calculated if missing value codes are defined in the data). OpenClinica, REDCap, and FHIR prevent DQ issues during data entry using definitions entered before data collection. The only dedicated software tools for DQA are OHDSI's DataQualityDashboard and REDCap's DQ module. All standards allow defining custom rules; however, these require technical knowledge and are standard-specific, as the DMs are not interoperable. Nevertheless, related but independent open-source software facilitates entering and validating rules (e.g., software from CDISC COSA for Define-XML, ocRuleTool for OpenClinica). Similarly, other software allows extracting and transforming data from the respective DMs into R or Python, which is crucial as the data can then be reused for DQA (e.g., fhircrackr, pyEHR, openehR, or REDCapR).
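The claim that crude missingness needs no extra metadata once missing value codes are known can be illustrated with a minimal, hypothetical Python sketch (the function name, variable, and codes are illustrative and not part of any of the assessed standards or tools):

```python
def crude_missingness(values, missing_codes):
    """Fraction of entries that are absent (None) or flagged by a
    study-specific missing value code -- no further metadata required."""
    if not values:
        return 0.0
    n_missing = sum(1 for v in values if v is None or v in missing_codes)
    return n_missing / len(values)

# Example: -99 and "NA" are assumed study-specific missing value codes.
sbp = [120, 135, None, -99, "NA", 118, 142, -99]
print(crude_missingness(sbp, {-99, "NA"}))  # → 0.5
```

By contrast, indicators in the Integrity or Accuracy dimensions (e.g., checks against admissible ranges or a gold standard) need additional metadata that would have to be hosted in the DM itself.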
Conclusions: All assessed DMs offer possibilities for DQA. However, their main focus is formal checks on data completeness and consistency, while capturing the information relevant for evaluating data integrity and accuracy would require extending the DMs. Developing DMs and tools that include this additional metadata and the DQ results would greatly enhance research transparency. Once such metadata is in place, assessments could be done efficiently and uniformly using generic DQ tools [2]. These developments would improve the potential for harmonized DQA and reproducible research.
Clair Blacketer is an employee of Janssen Research & Development, LLC and holds stock and stock options.
The authors declare that an ethics committee vote is not required.
References
1. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology. 2021 Dec;21:1-5. DOI: 10.1186/s12874-021-01252-7
2. Mariño J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R packages for data quality assessments and data monitoring: a software scoping review with recommendations for future developments. Applied Sciences. 2022 Apr 22;12(9):4238. DOI: 10.3390/app12094238
