Article
Assessing Data Quality while Processing Medical Data Exported from Clinical Systems: A Step-wise Approach
Published: September 15, 2023
Introduction: Secondary use of data collected during primary care can significantly increase the breadth and depth of information available for medical research [1]. Thus, providing high-quality medical data from university hospitals to researchers is one of the main objectives of the German Medical Informatics Initiative (MII) [2]. However, the quality of these data typically suffers from several problems, including sparsity, heterogeneity, and lack of structure. Additionally, the verifiability of data quality (DQ) depends heavily on the export and integration strategy. Finally, DQ is crucial in medical informatics, because higher DQ broadens the ways in which the data can be used. This raises the question of where in the processing pipeline DQ is best measured.
Methods: As the DQ assessment framework, we used the criteria proposed in [3]. We examined which parts of our Extract-Transform-Load (ETL) pipeline can assess DQ. We identified two cascaded ETL processes that handle technical and syntactical integration, semantic harmonization, and storage of the data in the staging area. Additionally, DQ checks can be performed in a third ETL process, which provides the stored data to researchers for specific use scenarios. Finally, researchers can also implement DQ assessments themselves as part of their analysis. Moreover, we agreed on the following (secondary) optimization goals: (1) save processing time and disk space where possible and (2) assess DQ as early as possible.
Results: Since not every export contains a complete clinical dataset, only value conformance and completeness can be checked in the first two ETL processes. The results are the same regardless of which of these two processes performs the assessment. In the third ETL process, all DQ assessments can be performed. Leaving the full DQ assessment to researchers is not practical because it would cost too much programming time. Based on the two optimization goals, we decided to check value conformance and completeness in the first ETL process, while the data are harmonized and integrated. All other DQ assessments are performed in the last ETL process.
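The two checks assigned to the first ETL process can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the field names, the required-field list, and the allowed value sets are hypothetical and serve only to show what record-level completeness and value conformance checks might look like.

```python
# Hypothetical rule set; field names and allowed values are illustrative only.
REQUIRED_FIELDS = ("patient_id", "code", "value")
ALLOWED_VALUES = {"sex": {"male", "female", "other", "unknown"}}


def check_completeness(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def check_value_conformance(record: dict) -> list[str]:
    """Return the fields whose value lies outside the allowed value set."""
    return [
        f
        for f, allowed in ALLOWED_VALUES.items()
        if f in record and record[f] not in allowed
    ]


# Both checks need only the single record at hand, so they can run
# while the record is being harmonized and integrated.
record = {"patient_id": "p1", "code": "", "sex": "m"}
incomplete = check_completeness(record)
nonconformant = check_value_conformance(record)
```

Both functions operate on one record in isolation, which is why checks of this kind do not depend on a complete clinical dataset being present in the export.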
Discussion: The DQ API [4] is used to implement the DQ assessments. Identified DQ issues are corrected where possible; data that cannot be corrected are discarded. Therefore, we must differentiate between missing and discarded data. We defined the first rule (saving processing time and disk space) because of disk space and processing time issues when dealing with exported clinical data. The second rule (assessing DQ as early as possible) was introduced to minimize the influence of the ETL processes on DQ. Together, these rules allow us to provide data with the highest achievable DQ.
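The distinction between missing and discarded data can be made explicit by recording a status next to each value. The following Python sketch shows one possible way to do this. It does not use or reproduce the DQ API [4]; the `Status` enum, the `assess_value` function, and the example validation and correction rules are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"                # value passed the check unchanged
    CORRECTED = "corrected"  # value failed the check but could be repaired
    MISSING = "missing"      # value was never present in the export
    DISCARDED = "discarded"  # value was present but could not be repaired


def assess_value(raw, is_valid, correct):
    """Return (value, status) for one raw value.

    is_valid and correct are caller-supplied (hypothetical) rules;
    correct returns a repaired value or None if no repair is known.
    """
    if raw is None:
        return None, Status.MISSING
    if is_valid(raw):
        return raw, Status.OK
    fixed = correct(raw)
    if fixed is not None and is_valid(fixed):
        return fixed, Status.CORRECTED
    return None, Status.DISCARDED


# Illustrative rules for a coded field.
def is_valid(value):
    return value in {"male", "female"}


def correct(value):
    return {"m": "male", "f": "female"}.get(str(value).lower())


results = [assess_value(v, is_valid, correct) for v in (None, "male", "M", "x")]
```

Keeping the status alongside the value lets a later completeness report separate values that were absent at the source from values removed by the pipeline.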
Conclusion: Which aspects of DQ can be measured in the first two ETL processes depends heavily on the export strategy of the data. Exports containing only updates can be checked neither for relational and computational conformance nor for atemporal and temporal plausibility, since previous data are unavailable. Exports containing all versions of data entities cannot be checked for uniqueness. Because the structure of the data can vary heavily, the point at which each aspect of DQ is assessed must be selected carefully.
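The exclusions stated above can be written down as a small lookup, sketched here in Python. The category names follow the framework in [3]; the strategy labels `updates_only` and `all_versions` and the function name are hypothetical. The function returns only the categories that the statements above do not rule out, which does not guarantee that each remaining check is feasible for a given export.

```python
# DQ categories of the framework in [3].
ALL_CHECKS = frozenset({
    "value_conformance",
    "relational_conformance",
    "computational_conformance",
    "completeness",
    "uniqueness_plausibility",
    "atemporal_plausibility",
    "temporal_plausibility",
})

# Categories ruled out per export strategy (strategy labels are hypothetical).
NOT_CHECKABLE = {
    "updates_only": frozenset({
        "relational_conformance",
        "computational_conformance",
        "atemporal_plausibility",
        "temporal_plausibility",
    }),
    "all_versions": frozenset({"uniqueness_plausibility"}),
}


def not_ruled_out(export_strategy: str) -> frozenset:
    """Return the DQ categories not excluded for the given export strategy."""
    return ALL_CHECKS - NOT_CHECKABLE[export_strategy]


early_updates_only = not_ruled_out("updates_only")
early_all_versions = not_ruled_out("all_versions")
```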
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Safran C, Bloomrosen M, Hammond WE, Labkoff S, Markel-Fox S, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. J Am Med Inform Assoc. 2007;14(1):1-9. DOI: 10.1197/jamia.M2273
2. Semler SC, Wissing F, Heyder R. German Medical Informatics Initiative. Methods Inf Med. 2018;57(S 01):e50-e56. DOI: 10.3414/ME18-03-0003
3. Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC). 2016;4(1):1244. DOI: 10.13063/2327-9214.1244
4. Spengler H, Gatz I, Kohlmayer F, Kuhn KA, Prasser F. Improving data quality in medical research: A monitoring architecture for clinical and translational data warehouses. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS); 2020 Jul 28-30. IEEE; 2020. p. 415-420. DOI: 10.1109/CBMS49503.2020.00085