Aspects of data warehousing pipeline improvements: Data quality, connectivity, performance
Published: 15 September 2023
Introduction: Clinical data warehouses play an important role in providing medical data for project-specific research. One of the most commonly used clinical data warehouses is i2b2 (Informatics for Integrating Biology & the Bedside) [1]. Because medical data are often highly heterogeneous on multiple levels, harmonizing them during extraction, transformation and loading (ETL) can be difficult. We therefore enhanced an existing ETL pipeline to make data integration into i2b2 easier.
State of the art: A variety of ETL tools exist for loading data into i2b2, e.g., tranSMART batch [2]. However, these tools have several limitations. The most common problems are insufficient code maintenance, a lack of compatibility with newer i2b2 versions, and structural issues that cause considerable preprocessing overhead. Some of these issues were addressed by Spengler et al. [3]. However, while integrating our data into i2b2, we found potential for further improvements to this ETL pipeline [3].
Concept: We identified and addressed three main challenges: (1) handling data quality (DQ) issues in the source data, (2) support for new target and source systems, and (3) overall performance. Concerning DQ, we iteratively re-loaded our data sets and discussed them with physicians and researchers, thereby fine-tuning our pipeline. We also added options to log information about the DQ. This gives researchers additional information about their data and makes data correction easier. In terms of new target and source systems, we added support for newer i2b2 versions and made it possible to use a data lake as a source system. Finally, regarding performance, the main issue we addressed was long loading times for large data sets.
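As an illustration of the DQ handling described above, the following minimal Python sketch shows one way a loading step could correct some issues while only logging others. It is not taken from the pipeline itself: the record layout, the function name `check_records`, the issue labels, and the use of a fixed fallback timestamp are assumptions made for this example.

```python
import logging
from collections import Counter

# Hypothetical logger name; the real pipeline may log differently.
logger = logging.getLogger("etl.dq")


def check_records(records, fallback_timestamp=None):
    """Split records into loadable ones and a tally of DQ issues.

    Assumed record layout: dicts with 'patient_id', 'concept',
    'timestamp' and 'value' keys (illustrative only).
    """
    issues = Counter()
    loadable = []
    for rec in records:
        rec = dict(rec)  # work on a copy, leave the input untouched
        if not rec.get("patient_id"):
            # Cannot be corrected automatically: log and skip.
            issues["missing_patient_id"] += 1
            logger.warning("Record without patient id skipped: %r", rec)
            continue
        if not rec.get("timestamp"):
            if fallback_timestamp is None:
                # Correction not enabled: log and skip.
                issues["empty_timestamp_skipped"] += 1
                logger.warning("Record without timestamp skipped: %r", rec)
                continue
            # Correction enabled: fill in the configured fallback value.
            rec["timestamp"] = fallback_timestamp
            issues["empty_timestamp_corrected"] += 1
            logger.info("Empty timestamp corrected for %s", rec["patient_id"])
        loadable.append(rec)
    return loadable, issues


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sample = [
        {"patient_id": "P1", "concept": "LAB:CRP",
         "timestamp": "2023-01-05T08:30:00", "value": "4.2"},
        {"patient_id": "P2", "concept": "LAB:CRP",
         "timestamp": "", "value": "7.9"},
        {"patient_id": "", "concept": "LAB:CRP",
         "timestamp": "2023-01-06T09:00:00", "value": "3.1"},
    ]
    loadable, issues = check_records(
        sample, fallback_timestamp="1900-01-01T00:00:00"
    )
    print(len(loadable), dict(issues))
```

Returning the issue tally alongside the loadable records keeps the decision about what to do with each issue type (correct, skip, or report to the data owner) outside the check itself.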
Implementation: We added new features to handle DQ problems, e.g., an option to correct empty timestamps for otherwise valid data. We also incorporated checks in the pipeline to measure DQ, using the DQ API [4]. The ETL pipeline now supports newer i2b2 versions up to 1.7.13. The pipeline automatically detects the version of the target system and transforms the data accordingly. We extended the possible input sources with the option to connect to a data lake, so that JSON data blocks and modified HL7v2 messages can also be processed. In terms of performance, we greatly reduced loading times by transforming data directly into the i2b2 database schema. Additionally, we now offer options for incremental data loading, suitable for data updates or gradual loading of large data sets, and for the deletion of small data subsets, e.g., in case of consent withdrawal.
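The idea behind incremental loading and subset deletion can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: it uses an in-memory SQLite database instead of a real i2b2 target, a heavily simplified stand-in for the i2b2 `observation_fact` table with only four columns, and hypothetical helper names (`load_increment`, `delete_patients`, `count_facts`).

```python
import sqlite3

# Simplified stand-in for the i2b2 observation_fact table (assumption:
# the real table has more columns and a different key).
SCHEMA = """
CREATE TABLE IF NOT EXISTS observation_fact (
    patient_num INTEGER NOT NULL,
    concept_cd  TEXT    NOT NULL,
    start_date  TEXT    NOT NULL,
    nval_num    REAL,
    PRIMARY KEY (patient_num, concept_cd, start_date)
)
"""


def load_increment(conn, facts):
    """Insert new facts and overwrite facts that already exist."""
    with conn:  # one transaction per increment
        conn.executemany(
            "INSERT OR REPLACE INTO observation_fact "
            "(patient_num, concept_cd, start_date, nval_num) "
            "VALUES (?, ?, ?, ?)",
            facts,
        )


def delete_patients(conn, patient_nums):
    """Remove all facts of the given patients, e.g. after consent withdrawal."""
    with conn:
        conn.executemany(
            "DELETE FROM observation_fact WHERE patient_num = ?",
            [(p,) for p in patient_nums],
        )


def count_facts(conn):
    return conn.execute("SELECT COUNT(*) FROM observation_fact").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Initial load.
    load_increment(conn, [(1, "LAB:CRP", "2023-01-05", 4.2),
                          (2, "LAB:CRP", "2023-01-06", 7.9)])
    # Increment: one corrected value, one new patient.
    load_increment(conn, [(2, "LAB:CRP", "2023-01-06", 8.1),
                          (3, "LAB:CRP", "2023-01-07", 2.5)])
    print(count_facts(conn))
    # Consent withdrawal for patient 2.
    delete_patients(conn, [2])
    print(count_facts(conn))
```

Keying each fact on patient, concept, and start date lets an increment update existing rows without reloading the whole data set, and deleting by patient touches only the affected rows.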
Lessons learned: We have integrated a multitude of data sets into i2b2 instances, e.g., data on neurological diseases such as MS, PD, and stroke, microbiome research data, and billing data. The handling of DQ issues differs: while some issues can be corrected, others should only be logged. We enabled our ETL pipeline (available at [5]) to integrate data sets from various sources quickly and easily, and we will continue to empirically evaluate and maintain it.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30. DOI: 10.1136/jamia.2009.000893
2. TheHyve. tranSMART Batch [Internet]. 2014 [cited 2023 Apr 26]. Available from: https://github.com/thehyve/transmart-batch
3. Spengler H, Lang C, Mahapatra T, Gatz I, Kuhn KA, Prasser F. Enabling Agile Clinical and Translational Data Warehousing: Platform Development and Evaluation. JMIR Med Inform. 2020;8(7):e15918. DOI: 10.2196/15918
4. Spengler H, Gatz I, Kohlmayer F, Kuhn KA, Prasser F. Improving Data Quality in Medical Research: A Monitoring Architecture for Clinical and Translational Data Warehouses. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS); 2020 Jul 28-30. p. 415–420. DOI: 10.1109/CBMS49503.2020.00085
5. DIFUTURE. ETL Pipeline [Internet]. GitLab; 2023 [cited 2023 Jun 07]. Available from: https://gitlab.com/DIFUTURE/etl-pipeline/-/tree/gmds-2023