gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

Aspects of data warehousing pipeline improvements: Data quality, connectivity, performance

Meeting Abstract

  • Ingrid Martin - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
  • Viola Braunmüller - Medical Data Integration Center (meDIC), University Hospital Tübingen, Tübingen, Germany
  • Martin Boeker - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany
  • Helmut Spengler - Institute for Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, Medical Center rechts der Isar, Technical University of Munich, Munich, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 253

doi: 10.3205/23gmds135, urn:nbn:de:0183-23gmds1359

Published: September 15, 2023

© 2023 Martin et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Clinical data warehouses play an important role in providing medical data for project-specific research. One of the most commonly used clinical data warehouses is i2b2 (Informatics for Integrating Biology & the Bedside) [1]. Since medical data often show a high degree of heterogeneity on multiple levels, the process of data harmonization during extraction, transformation and loading (ETL) can be difficult. Thus, we enhanced an existing ETL pipeline for easier data integration into i2b2.

State of the art: A variety of ETL tools exist for loading data into i2b2, e.g., tranSMART Batch [2]. However, these tools have several limitations. The most common problems are insufficient code maintenance, a lack of compatibility with newer i2b2 versions, and structural issues that cause a large preprocessing overhead. Some of these issues were addressed by Spengler et al. [3]. While integrating our own data into i2b2, however, we found potential for further improvements to this ETL pipeline [3].

Concept: We identified three main challenges and addressed each of them: (1) handling data quality (DQ) issues in the source data, (2) support for new target and source systems, and (3) overall performance. Concerning DQ, we iteratively reloaded our data sets and discussed them with physicians and researchers, fine-tuning the pipeline in the process. We also added options to log DQ information, which gives researchers additional insight into their data and makes data correction easier. Regarding new target and source systems, we added support for newer i2b2 versions and made it possible to use a data lake as a source system. Finally, the main performance issue we addressed was long loading times for large data sets.
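One possible shape for such a DQ log is sketched below in Python. This is an illustration only, not the pipeline's actual code: the record layout, the names `DQIssue`, `DQReport` and `check_record`, and the two example rules (an empty timestamp that may be corrected, a non-numeric value that is only logged) are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DQIssue:
    record_id: str
    field_name: str
    message: str
    corrected: bool  # True if the pipeline fixed the value, False if it only logged it


@dataclass
class DQReport:
    issues: list = field(default_factory=list)

    def log(self, record_id: str, field_name: str, message: str, corrected: bool) -> None:
        self.issues.append(DQIssue(record_id, field_name, message, corrected))


def check_record(record: dict, report: DQReport,
                 fallback_date: Optional[date] = None) -> dict:
    """Return a copy of the record; correct what can be corrected, log the rest."""
    cleaned = dict(record)
    rid = str(record.get("id", "?"))

    # Hypothetical correctable issue: empty timestamp on an otherwise valid record.
    if not cleaned.get("start_date"):
        if fallback_date is not None:
            cleaned["start_date"] = fallback_date.isoformat()
            report.log(rid, "start_date", "empty timestamp replaced by fallback", True)
        else:
            report.log(rid, "start_date", "empty timestamp", False)

    # Hypothetical log-only issue: a value that cannot be parsed as a number.
    value = cleaned.get("value")
    try:
        float(value)
    except (TypeError, ValueError):
        report.log(rid, "value", f"non-numeric value {value!r}", False)

    return cleaned
```

Keeping the `corrected` flag on every entry lets researchers see afterwards which values were changed during loading and which were left untouched for manual review.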

Implementation: We added new features to handle DQ problems, e.g., an option to correct empty timestamps for otherwise valid data. We also incorporated checks into the pipeline that measure DQ using the DQ API [4]. The ETL pipeline now supports newer i2b2 versions up to 1.7.13. It automatically detects the version of the target system and transforms the data accordingly. We extended the possible input sources with an option to connect to a data lake, so that JSON data blocks and modified HL7v2 messages can also be processed. In terms of performance, we greatly reduced loading times by transforming data directly into the i2b2 database schema. Additionally, we now offer options for incremental data loading, suitable for data updates or the gradual loading of large data sets, and for the deletion of small data subsets, e.g., in case of consent withdrawal.
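The idea behind incremental loading and subset deletion can be illustrated with a small Python sketch. It is not taken from the pipeline: it uses an in-memory SQLite database instead of a real i2b2 instance, the table is a heavily simplified stand-in for i2b2's `observation_fact` (only four columns and a reduced key), and the function names and concept codes are hypothetical.

```python
import sqlite3

# Simplified stand-in for i2b2's observation_fact table (assumption: the real
# table has more columns and a wider primary key).
SCHEMA = """
CREATE TABLE observation_fact (
    patient_num INTEGER NOT NULL,
    concept_cd  TEXT    NOT NULL,
    start_date  TEXT    NOT NULL,
    nval_num    REAL,
    PRIMARY KEY (patient_num, concept_cd, start_date)
)
"""


def load_incrementally(conn: sqlite3.Connection, facts: list) -> None:
    """Insert new facts and overwrite facts whose key already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO observation_fact "
        "(patient_num, concept_cd, start_date, nval_num) VALUES (?, ?, ?, ?)",
        facts,
    )
    conn.commit()


def delete_patient(conn: sqlite3.Connection, patient_num: int) -> int:
    """Remove all facts of one patient, e.g. after consent withdrawal."""
    cur = conn.execute(
        "DELETE FROM observation_fact WHERE patient_num = ?", (patient_num,)
    )
    conn.commit()
    return cur.rowcount  # number of deleted rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Initial load, followed by an update that changes one fact and adds another.
    load_incrementally(conn, [(1, "LAB:GLU", "2023-01-01", 5.4),
                              (2, "LAB:GLU", "2023-01-02", 6.1)])
    load_incrementally(conn, [(1, "LAB:GLU", "2023-01-01", 5.6),
                              (1, "LAB:GLU", "2023-02-01", 5.0)])
    print(delete_patient(conn, 1))
```

Because only the affected rows are written or deleted, neither an update nor a consent withdrawal requires reloading the full data set in this sketch.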

Lessons learned: We have integrated a multitude of data sets into i2b2 instances, e.g., data on neurological diseases such as MS, PD, and stroke, microbiome research data, and billing data. DQ issues call for different handling: while some can be corrected, others should only be logged. Our ETL pipeline (available at [5]) can now integrate data sets from various sources quickly and easily, and we will continue to evaluate it empirically and to maintain it.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30. DOI: 10.1136/jamia.2009.000893
2. TheHyve. tranSMART Batch [Internet]. 2014 [cited 2023 Apr 26]. Available from: https://github.com/thehyve/transmart-batch
3. Spengler H, Lang C, Mahapatra T, Gatz I, Kuhn KA, Prasser F. Enabling Agile Clinical and Translational Data Warehousing: Platform Development and Evaluation. JMIR Med Inform. 2020;8(7):e15918. DOI: 10.2196/15918
4. Spengler H, Gatz I, Kohlmayer F, Kuhn KA, Prasser F. Improving Data Quality in Medical Research: A Monitoring Architecture for Clinical and Translational Data Warehouses. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS); 2020 Jul 28-30. p. 415–420. DOI: 10.1109/CBMS49503.2020.00085
5. DIFUTURE. ETL Pipeline [Internet]. GitLab; 2023 [cited 2023 Jun 07]. Available from: https://gitlab.com/DIFUTURE/etl-pipeline/-/tree/gmds-2023