gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Using structured ETL-information for validating data integration

Meeting Abstract

  • Erik Tute - Peter L. Reichertz Institut für Medizinische Informatik der Technischen Universität Braunschweig und der Medizinischen Hochschule Hannover, Hannover, Germany
  • Matthias Gietzelt - Peter L. Reichertz Institut für Medizinische Informatik der Technischen Universität Braunschweig und der Medizinischen Hochschule Hannover, Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 246

doi: 10.3205/20gmds196, urn:nbn:de:0183-20gmds1966

Published: 26 February 2021

© 2021 Tute et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Background: The German Medical Informatics Initiative aims to develop infrastructure for integrating clinical data from patient care and medical research. HiGHmed, as part of this initiative, builds medical data integration centers that integrate various clinical data to make them accessible for reuse [1]. Data integration is often referred to as an ETL-process, named after its three phases: Extracting data from the source, Transforming it into the required form and Loading it into the target database. Since implementing an ETL-process is often a complex task, errors occur and may not be apparent. Thus, ETL-validation is advisable (cf. [2], [3]). The objective of this work is to present findings from using structured information about the transformations applied during the ETL-process to validate data integration.

Methods: The setting was a study on predictive biomarkers for kidney transplant rejection. The study's dataset was integrated into an openEHR-based data repository for further analysis, dissemination and reuse, using an open-source tool for data integration [4]. A web app developed for this purpose converts the integrated data into a form comparable to the source data for source data verification (SDV). Source data (CSV files) and integrated data were compared, and the ETL-process was refined based on SDV findings until SDV revealed no more issues. Finally, an additional data quality (DQ) assessment applying summary statistics completed the ETL-validation.

Results: The SDV functionality automatically converts the integrated data into a form comparable to the source data, using the structured information for the ETL-tool that describes the transformations to apply to each variable. SDV compares the data row by row, highlighting discrepancies with colors and additionally providing summaries per column. Green indicates identical values, yellow indicates differing values that are plausible given the transformation instructions for the variable, and red indicates unexpected differences. Summary statistics show the number of differing values and meta-information such as the variable transformations. SDV revealed a number of issues in the ETL-process that were not apparent from manual inspection, leading to several iterative ETL-process improvements. An example of a non-obvious issue was uploading the data with dates in ISO 8601 format without a time zone indicator, which led to valid but shifted dates. The final DQ-assessment based on summary statistics revealed a DQ-issue not discovered using SDV: a small number of values used a different token (“ND”) to indicate a null-flavor, and this token was not treated as a null-flavor in the ETL-process. Since source data and integrated data both showed the same values, SDV did not indicate an issue, but summary statistics revealed the undesired “ND” value.
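The color scheme described above can be illustrated with a minimal Python sketch. It is not the authors' web app: the `TRANSFORMS` dictionary, the column names and the sample rows are hypothetical, and the sketch assumes that a transformation can be expressed as a simple value mapping per column, which the abstract does not specify.

```python
from collections import Counter

# Hypothetical transformation descriptions: per column, a mapping from
# source values to the values expected after ETL. The real ETL tool's
# format is not given in the abstract.
TRANSFORMS = {
    "sex": {"m": "MALE", "f": "FEMALE"},
}

def classify(column, source_value, integrated_value, transforms=TRANSFORMS):
    """Return 'green', 'yellow' or 'red' for one pair of cells."""
    if source_value == integrated_value:
        return "green"  # identical values
    mapping = transforms.get(column, {})
    if source_value in mapping and mapping[source_value] == integrated_value:
        return "yellow"  # differs, but explained by the transformation
    return "red"  # unexpected difference

def compare(source_rows, integrated_rows, transforms=TRANSFORMS):
    """Compare rows pairwise; return per-cell colors and per-column counts."""
    cells = []
    summary = {}
    for src, tgt in zip(source_rows, integrated_rows):
        row = {}
        for col, value in src.items():
            color = classify(col, value, tgt.get(col), transforms)
            row[col] = color
            summary.setdefault(col, Counter())[color] += 1
        cells.append(row)
    return cells, summary

# Made-up sample data: the second creatinine value was altered during ETL.
source = [
    {"sex": "m", "creatinine": "1.2"},
    {"sex": "f", "creatinine": "0.9"},
]
integrated = [
    {"sex": "MALE", "creatinine": "1.2"},
    {"sex": "FEMALE", "creatinine": "9.0"},
]

cells, summary = compare(source, integrated)
for col, counts in summary.items():
    print(col, dict(counts))
```

In this sketch the `sex` column is reported as yellow in both rows because the mapping explains the difference, while the altered `creatinine` value is flagged red.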

Conclusion: SDV based on structured information describing transformations is useful for indicating data integration issues. If structured information on the applied transformations is available, SDV is a low-cost and worthwhile validation method. Although one might rashly assume that comparing source and integrated data is some kind of gold standard, it is not sufficient on its own and needs to be complemented with other methods, since some types of DQ-issue remain hidden. The finding that ETL-process issues are likely to be found when starting ETL-validation is consistent with findings from the literature (e.g. [3], [5]).
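Why a summary statistic can catch what a source-to-target comparison misses can be shown with a short Python sketch. It is only an illustration under assumptions: the set `KNOWN_NULL_FLAVORS`, the helper names and the sample column are hypothetical and do not reproduce the DQ-assessment used in the study.

```python
from collections import Counter

# Assumed set of null-flavor tokens that the ETL-process treats as missing;
# hypothetical, chosen for illustration only.
KNOWN_NULL_FLAVORS = {"", "NA"}

def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def non_numeric_summary(values, known_null_flavors=KNOWN_NULL_FLAVORS):
    """Count values that are neither numbers nor known null-flavors."""
    return Counter(
        v for v in values
        if not is_number(v) and v not in known_null_flavors
    )

# Made-up numeric column containing the unexpected token "ND".
source_column = ["1.2", "0.9", "ND", "NA", "1.4", "ND"]
integrated_column = list(source_column)  # loaded unchanged by the ETL-process

# A cell-by-cell comparison finds no difference ...
differences = sum(s != t for s, t in zip(source_column, integrated_column))
# ... but a frequency summary of unexpected values exposes the token.
unexpected = non_numeric_summary(integrated_column)
print(differences, dict(unexpected))
```

Because both columns are identical, the comparison reports zero differences, whereas the frequency summary lists the unhandled “ND” token.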

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Haarbrandt B, Schreiweis B, Rey S, Sax U, Scheithauer S, Rienhoff O, et al. HiGHmed – An Open Platform Approach to Enhance Care and Research across Institutional Boundaries. Methods Inf Med. 2018;57:e66–81.
2. Kahn MG, Brown JS, Chun AT, Davidson BN, Meeker D, Ryan PB, et al. Transparent reporting of data quality in distributed data networks. EGEMS (Wash DC). 2015;3:1052. DOI: 10.13063/2327-9214.1052
3. Khare R, Utidjian LH, Razzaghi H, Soucek V, Burrows E, Eckrich D, et al. Design and Refinement of a Data Quality Assessment Workflow for a Large Pediatric Research Network. EGEMS (Wash DC). 2019;7:36. DOI: 10.5334/egems.294
4. Tute E, Haarbrandt B. Integrating relational data into clinical information model based data repositories. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie, editor. 62. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Oldenburg, 17.-21.09.2017. Düsseldorf: Medical Science GMS Publishing House; 2017. DOI: 10.3205/17gmds145
5. Welch G, Recklinghausen FV, Taenzer A, Savitz L, Weiss L. Data cleaning in the evaluation of a multi-site intervention project. EGEMS (Wash DC). 2017;5:4. DOI: 10.5334/egems.196