gms | German Medical Science

63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

02. - 06.09.2018, Osnabrück

FAIR conform ETL processing in translational research

Meeting Abstract

  • Theresa Bender - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Christian R Bauer - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Marcel Parciak - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Robert Lodahl - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Ulrich Sax - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 254

doi: 10.3205/18gmds095, urn:nbn:de:0183-18gmds0951

Published: 27 August 2018

© 2018 Bender et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Clinical trial data usually requires some kind of ETL (Extract, Transform, Load) process before it is available for analyses beyond the originating data capture system. Such processes transform and alter the data in several ways, for example by cleaning or enriching it. The umbrella term provenance summarizes the changes generated by these processes. Data provenance describes domain-specific changes to data, whereas workflow or process provenance denotes a generic way of modeling the processes that cause those changes [1]. Following Good Clinical Practice (GCP) [2], both types of provenance are indispensable for reproducibility and for the quality of subsequent analyses [3], [4].

Recently, the FAIR Guiding Principles [5] were developed as a guideline to make research data Findable, Accessible, Interoperable and Reusable. FAIR focuses on data provenance and mentions workflow provenance only briefly.

We implemented workflows that prepare clinical trial data for further analyses with those principles in mind, aiming to describe the workflows in a GCP-conforming, sustainable manner.

Methods: Applying our local research data management pipeline [6] involved creating and maintaining multiple ETL workflows. The majority were implemented in the open-source software Talend Open Studio for Data Integration (TOS), which generates Java code and is operated through an Eclipse-based GUI (https://www.talend.com/).

Git (https://git-scm.com/) was used to address the workflow provenance of TOS. However, Git support for TOS workflows (Jobs) is available only in the commercial versions. Thus, Jobs were checked in to a local GitLab (https://gitlab.com/) instance as individual “item exports”, which contain all user-created content. Every Job was distributed in a separate Git repository, with versions pushed in parallel to the internal TOS versioning. Commit messages were used to reference the current TOS software version.

Furthermore, a readme file was maintained at the root level of each project, explaining the configuration and usage of the provided source code. Any additional files needed outside TOS were either referenced or uploaded as well.
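The check-in procedure described above can be sketched as follows. This is a minimal illustration, not the authors' actual tooling: the function names, the Job name, the version strings and the layout of the commit message are assumptions made for this example. Only the plain `git add`, `git commit -m` and `git push` invocations are standard Git. The sketch builds the commands without executing them.

```python
from pathlib import Path
from typing import List


def build_commit_message(job_name: str, job_version: str, tos_version: str) -> str:
    """Compose a commit message that records the TOS-internal Job version
    and the TOS software version (the message layout is an assumption)."""
    return f"{job_name} v{job_version} (item export, TOS {tos_version})"


def git_commands(repo_dir: Path, message: str) -> List[List[str]]:
    """Return the Git invocations that check an unpacked item export into
    the Job's own repository. Nothing is executed here."""
    base = ["git", "-C", str(repo_dir)]
    return [
        base + ["add", "--all"],
        base + ["commit", "-m", message],
        base + ["push", "origin", "HEAD"],
    ]


if __name__ == "__main__":
    # Hypothetical Job name and version numbers, for illustration only.
    msg = build_commit_message("load_trial_data", "0.3", "7.0.1")
    for cmd in git_commands(Path("jobs/load_trial_data"), msg):
        print(" ".join(cmd))
```

Keeping one repository per Job means that every commit in a repository corresponds to exactly one Job version, so the Git history can be read side by side with the internal TOS version numbers.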

Results: Our local ETL processes are uploaded to GitLab on a regular basis. Every project has its own project ID, which is unique within the instance. Additionally, every commit carries a unique identifier. Hence, a specific state of the source code can be referenced. Specific authentication and location parameters (e.g. database username, IP address) are stored separately in TOS as a context. The collaborative creation of TOS Jobs is feasible and is now frequently practiced in our department.
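As an illustration of how the pair of project ID and commit identifier pins one state of a Job, the following sketch turns such a pair into a resolvable address. The class name, the example host `gitlab.example.org` and the example values are hypothetical. The path follows the commit resource of the GitLab REST API (v4), and whether a given local instance exposes it depends on its configuration.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class WorkflowReference:
    """Pins one state of a Job: GitLab project ID plus commit identifier."""
    project_id: int
    commit_sha: str

    def api_url(self, base_url: str) -> str:
        """URL of the GitLab REST API (v4) resource for this commit."""
        return (
            f"{base_url.rstrip('/')}/api/v4/projects/{self.project_id}"
            f"/repository/commits/{quote(self.commit_sha, safe='')}"
        )


if __name__ == "__main__":
    # Hypothetical project ID, commit identifier and host.
    ref = WorkflowReference(project_id=42, commit_sha="a1b2c3d4")
    print(ref.api_url("https://gitlab.example.org/"))
```

A reference of this kind could be cited in an analysis report so that the exact workflow version behind a data set can be retrieved later.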

Discussion: The presented approach is a first step towards FAIR ETL workflows. The GitLab implementation provides a robust basis for versioning and access restriction. Creating contexts for authentication and location parameters is labor-intensive and has to be supported by a guideline for all TOS creators. Since the item export feature of TOS also exports the user-specific context, this data has to be separated from the general workflow properties when using GitLab and its user access implementation, e.g. by using additional Git repositories.

This approach will be evaluated in several research projects in order to assess its contribution to transparency and reproducibility on the one hand, and its usability and the burden on users on the other.
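The separation of user-specific context data from general workflow properties raised in the Discussion could be prepared by a step like the one sketched below, which sorts the files of an unpacked item export into two groups so that each group can go to its own Git repository. The directory name `context` and the example layout are assumptions about how an export is organized and would need to be checked against the actual export.

```python
from pathlib import Path
from typing import List, Tuple


def split_export(
    export_dir: Path, context_dirname: str = "context"
) -> Tuple[List[Path], List[Path]]:
    """Partition the files of an unpacked item export into
    (workflow_files, context_files), as paths relative to export_dir.

    A file counts as context if any directory on its relative path is
    named `context_dirname` (an assumed layout, adjust as needed)."""
    workflow: List[Path] = []
    context: List[Path] = []
    for path in sorted(export_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(export_dir)
        # rel.parts[:-1] holds the directory components only.
        if context_dirname in rel.parts[:-1]:
            context.append(rel)
        else:
            workflow.append(rel)
    return workflow, context
```

The workflow files could then be committed to the shared Job repository, while the context files go to a repository with stricter access rights or stay local.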

Acknowledgements: This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the research and funding concepts e:Med (01ZX1606C/sysINFLAME) and i:DSem (031L0024A/MyPathSem) and of the Medical Informatics Initiative (01ZZ1802B/HiGHmed).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Herschel M, Hlawatsch M. Provenance: On and Behind the Screens. In: Proceedings of the 2016 International Conference on Management of Data. 2016:2213–7. DOI: 10.1145/2882903.2912568.
2. World Health Organization. Guidelines for good clinical practice (GCP) for trials on pharmaceutical products. WHO Technical Report Series. 1995;(850):97–137.
3. Ragan ED, Endert A, Sanyal J, Chen J. Characterizing Provenance in Visualization and Data Analysis: An Organizational Framework of Provenance Types and Purposes. IEEE Transactions on Visualization and Computer Graphics. 2016;22(1):31–40. DOI: 10.1109/TVCG.2015.2467551.
4. Katerbow M, Feulner G. Recommendations On The Development, Use And Provision Of Research Software. 2018. DOI: 10.5281/zenodo.1172988.
5. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3. DOI: 10.1038/sdata.2016.18.
6. Bauer CR, Umbach N, Baum B, Buckow K, Franke T, Grütz R, et al. Architecture of a Biomedical Informatics Research Data Management Pipeline. Stud Health Technol Inform. 2016;228:262–6.