gms | German Medical Science

63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

02. - 06.09.2018, Osnabrück

PROV@TOS, a Java Wrapper to capture provenance for Talend Open Studio jobs

Meeting Abstract

  • Marcel Parciak - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Christian R. Bauer - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Robert Lodahl - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Caroline Thoms - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Harald Kusch - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Sabine Rey - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Ulrich Sax - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 197

doi: 10.3205/18gmds096, urn:nbn:de:0183-18gmds0961

Published: August 27, 2018

© 2018 Parciak et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Introduction: Clinical studies in medicine aim to derive knowledge from a growing number of diverse datasets. Using these data frequently requires data integration processes, which directly affect the quality of the research outcome. Increasing the transparency and reproducibility of these processes supports trust in the outcomes and enables meta-analysis of the integration process. To achieve this, provenance (a record of the creation, transformation and all other influences regarding an object [1], [2]) can be captured and shared [3], [4], [5], which is an integral part of complying with the FAIR principles [6].

A commonly used tool to integrate data is Talend Open Studio for Data Integration (TOS). We aimed to enhance TOS to make it provenance-aware, in order to capture fine-grained provenance without modification of the data integration pipelines themselves.

Methods: To model and store provenance, W3C-PROV is applied [1]. The PROV core concept involves entities, activities, agents and their inter-relations. Although it is possible to extend PROV to tailor it to the needs of specific domains, we decided to use PROV without extensions.
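
To illustrate the PROV core concepts named above, the following minimal sketch models entities, activities, agents and three of their relations in plain Java and prints them as PROV-N-like statements. It is an assumption-laden illustration only: the class ProvSketch, the ex: prefix and all identifiers are hypothetical, and the sketch does not use the ProvToolbox API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal model of the PROV core concepts (not ProvToolbox). */
final class ProvSketch {
    private final List<String> statements = new ArrayList<>();

    // The three core element types.
    void entity(String id) { statements.add("entity(" + id + ")"); }
    void agent(String id) { statements.add("agent(" + id + ")"); }
    void activity(String id, String start, String end) {
        statements.add("activity(" + id + ", " + start + ", " + end + ")");
    }

    // Three core relations; "-" marks an optional argument that is left out.
    void used(String activity, String entity) {
        statements.add("used(" + activity + ", " + entity + ", -)");
    }
    void wasGeneratedBy(String entity, String activity) {
        statements.add("wasGeneratedBy(" + entity + ", " + activity + ", -)");
    }
    void wasAssociatedWith(String activity, String agent) {
        statements.add("wasAssociatedWith(" + activity + ", " + agent + ", -)");
    }

    /** Renders all statements inside a PROV-N-style document wrapper. */
    String toProvN() {
        StringBuilder sb = new StringBuilder("document\n");
        sb.append("  prefix ex <http://example.org/>\n");
        for (String s : statements) {
            sb.append("  ").append(s).append('\n');
        }
        return sb.append("endDocument\n").toString();
    }

    /** Builds a tiny example: one activity reads one file and writes another. */
    static ProvSketch example() {
        ProvSketch p = new ProvSketch();
        p.entity("ex:input.csv");
        p.entity("ex:output.csv");
        p.agent("ex:java-runtime");
        p.activity("ex:transform", "2018-09-02T10:00:00", "2018-09-02T10:00:05");
        p.used("ex:transform", "ex:input.csv");
        p.wasGeneratedBy("ex:output.csv", "ex:transform");
        p.wasAssociatedWith("ex:transform", "ex:java-runtime");
        return p;
    }

    public static void main(String[] args) {
        System.out.print(example().toProvN());
    }
}
```

Keeping to these unextended core statements is what allows generic PROV tooling to consume the result without domain-specific knowledge.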

As part of our data curation toolset, TOS is central to our data integration pipelines, which lets us capture fine-grained provenance at a coordination point [7]. Using visual components and connectors, TOS generates executable Java code (a job). Although the generated jobs vary in functionality, their general structure remains similar.

To make TOS provenance-aware, the jobs are modified at execution time to capture provenance using a data model provided by ProvToolbox [8], which makes serializations such as XML and RDF available. We aimed at simplicity of use: the existing job definitions themselves should not need to be changed.

Results: To capture provenance from running TOS jobs, we introduce PROV@TOS, a Java wrapper that executes exported TOS jobs and stores standardized provenance information. A TOS component is modelled as an activity, input and output data as entities, and influencers such as the Java version as agents. Start and end times of components are recorded in HashMaps from the java.util package. Extended HashMaps have been implemented to store relevant provenance data; these maps are injected using the Java Reflection API. Where possible, entities are accessed and identified using the hash of the referenced file. PROV@TOS is openly available at https://gitlab.gwdg.de/medinfpub/tos/provAtTos.
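
The injection idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PROV@TOS implementation: FakeJob is a hypothetical stand-in for an exported job, and the field names startHash and endHash, the component name tFileInput_1 and the classes RecordingMap and WrapperSketch are invented for the example.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Extended HashMap that logs every put so a wrapper can read the events later. */
final class RecordingMap extends HashMap<String, Long> {
    private static final long serialVersionUID = 1L;
    private final String kind;
    private final List<String> events;

    RecordingMap(String kind, List<String> events) {
        this.kind = kind;
        this.events = events;
    }

    @Override
    public Long put(String component, Long timestamp) {
        events.add(kind + ":" + component); // provenance hook
        return super.put(component, timestamp);
    }
}

/** Hypothetical stand-in for an exported job (assumption, not generated TOS code). */
final class FakeJob {
    private Map<String, Long> startHash = new HashMap<>();
    private Map<String, Long> endHash = new HashMap<>();

    void run() {
        startHash.put("tFileInput_1", System.currentTimeMillis());
        endHash.put("tFileInput_1", System.currentTimeMillis());
    }
}

final class WrapperSketch {
    /** Replaces a private field of the job with the given object via reflection. */
    static void inject(Object job, String fieldName, Object replacement) {
        try {
            Field field = job.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            field.set(job, replacement);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot inject " + fieldName, e);
        }
    }

    /** Runs the unmodified job class with recording maps swapped in. */
    static List<String> runWithProvenance() {
        List<String> events = new ArrayList<>();
        FakeJob job = new FakeJob();
        inject(job, "startHash", new RecordingMap("start", events));
        inject(job, "endHash", new RecordingMap("end", events));
        job.run();
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runWithProvenance());
    }
}
```

The job class itself is left untouched; only the objects it writes to at run time are exchanged, which matches the goal of not modifying existing jobs.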

Discussion: PROV@TOS has been successfully tested on several well-established TOS jobs from different projects within our department [9]. We plan to put it into productive use within the UMG medical data integration center (UMG-MeDIC), making our data integration processes provenance-aware wherever TOS is used. Because PROV@TOS extends inherent TOS components, no further job modifications are required. Because we use "plain vanilla" W3C-PROV, tools like PROV-O-Viz [10] were readily available to visualize the output data. Identifying entities by file hash enables provenance stitching across the whole integration pipeline [11]. In the future, we plan to extend this feature with a data store that provides persistent unique identifiers, to avoid collisions caused by identical file hashes.
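
Hash-based entity identification, and the collision it can cause, can be illustrated with a short sketch. The choice of SHA-256, the ex:file- identifier prefix and the class FileHashId are assumptions made for this example; the abstract does not state which hash function or identifier scheme PROV@TOS uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that derives entity identifiers from file content. */
final class FileHashId {
    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Builds an identifier from the file content only; the path is ignored. */
    static String entityId(Path file) {
        try {
            return "ex:file-" + sha256Hex(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two distinct files with identical content receive the same identifier.
        Path first = Files.createTempFile("job-output", ".csv");
        Path second = Files.createTempFile("job-input", ".csv");
        byte[] rows = "id;value\n1;42\n".getBytes(StandardCharsets.UTF_8);
        Files.write(first, rows);
        Files.write(second, rows);
        System.out.println(entityId(first).equals(entityId(second)));
        Files.delete(first);
        Files.delete(second);
    }
}
```

Identical identifiers are what allow the output of one job to be matched with the input of the next, but they also mean that two unrelated files with the same content cannot be told apart, which is the collision case a store of persistent unique identifiers would address.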

The PROV@TOS based metadata extension of TOS jobs has the potential to increase transparency and reproducibility of research data.

Acknowledgements: This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the research and funding concepts of the Medical Informatics Initiative (01ZZ1802B/HiGHmed), i:DSem (031L0024A/MyPathSem) and the DFG for the Collaborative Research Centre 1002 on Modulatory Units in Heart Failure, subproject INF.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Moreau L, Missier P. PROV-DM: The PROV Data Model. 2013 [accessed April 09, 2018]. http://www.w3.org/TR/2013/REC-prov-dm-20130430/
2.
Pérez B, Rubio J, Sáenz-Adán C. A systematic review of provenance systems. Knowl Inf Syst. 2018:1–49. DOI: 10.1007/s10115-018-1164-3
3.
Ragan ED, Endert A, Sanyal J, Chen J. Characterizing Provenance in Visualization and Data Analysis: An Organizational Framework of Provenance Types and Purposes. IEEE Transactions on Visualization and Computer Graphics. 2016;22:31–40. DOI: 10.1109/TVCG.2015.2467551
4.
Baum B, et al. Opinion paper: Data provenance challenges in biomedical research. It - Information Technology. 2017;59:191–196. DOI: 10.1515/itit-2016-0031
5.
Parciak M. Provenancekonzept für Datenbestände aus einer heterogenen Forschungsinfrastuktur (am Beispiel einer klinischen Forschergruppe) [Master Thesis]. Georg-August-Universität Göttingen; 2017.
6.
Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3. DOI: 10.1038/sdata.2016.18
7.
Chapman A, Blaustein BT, Seligman L, Allen MD. PLUS: A provenance manager for integrated information. In: 2011 IEEE International Conference on Information Reuse Integration; 3-5 Aug 2011; Las Vegas, NV, USA. IEEE; 2011. p. 269–75. DOI: 10.1109/IRI.2011.6009558
8.
Moreau L. ProvToolbox: Java library to create and convert W3C PROV data model representations. 2017 [accessed April 09, 2018]. https://github.com/lucmoreau/ProvToolbox
9.
Bauer CRKD, Ganslandt T, Baum B, Christoph J, Engel I, Löbe M, et al. Integrated Data Repository Toolkit (IDRT). A Suite of Programs to Facilitate Health Analytics on Heterogeneous Medical Data. Methods Inf Med. 2016;55:125–35. DOI: 10.3414/ME15-01-0082
10.
Hoekstra R, Groth P. PROV-O-Viz - Understanding the Role of Activities in Provenance. In: Provenance and Annotation of Data and Processes. 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014. New York: Springer; 2015. (Lecture Notes in Computer Science; 8628). p. 215–220. DOI: 10.1007/978-3-319-16462-5_18
11.
Missier P, et al. Linking multiple workflow provenance traces for interoperable collaborative science. In: 5th Workshop on Workflows in Support of Large-Scale Science, New Orleans, LA, USA. IEEE; 2010. DOI: 10.1109/WORKS.2010.5671861