gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

August 21-25, 2022, online

The Way Data Flows: Current Provenance Options in Collaborative Research

Meeting Abstract

  • Christian Henke - Department of Medical Informatics, University Medical Center Göttingen (UMG), Göttingen, Germany
  • Lennart Graf - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Alessandra Simone Kuntz - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Ulrich Sax - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Matthias Löbe - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), Leipzig, Germany
  • Hannes Ulrich - IT Center for Clinical Research, University of Lübeck, Lübeck, Germany; Institute of Medical Informatics, University of Lübeck, Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 205

doi: 10.3205/22gmds023, urn:nbn:de:0183-22gmds0236

Published: August 19, 2022

© 2022 Henke et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: In collaborative research, data are generated at different sites and in different environments. Metadata about the original source, the information model, and the extraction and transfer methods are essential for trustworthy further processing [1], [2]. We identified several suitable standards and toolkits, but provenance information has not yet been broadly adopted within the existing models [3]. This work examines whether the data models of selected current German research infrastructures contain enough information to populate a provenance core dataset in a meaningful way.

Methods: As part of the NMDR project, we developed a provenance core dataset to describe the provenance of datasets. We evaluated existing data models and standards to derive a first best-of-breed model for the provenance vocabulary. This model contains 19 data fields.

This provenance core dataset was evaluated against the data models of current research projects in Germany. We assessed the data models in two steps. First, we checked which items of our provenance vocabulary could be gathered directly from an exported dataset. Second, we supplemented the resulting provenance datasets through narrative interviews with transfer office staff and data stewards. We focused our investigation on the top-level provenance data for the (exported) datasets and data models.
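The two-step assessment can be illustrated with a minimal Python sketch. It is not the authors' tooling: the abstract does not list the 19 fields, so the field names, the example values, and the `assess` helper are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    direct: frozenset     # fields found in the exported dataset (step 1)
    interview: frozenset  # fields obtained only through interviews (step 2)
    missing: frozenset    # fields found by neither step


def assess(core_fields, exported_record, interview_answers):
    """Classify each core field by the step in which it could be filled."""
    core = frozenset(core_fields)
    # Step 1: a field counts as directly available if the export
    # carries a non-empty value for it.
    direct = frozenset(
        f for f in core if exported_record.get(f) not in (None, "")
    )
    # Step 2: interviews only add fields that step 1 did not cover.
    interview = frozenset(
        f for f in core - direct if interview_answers.get(f) not in (None, "")
    )
    return Assessment(direct, interview, core - direct - interview)


# Hypothetical placeholder fields and values, NOT the actual core dataset.
core_fields = [
    "source_system",
    "information_model",
    "extraction_method",
    "transfer_method",
]
exported = {"source_system": "EHR", "information_model": ""}
interviews = {
    "extraction_method": "ETL job",
    "information_model": "custom schema",
}

result = assess(core_fields, exported, interviews)
print(sorted(result.direct), sorted(result.interview), sorted(result.missing))
```

Keeping the three sets disjoint means that a field is counted once, in the earliest step that yields it, which matches how the direct and interview-derived counts are reported separately and then summed.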

Results: We were able to extract some of the provenance data fields from each evaluated data model, but in no case more than 7 of the 19 fields in our provenance core dataset. A Network University Medicine COVID-19 Data Exchange Platform (NUM CODEX) dataset provides 6 of 19 fields directly in the data model or exported dataset, a National Pandemic Cohort Network (NAPKON) dataset contains 7 of 19 fields, and a clinical trials dataset includes 5 of 19 fields. The narrative interviews revealed that transfer office staff can provide additional provenance information: 9 additional fields for the NUM CODEX dataset (total: 15 of 19 fields), 4 additional fields for the NAPKON dataset (total: 11 of 19 fields), and 11 additional fields for the clinical trials dataset (total: 16 of 19 fields).
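The reported counts can be checked with a short calculation. The sketch below uses only the numbers stated in the Results; the `coverage` helper and the dictionary layout are illustrative assumptions and not part of the study.

```python
# The provenance core dataset has 19 fields (see Methods).
CORE_FIELDS = 19

# dataset -> (fields found directly, additional fields from interviews),
# as reported in the Results paragraph.
reported = {
    "NUM CODEX": (6, 9),
    "NAPKON": (7, 4),
    "Clinical trials": (5, 11),
}


def coverage(direct, interview, total=CORE_FIELDS):
    """Return absolute and relative coverage of the core dataset."""
    combined = direct + interview
    return {
        "direct": direct,
        "combined": combined,
        "missing": total - combined,
        "direct_share": direct / total,
        "combined_share": combined / total,
    }


summary = {name: coverage(*counts) for name, counts in reported.items()}

for name, row in summary.items():
    print(
        f"{name}: {row['direct']}/{CORE_FIELDS} direct "
        f"({row['direct_share']:.0%}), "
        f"{row['combined']}/{CORE_FIELDS} with interviews "
        f"({row['combined_share']:.0%}), "
        f"{row['missing']} still missing"
    )
```

The combined counts reproduce the reported totals of 15, 11, and 16 of 19 fields, which leaves 4, 8, and 3 fields uncovered even after the interviews.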

Discussion: In evaluating our provenance core dataset, we found that the information that can be extracted directly from data models or exported datasets is insufficient for meaningful provenance data. However, some provenance metadata is already available, and further provenance information is implicitly available at the data transfer offices. This information should be added to the data models and exported datasets in the future to improve the integration and usability of provenance information.

Conclusion: Provenance is vital for establishing trust in received datasets. Much of the required provenance metadata already exists but needs to be added to the records to become available. Our analysis and core dataset are a first step toward a better representation of this much-needed information, but further improvements are required.

Funding: This work was partially funded by the DFG through the NMDR project (IN 50/3-2, FKZ, SA 1009/3-2, WI 1605/10-2).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Ragan ED, Endert A, Sanyal J, Chen J. Characterizing provenance in visualization and data analysis: an organizational framework of provenance types and purposes. IEEE Transactions on Visualization and Computer Graphics. 2016;22(1):31–40.
2. Wilkinson MD, Dumontier M, Mons B, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3:160018. DOI: 10.1038/sdata.2016.18
3. Parciak M, Bauer CR, Baum B, Kusch H, Sax U. Technical Aspects of Data Provenance in Clinical Trials. In: Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie, editor. 62. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Oldenburg, 17.-21.09.2017. Düsseldorf: German Medical Science GMS Publishing House; 2017. DocAbstr. 288. DOI: 10.3205/17gmds155