gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

Enhancing translational research projects and patient care with ETL pipelines for genomic and clinical data

Meeting Abstract

  • Jonas Hügel - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany; Campus-Institute Data Science (CIDAS), Göttingen, Germany
  • Nils Beyer - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Theresa Bender - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Lennart Graf - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Sophia Rheinländer - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
  • Ulrich Sax - Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany; Campus-Institute Data Science (CIDAS), Göttingen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 128

doi: 10.3205/22gmds049, urn:nbn:de:0183-22gmds0498

Published: 19 August 2022

© 2022 Hügel et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Many translational cancer research projects use genomic data obtained directly from patients or from patient-derived models [1], [2], [3], [4]. Because this data is sensitive, it needs to be stored in secured network segments, yet it must remain available to researchers and clinicians. In addition, genomic data is increasingly used in patient care [1], e.g., in therapy boards. Making this data available via extract, transform, load (ETL) processes is therefore a challenging but important way to enhance these projects. In the Molecular Tumorboard (MTB)-Report project and the CRU5002, we aim to establish pipelines that integrate clinical and genomic data into data marts such as tranSMART [2] and cBioPortal [5], [6] (Figure 1).

Methods: Across projects, we developed different ETL methods to extract and store genomic data. In the MTB-Report project, we import variant data directly from files on a molecular pathology server into the cancer documentation system Onkostar. The import is triggered manually from an Onkostar form, which sends the required patient information to a server in a secured network segment that has read access to the pathology server. This server copies the corresponding files locally, writes the required variant information to a CSV file, and imports it into Onkostar via a REST API. In the CRU5002, FASTQ files are imported via a web interface into a CDSTAR [5] data lake instance, and subsequently generated data are stored in a SEEK [7], [8] instance in a secured network segment to make the data Findable, Accessible, Interoperable and Reusable (FAIR) [4], [9], [10].
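The CSV step described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the abstract does not state the format of the pathology files or the fields Onkostar expects, so the sketch assumes VCF-formatted input and a hypothetical column set (`patient_id`, `chrom`, `pos`, `ref`, `alt`). The function names are likewise illustrative, and the REST upload is omitted because the Onkostar API is not described here.

```python
import csv
import io

# Hypothetical column set; the abstract does not say which fields Onkostar needs.
CSV_FIELDS = ["patient_id", "chrom", "pos", "ref", "alt"]


def extract_variants(vcf_text, patient_id):
    """Return one dict per alternate allele found in VCF-formatted text.

    Assumes the standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    """
    rows = []
    for line in vcf_text.splitlines():
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _variant_id, ref, alt = line.split("\t")[:5]
        # A VCF record may list several alternate alleles separated by commas.
        for allele in alt.split(","):
            rows.append(
                {
                    "patient_id": patient_id,
                    "chrom": chrom,
                    "pos": int(pos),
                    "ref": ref,
                    "alt": allele,
                }
            )
    return rows


def variants_to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a pipeline like the one described, the resulting CSV text would then be handed to whatever import endpoint the target system provides.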

We regularly exported a subset of the Onkostar data and imported it via a script into a tranSMART instance. During this process, we enriched the CRU5002 data with genomic data. Moreover, we are currently working on an import into our local cBioPortal instance.

Results: We imported clinical and genomic data from about 300 patients in MTB-Report and clinical data for 90 patients in the CRU5002. Since February 2022, we have imported data for around 15 patients every two weeks into Onkostar; previously, this data was entered manually. The data imported into Onkostar is also used in the MTB at the University Medical Center Göttingen (UMG). Moreover, the automated import reduced the time needed to document a patient's variants from an hour to minutes.

Discussion: Making the genomic and clinical data in the CRU5002 “FAIR” generates benefits for all researchers in the project.

Additionally, the automated import into Onkostar increased the data quality for the MTB, since the study nurse no longer needs to enter the data manually. Beyond the actual data, we see an urgent need to add metadata about data sources, pipelines, and information models. One example is the export of analyzed panel data: the data format stayed the same, but changing the reference genome used for interpretation from HG19 to HG38 led to large differences [11].
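One way to make such a reference genome change visible is to attach a small provenance record to every export and to check it before datasets are combined. The following Python sketch is only an illustration under stated assumptions: the `Provenance` fields, the `common_build` helper and the alias table are hypothetical and not taken from the projects described here. The alias table relies on the common convention that hg19 corresponds to GRCh37 and hg38 to GRCh38.

```python
from dataclasses import dataclass

# Common aliases between UCSC (hg19/hg38) and GRC (GRCh37/GRCh38) naming.
BUILD_ALIASES = {
    "hg19": "GRCh37",
    "grch37": "GRCh37",
    "hg38": "GRCh38",
    "grch38": "GRCh38",
}


@dataclass(frozen=True)
class Provenance:
    """Hypothetical minimal metadata attached to one exported dataset."""

    source: str
    pipeline: str
    pipeline_version: str
    reference_genome: str

    def build(self):
        """Return the normalized reference build name."""
        key = self.reference_genome.strip().lower()
        if key not in BUILD_ALIASES:
            raise ValueError(f"unknown reference genome: {self.reference_genome!r}")
        return BUILD_ALIASES[key]


def common_build(records):
    """Return the shared reference build, or raise if the records disagree."""
    builds = {record.build() for record in records}
    if len(builds) != 1:
        raise ValueError(f"expected exactly one reference build, found: {sorted(builds)}")
    return builds.pop()
```

With such a check in place, two exports in the same file format but based on different reference builds would be rejected before import instead of being merged silently.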

Conclusion: Developing ETL processes enhances translational research projects by improving data quality and data provenance [4]. This improves results in the research projects as well as medical care for patients.

Funding: This work is partially funded by the VolkswagenStiftung within the MTB-Report project (ZN3424) and by the DFG within the CRU5002.

Acknowledgements: We would like to thank Kirsten Reuter-Jessen, Maximilian Papendick, Nicklas Lück, Elisabeth Heßmann and Marius Brunner for their support.

The authors declare that they have no competing interests.

The authors declare that a positive ethics committee vote has been obtained.


References

1. Chute CG, Ullman-Cullere M, Wood GM, Lin SM, He M, Pathak J. Some experiences and opportunities for big data in translational research. Genet Med. 2013 Oct;15(10):802–9.
2. Scheufele E, Aronzon D, Coopersmith R, McDuffie MT, Kapoor M, Uhrich CA, et al. tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform. AMIA Jt Summits Transl Sci Proc. 2014;2014:96–101.
3. Prasser F, Kohlbacher O, Mansmann U, Bauer B, Kuhn K. Data Integration for Future Medicine (DIFUTURE): An Architectural and Methodological Overview. Methods Inf Med. 2018 Jul;57(S 01):e57–65.
4. Parciak M, Bender T, Sax U, Bauer CR. Applying FAIRness: Redesigning a Biomedical Informatics Research Data Management Pipeline. Methods Inf Med. 2019 Dec;58(06):229–34.
5. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal. Sci Signal. 2013 Apr 2;6(269):pl1. DOI: 10.1126/scisignal.2004088
6. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov. 2012 May;2(5):401–4.
7. Wolstencroft K, Owen S, Krebs O, Nguyen Q, Stanford NJ, Golebiewski M, et al. SEEK: a systems biology data and model management platform. BMC Syst Biol. 2015 Dec;9(1):33.
8. Wolstencroft K, Krebs O, Snoep JL, Stanford NJ, Bacall F, Golebiewski M, et al. FAIRDOMHub: a repository and collaboration environment for sharing systems biology research. Nucleic Acids Res. 2017 Jan 4;45(D1):D404–7.
9. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3:160018.
10. Mons B, Neylon C, Velterop J, Dumontier M, da Silva Santos LOB, Wilkinson MD. Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inf Serv Use. 2017 Mar 7;37(1):49–56.
11. Pan B, Kusko R, Xiao W, Zheng Y, Liu Z, Xiao C, et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics. 2019 Mar;20(S2):101.