gms | German Medical Science

63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

2–6 September 2018, Osnabrück

Reproducibility in bioinformatics pipelines: the KNIME-Docker approach

Meeting Abstract

  • Blanca Flores - Universität Heidelberg, Heidelberg, Germany
  • Dirk Hose - Universitätsklinikum Heidelberg, Heidelberg, Germany
  • Anja Seckinger - Universitätsklinikum Heidelberg, Heidelberg, Germany
  • Petra Knaup - Universität Heidelberg, Heidelberg, Germany
  • Matthias Ganzinger - Universität Heidelberg, Heidelberg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 63. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Osnabrück, 02.-06.09.2018. Düsseldorf: German Medical Science GMS Publishing House; 2018. DocAbstr. 166

doi: 10.3205/18gmds089, urn:nbn:de:0183-18gmds0899

Published: 27 August 2018

© 2018 Flores et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.



Introduction: Analysis of genomic data in bioinformatics pipelines supporting diagnosis and treatment decisions is becoming common practice in research and clinical applications. The complexity of the processes and the changing nature of the software involved make reproducing results challenging and time-consuming [1]. Building on a software framework for reporting gene expression data from DNA microarrays [2], we established an RNA-Seq-based pipeline that supports decisions by summarizing risk scores for multiple myeloma patients in a semi-automatic report. This work is part of the project “Clinically applicable, omics-based assessment of survival, side effects and targets in multiple myeloma” (CLIOMMICS). In previous work, we discussed the challenges posed by different bioinformatics analysis software and reviewed approaches to support reproducibility [3]. In this study, we aimed to implement a reproducible pipeline following a hybrid approach.

Methods: Our four main workflow tasks were data retrieval, statistical data analysis, data querying, and visualization of results. We combined the advantages of two open-source software packages: KNIME [4], to model the pipeline as a graphical workflow, and Docker [5], to preserve the configuration of the software it depends on. We used the KNIME Analytics Platform to create the pipeline workflow by integrating nodes for software such as R, PostgreSQL and BIRT. The CSV Reader node handled data input at different stages. To execute the R script that calculates the risk scores, we used the R Snippet node provided by the community nodes. The R system itself resides in a Docker container that preserves a specific package configuration and can be shared with other users. For database querying, we used the nodes for PostgreSQL database management. Finally, to map the output data, we used the KNIME Report Designer, which is based on the Eclipse BIRT Designer.
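The four workflow tasks can be pictured as a chain of small stages. The following is only a minimal sketch in Python, not the KNIME workflow itself. It assumes a hypothetical CSV layout (`patient_id`, `gene`, `value`), a hypothetical weighted-sum score as a placeholder for the R script, an in-memory SQLite database in place of PostgreSQL, and plain text in place of the BIRT report.

```python
import csv
import io
import sqlite3


def read_expression(csv_text):
    """Stage 1, data retrieval: parse CSV text into (patient_id, gene, value) rows.

    The column names are hypothetical; the abstract does not specify them.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(r["patient_id"], r["gene"], float(r["value"])) for r in reader]


def risk_scores(rows, weights):
    """Stage 2, analysis: hypothetical weighted sum per patient.

    This is a placeholder for the R script, whose actual model is not shown here.
    """
    scores = {}
    for patient_id, gene, value in rows:
        scores[patient_id] = scores.get(patient_id, 0.0) + weights.get(gene, 0.0) * value
    return scores


def store_and_query(scores):
    """Stage 3, querying: in-memory SQLite stands in for PostgreSQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE risk (patient_id TEXT PRIMARY KEY, score REAL)")
    con.executemany("INSERT INTO risk VALUES (?, ?)", sorted(scores.items()))
    result = con.execute(
        "SELECT patient_id, score FROM risk ORDER BY patient_id"
    ).fetchall()
    con.close()
    return result


def render_report(result):
    """Stage 4, visualization: plain-text lines stand in for the BIRT report."""
    return "\n".join(f"{pid}: risk score {score:.2f}" for pid, score in result)


def run_pipeline(csv_text, weights):
    """Chain the four stages in the order described in the Methods."""
    rows = read_expression(csv_text)
    return render_report(store_and_query(risk_scores(rows, weights)))
```

Calling `run_pipeline` with a small CSV string and a dictionary of gene weights returns one report line per patient. Keeping each stage a separate function mirrors the node-per-task structure of a graphical workflow, so a single stage can be swapped without touching the others.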

Results: Following the design of an existing report, we created a template in the KNIME Report Designer with 28 data sets mapped as output data, including the risk scores. We queried the relevant data from 7 tables of our database and mapped them into the report. We uploaded the pipeline to KNIME Server version 4.3.2, which makes it accessible to other users and allows remote execution from the IMBI Liferay portal. When the workflow is executed, a PDF document is generated in the form of the CLIOMMICS report.

Discussion: KNIME allowed us to combine the data retrieval, analysis, and visualization steps and to integrate the necessary software, providing a unifying framework for data analysis and flow control [6]. This is especially important when modeling bioinformatics pipelines: even when different software is involved, version dependencies can be handled and old versions of workflows can be executed to produce the same results. Because software dependencies change over time with updates and new features, Docker allows us to keep images of software configurations, together with version management to keep track of changes and Dockerfiles that document installations and environment details. Although this is the first version of the pipeline modeled in KNIME, we expect that our approach will let us execute this and other workflows in controlled environments, supporting reproducibility in future implementations.
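One way to check that a rerun of a workflow version "produces the same results" is to compare fingerprints of its outputs and to record the environment they were produced in. The sketch below is an illustration under assumptions, not part of the CLIOMMICS pipeline: the function names are hypothetical, results are assumed to be JSON-serializable, and package versions are supplied by the caller rather than read from a Docker image.

```python
import hashlib
import json
import platform


def environment_manifest(packages):
    """Record the interpreter version plus caller-supplied package versions.

    `packages` maps package name to version string; names are sorted so the
    manifest is stable across runs.
    """
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def fingerprint(result):
    """Return the SHA-256 hex digest of a canonical JSON form of `result`.

    Sorting keys and fixing separators makes the digest independent of
    dictionary insertion order.
    """
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproduced(reference_fingerprint, new_result):
    """True if a rerun yields exactly the result that was fingerprinted earlier."""
    return fingerprint(new_result) == reference_fingerprint
```

A reference fingerprint stored alongside the manifest can later be compared against the output of a rerun in the preserved container. Exact hashing is strict: any change in a floating-point result changes the digest, so a tolerance-based comparison may be more suitable for numerical outputs.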

Acknowledgements: CLIOMMICS is funded by the German Ministry of Education and Research (e:Med initiative), grant ID 01ZX1609A.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Boettiger C. An introduction to Docker for reproducible research, with examples from the R environment. ACM SIGOPS Oper Syst Rev. 2014;49(1):71–9.
2. Meissner T, et al. Gene expression profiling in multiple myeloma - reporting of entities, risk and targets in clinical routine. Clin Cancer Res. 2011;17(23):7240–7.
3. Flores B, Hose D, Seckinger A, Knaup P, Ganzinger M. From bench to bedside: A view on bioinformatics pipelines. Stud Health Technol Inform. 2017;245:375–8.
4. Berthold M, et al. KNIME - The Konstanz information miner. SIGKDD Explor Newsl. 2009;11(1):26–31.
5. Docker Inc. “What is a container”. San Francisco; 2018.
6. Fillbrunn A, Dietz C, Pfeuffer J, Rahn R, Landrum GA, Berthold MR. KNIME for reproducible cross-domain analysis of life science data. J Biotechnol. 2017;261(Feb):149–56.