gms | German Medical Science

GMDS 2012: 57. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

16. - 20.09.2012, Braunschweig

Enabling big data driven biomedicine using scientific workflows and cloud computing

Meeting Abstract

  • Yassene Mohammed - Distributed Computing & Security (DCSec) Research Group and L3S, Leibniz Universität Hannover, Germany; High-Throughput Proteomics Group, Leiden University Medical Center, the Netherlands
  • Gabriele von Voigt - Distributed Computing & Security (DCSec) Research Group and L3S, Leibniz Universität Hannover, Germany
  • Matthew Smith - Distributed Computing & Security (DCSec) Research Group and L3S, Leibniz Universität Hannover, Germany

GMDS 2012. 57. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Braunschweig, 16.-20.09.2012. Düsseldorf: German Medical Science GMS Publishing House; 2012. Doc12gmds044

DOI: 10.3205/12gmds044, URN: urn:nbn:de:0183-12gmds0441

Published: September 13, 2012

© 2012 Mohammed et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.


Text

Introduction: Advances in biomedical instruments are rapidly reducing acquisition time and increasing data accuracy, and are consequently increasing the amount of data gathered significantly. This brings the big data challenges into the domain of computational medicine and bioinformatics. In this contribution we describe one methodology for handling big data in biomedicine, using off-the-shelf open source tools to outsource and manage computationally intensive tasks. We demonstrate this for high-throughput proteomics, bioinformatics, and image processing [1].

Material and Methods: Stepwise processing of data is a trait of many biomedical data analysis applications, which often require multiple pieces of software, algorithms, and data formats. The different steps require varying computational capacities that can depend on the results of previous steps. Grids have been used in the past to deal with big data; however, their cumbersome usage patterns create difficulties in the agile big data environment [2]. Although solutions such as Galaxy [3] and the EBI services [4] offer scientists access to on-demand tools, it is not trivial to include one's own algorithms in their processing pipelines.

The combination of cloud computing and scientific workflow engines [5] enables the connection of modular processing steps, the automation of analysis pipelines, and, importantly, the sharing of analyses in a reproducible way. It facilitates the orchestration of the decomposition, processing, and re-composition of big data between local and cloud resources. We have been using cloud resources available for research based on OpenNebula [6] together with the scientific workflow engine Taverna [7].
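The decompose–process–re-compose pattern can be sketched in a few lines. This is a minimal illustration only: `search_chunk` is a hypothetical stand-in for remote peptide identification, and a thread pool stands in for cloud workers; in our setup the actual orchestration is done by the workflow engine, not by hand-written code.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(scans, n_chunks):
    """Split the scan list into roughly equal chunks for parallel processing."""
    size, rest = divmod(len(scans), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        chunks.append(scans[start:end])
        start = end
    return chunks

def search_chunk(chunk):
    # Placeholder for one remote peptide-identification job on a chunk.
    return [f"id:{scan}" for scan in chunk]

def recompose(partial_results):
    """Merge per-chunk results back into one result list."""
    return [r for part in partial_results for r in part]

scans = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(search_chunk, decompose(scans, 4)))
print(recompose(partial))
```

The same three roles (split, per-chunk job, merge) map onto workflow components, so the chunk count can be tuned to the resources available at run time.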

Results: We used workflows both to embed the logic for data analysis and to manage access to the different resources. Our cloud can include any machine to which the workflow engine has SSH access, i.e. a server, a desktop, Amazon EC2, or OpenNebula instances. Our scientific workflows for data analysis in proteomics include local format conversions using proprietary software, remote peptide identification using open source search engines, and local or remote statistical analysis using R. For instance, using data decomposition and re-composition we searched 213,788 scans (1.3 GB mzXML) against the UniProt canonical sequence data of Homo sapiens, with an error window of −0.5 to 2.5 Da and dynamic carbamidomethyl modification. Using only idle campus cloud resources the analysis was completed in 12.5 minutes, a 26-fold speedup compared to a local run.
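The "any machine with SSH access" idea amounts to wrapping each tool invocation so it runs either locally or on a remote host. The helper and host name below are hypothetical, shown only to illustrate the idea; Taverna's own remote-execution activities handle this in practice.

```python
import shlex

def remote_command(host, tool, args):
    """Build a command line that runs `tool` on `host` via ssh.

    Hypothetical helper for illustration: host=None runs the tool locally,
    otherwise the invocation is shell-quoted and passed to ssh.
    """
    cmd = [tool] + list(args)
    if host is None:
        return cmd
    return ["ssh", host, " ".join(shlex.quote(part) for part in cmd)]

# Remote peptide search on a (hypothetical) campus worker node:
print(remote_command("worker1.campus", "xtandem", ["chunk03.mzXML"]))
# Local statistical post-processing with R:
print(remote_command(None, "Rscript", ["merge.R"]))
```

Because the wrapper is uniform, the same workflow can mix a desktop, a departmental server, and rented EC2 instances without changing the analysis logic.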

Discussion: Computational methods are struggling to keep pace with the torrent of data produced by biomedical instruments. Big data demands faster, more agile, scalable, and reliable solutions than local workstations or Grids can offer. Data analysis is increasingly problem specific, and solutions cannot be generalized. Our experiments show that developing specific ad-hoc solutions based on open source scientific workflow engines and cloud infrastructures is a flexible and convenient approach to big data driven science. In our setup all data is encrypted during transport between the different campus cloud resources. We are currently conducting trials with homomorphic cryptography in order to compute directly on encrypted data [8]. This will allow us to extend the cloud to untrusted resources, since the data and the algorithms are never decrypted and are thus safe throughout the entire process.
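As a toy illustration of the principle of computing on encrypted data (not the scheme evaluated in [8]), textbook RSA is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. The parameters below are deliberately tiny and insecure, chosen only to make the property visible.

```python
# Textbook RSA with toy parameters: E(a) * E(b) mod n decrypts to a * b.
p, q = 61, 53
n = p * q                            # modulus, 3233
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 6, 7
product_cipher = (enc(a) * enc(b)) % n   # computed on ciphertexts only
assert dec(product_cipher) == a * b      # 42, recovered without ever decrypting a or b
```

An untrusted worker holding only `enc(a)` and `enc(b)` could thus contribute the multiplication; fully homomorphic schemes extend this idea to arbitrary computations, at a far higher cost.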


References

1.
Mohammed Y, Shahand S, Korkhov V, Luyf ACM, van Schaik BDC, Caan MWA, van Kampen AHC, Palmblad M, Olabarriaga SD. Data Decomposition in Biomedical e-Science Applications. In: IEEE 7th International Conference on e-Science (e-Science 2011); 2011 Dec 5-8; Stockholm, Sweden.
2.
Mohammed Y, Sax U, Dickman F, Lippert J, Solodenko J, von Voigt G, Smith M, Rienhoff O. On transferring the grid technology to the biomedical community. Stud Health Technol Inform. 2010;159:28-39.
3.
Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11(8):R86. DOI: 10.1186/gb-2010-11-8-r86
4.
Web Services at the EBI [Internet]. Available from: www.ebi.ac.uk/Tools/webservices/ [cited 2012 Apr 15].
5.
Barker A, van Hemert J. Scientific Workflow: A Survey and Research Directions. In: Parallel Processing and Applied Mathematics. Berlin/Heidelberg: Springer; 2008. p. 746-53.
6.
Sotomayor B, Montero RS, Llorente IM, Foster I. An Open Source Solution for Virtual Infrastructure Management in Private and Hybrid Clouds. IEEE Internet Computing. 2009;13(5):14-22.
7.
Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004;20(17):3045-54. DOI: 10.1093/bioinformatics/bth361
8.
Brenner M, Wiebelitz J, von Voigt G, Smith M. Secret Program Execution in the Cloud Applying Homomorphic Encryption. In: 5th IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST 2011); 2011 May 31-Jun 3; Daejeon. p. 114-9.