Reproducible bioinformatics workflows: A case study with software containers and interactive notebooks
Published: February 26, 2021
Background: Specifying bioinformatics workflows reproducibly is challenging given their complexity. We developed a new statistical method in the field of circadian rhythmicity that allows one to rigorously determine whether measured quantities such as gene expression are not rhythmic. The statistical method itself was implemented in the R package “HarmonicRegression”, available on the CRAN repository. However, the bioinformatics workflow is much larger than the statistical test. For instance, to ensure the applicability and validity of the statistical method, we simulated data sets of 20,000 gene expression profiles over two days, covering a large range of parameter combinations (e.g. sampling interval, fraction of rhythmic genes, number of outliers, detection limit of rhythmicity).
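To illustrate the kind of simulation described above, the following is a minimal Python sketch, not the authors' workflow and not the API of the “HarmonicRegression” R package. It generates synthetic expression time series over two days and fits a single 24-hour harmonic by ordinary least squares. All function names, parameter names, and default values (`simulate_expression`, `fit_amplitude`, `amplitude`, `noise_sd`, the outlier size, and so on) are assumptions chosen for illustration; the detection limit of rhythmicity mentioned above is not modelled here.

```python
import math
import random


def simulate_expression(n_genes=1000, duration_h=48.0, interval_h=2.0,
                        rhythmic_fraction=0.3, amplitude=0.5, noise_sd=0.1,
                        outlier_fraction=0.01, period_h=24.0, seed=0):
    """Simulate expression profiles; all defaults are illustrative assumptions.

    Returns (times, profiles, is_rhythmic). The first
    round(rhythmic_fraction * n_genes) genes carry a cosine with a random phase.
    """
    rng = random.Random(seed)
    n_points = int(round(duration_h / interval_h))
    times = [i * interval_h for i in range(n_points)]
    n_rhythmic = int(round(rhythmic_fraction * n_genes))
    profiles, is_rhythmic = [], []
    for g in range(n_genes):
        rhythmic = g < n_rhythmic
        phase = rng.uniform(0.0, 2.0 * math.pi)
        profile = []
        for t in times:
            value = 1.0 + rng.gauss(0.0, noise_sd)  # baseline plus noise
            if rhythmic:
                value += amplitude * math.cos(2.0 * math.pi * t / period_h - phase)
            if rng.random() < outlier_fraction:
                # Hypothetical outlier model: a shift of ten noise SDs.
                value += rng.choice((-1.0, 1.0)) * 10.0 * noise_sd
            profile.append(value)
        profiles.append(profile)
        is_rhythmic.append(rhythmic)
    return times, profiles, is_rhythmic


def fit_amplitude(times, values, period_h=24.0):
    """Least-squares amplitude of y = m + a*cos(wt) + b*sin(wt)."""
    n = len(times)
    w = 2.0 * math.pi / period_h
    cos_col = [math.cos(w * t) for t in times]
    sin_col = [math.sin(w * t) for t in times]
    # Centering all columns absorbs the intercept m.
    mc, ms, my = sum(cos_col) / n, sum(sin_col) / n, sum(values) / n
    c = [x - mc for x in cos_col]
    s = [x - ms for x in sin_col]
    y = [x - my for x in values]
    scc = sum(x * x for x in c)
    sss = sum(x * x for x in s)
    scs = sum(p * q for p, q in zip(c, s))
    scy = sum(p * q for p, q in zip(c, y))
    ssy = sum(p * q for p, q in zip(s, y))
    det = scc * sss - scs * scs
    if abs(det) < 1e-12:
        raise ValueError("sampling scheme cannot resolve this period")
    # Solve the 2x2 normal equations for the cosine and sine coefficients.
    a = (scy * sss - ssy * scs) / det
    b = (ssy * scc - scy * scs) / det
    return math.hypot(a, b)
```

Calling `simulate_expression` with different values for `interval_h`, `rhythmic_fraction`, or `outlier_fraction` and passing each profile to `fit_amplitude` gives a rough sense of how such a parameter sweep could be organised; a rigorous test for the absence of rhythmicity requires the statistical method itself.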
Methods: We describe and demonstrate the use of Jupyter notebooks to document, specify, and distribute our statistical method and its application to both simulated and experimental data sets. Jupyter notebooks combine text documentation with dynamically editable and executable code.
Results: Parameters and code can thus be modified dynamically, allowing both verification of results and instant experimentation. The notebook runs inside a Docker software container, which mirrors the original software environment and avoids the need to install any additional software.
Conclusion: The Docker container and the Jupyter notebook will be available on GitHub, accompanying our paper, a preprint of which is available on bioRxiv. This framework ensures complete long-term reproducibility of the workflow.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.