gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Unlocking OpenData Value: Utilizing the American Gut Project Data Test Bed

Meeting Abstract

  • Alexander Birkenkamp - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Christian R. Bauer - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Theresa Bender - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Cornelius Knopp - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany
  • Ulrich Sax - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Göttingen, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 198

doi: 10.3205/19gmds040, urn:nbn:de:0183-19gmds0406

Published: September 6, 2019

© 2019 Birkenkamp et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at



Introduction: The crowd-funded American Gut Project (AGP), started in November 2012, investigates microbiomes in and on the human body (e.g., gut, mouth, and skin) and their relationship to the participants' characteristics [1]. Citizens all over the world are encouraged to submit tissue samples together with demographic, dietary, lifestyle, hygiene, and health information obtained from a questionnaire. By early 2017, more than 10,000 participants had contributed to the study [2]. The collected data are publicly available, e.g., on MGnify [3], a data resource and analysis platform of the European Molecular Biology Laboratory (EMBL) for (metagenomic) sequence data, and on BioSamples [4], a data resource for sample-related metadata. Unlike Qiita [5], a data source mentioned in [1], these sources do not require prior user registration and therefore facilitate open access to the data.

To boost the quality and effectiveness of a local research project, we aimed to give scientists easy access to a large pool of topical data. This data pool should be usable for generating hypotheses and as a source for comparison with, and extension of, new data created locally.

Methods: We wanted to use our established local infrastructure approach [6] and therefore chose an i2b2/tranSMART database as our data pool platform. Data from the AGP were retrieved and cleaned with a Python script (Python version 3.5.2) run on an Ubuntu 16.04.3 LTS server. All proband-related phenotype data, data about the samples used, and metagenomic data were requested in JSON format via the RESTful APIs of BioSamples and MGnify by four worker processes in about 200 minutes. Relevant attributes were identified based on a questionnaire and a data dictionary supplemented by [1]. Probands with missing age, height, weight, or BMI were excluded. The cleansed data set was then transformed into the transmart-batch format to load it into a local dockerized tranSMART 16.2 instance.
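The exclusion rule described above can be sketched as follows. This is a minimal illustration, not the authors' script: the field names (`age`, `height`, `weight`, `bmi`) and the placeholder strings treated as missing are assumptions made for the example, and the real attribute keys in the AGP metadata may differ.

```python
# Hypothetical field names; the real attribute keys in the AGP metadata may differ.
REQUIRED_FIELDS = ("age", "height", "weight", "bmi")

# Assumed placeholder strings that indicate a missing answer.
MISSING_MARKERS = {"", "unspecified", "unknown", "not provided", "na", "nan"}


def is_missing(value):
    """Return True if a value is absent or an assumed placeholder for 'no answer'."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_MARKERS


def split_by_completeness(records):
    """Split records into (kept, excluded) based on the required fields."""
    kept, excluded = [], []
    for record in records:
        if any(is_missing(record.get(field)) for field in REQUIRED_FIELDS):
            excluded.append(record)
        else:
            kept.append(record)
    return kept, excluded


# Toy records for illustration only; these are not AGP data.
probands = [
    {"id": "A", "age": 34, "height": 172, "weight": 68, "bmi": 23.0},
    {"id": "B", "age": "Unspecified", "height": 180, "weight": 80, "bmi": 24.7},
    {"id": "C", "age": 51, "height": 165, "weight": 70},  # BMI absent
]
kept, excluded = split_by_completeness(probands)
print([p["id"] for p in kept], [p["id"] for p in excluded])
```

Keeping the excluded records in a separate list, rather than dropping them silently, makes it straightforward to report how many subjects were removed by the rule.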

Results: In total, 8,245 data sets for individual subjects were integrated. Some of these were multiple entries for the same subject, resulting from different submitted sample types. After merging these sets, our data pool consisted of data from 6,620 distinct individuals of the AGP. To provide a minimal set of phenotype data for selection and querying, subjects with the missing values stated above were excluded (1,013 subjects). This resulted in 5,607 probands with 148 items each for further analysis in our data pool.
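The merge step and the reported counts can be illustrated with a small sketch. It assumes that each retrieved data set carries a subject identifier (here the hypothetical key `subject_id`) and a sample type (hypothetical key `sample_type`). How conflicting values between entries were resolved is not stated in this abstract, so the example simply keeps the first non-empty value per attribute.

```python
from collections import OrderedDict


def merge_by_subject(entries, id_key="subject_id"):
    """Merge entries that share a subject ID into one record per subject.

    Assumption: the first non-empty value seen for an attribute wins;
    sample types are collected into a list.
    """
    merged = OrderedDict()
    for entry in entries:
        subject = merged.setdefault(
            entry[id_key], {id_key: entry[id_key], "sample_types": []}
        )
        for key, value in entry.items():
            if key == id_key:
                continue
            if key == "sample_type":
                if value not in subject["sample_types"]:
                    subject["sample_types"].append(value)
            elif subject.get(key) in (None, ""):
                subject[key] = value
    return list(merged.values())


# Toy entries for illustration only; these are not AGP data.
entries = [
    {"subject_id": "s1", "sample_type": "stool", "age": 40},
    {"subject_id": "s1", "sample_type": "skin", "age": 40},
    {"subject_id": "s2", "sample_type": "stool", "age": ""},
    {"subject_id": "s2", "sample_type": "mouth", "age": 29},
]
subjects = merge_by_subject(entries)
print(len(entries), "entries ->", len(subjects), "subjects")

# Consistency check on the counts reported in the text:
# 6,620 merged individuals minus 1,013 excluded leaves 5,607 probands.
assert 6620 - 1013 == 5607
```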

Discussion: Cleansing and integrating the AGP data is an ongoing process with many aspects still under development. Further data cleaning is required because the data set contains a small number of implausible values and attributes without specified units. Nevertheless, this data set offers an excellent showcase for testing and demonstrating data discovery and plausibility tools. One reason we selected the i2b2/tranSMART platform was the ability to create joint and comparison queries on data sets from different sources in order to catalyse and test customized toolboxes like our extended microbiome workflow [7]. Both will be promoted in our next step: engaging with local scientists to use the local data pool.

This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the research and funding concepts e:Med (01ZX1606C/sysINFLAME) and i:DSem (031L0024A/MyPathSem).

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


[1] McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen Y, Goldasich LD. American gut: an open platform for citizen science microbiome research. mSystems. 2018 Jun 26;3(3):e00031-18. DOI: 10.1128/mSystems.00031-18
[2] Americangut. [Accessed 16 July 2019]. Available from:
[3] Mitchell AL, Scheremetjew M, Denise H, Potter S, Tarkowska A, Qureshi M, Salazar GA, Pesseat S, Boland MA, Hunter FM, ten Hoopen P. EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies. Nucleic acids research. 2017;46(D1):D726-35. DOI: 10.1093/nar/gkx967
[4] Courtot M, Cherubin L, Faulconbridge A, Vaughan D, Green M, Richardson D, Harrison P, Whetzel PL, Parkinson H, Burdett T. BioSamples database: an updated sample metadata hub. Nucleic acids research. 2018 Nov 8;47(D1):D1172-8. DOI: 10.1093/nar/gky1061
[5] Gonzalez A, Navas-Molina JA, Kosciolek T, McDonald D, Vázquez-Baeza Y, Ackermann G, DeReus J, Janssen S, Swafford AD, Orchanian SB, Sanders JG. Qiita: rapid, web-enabled microbiome meta-analysis. Nature methods. 2018 Oct;15(10):796. DOI: 10.1038/s41592-018-0141-9
[6] Bauer CR, Umbach N, Baum B, Buckow K, Franke T, Grütz R, Gusky L, Nussbeck SY, Quade M, Rey S, Rottmann T. Architecture of a Biomedical Informatics Research Data Management Pipeline. In: MIE 2016; 2016 Sep 22; Munich; 2016. p. 262-266. DOI: 10.3233/978-1-61499-678-1-262
[7] Bauer CR, Knecht C, Fretter C, Baum B, Jendrossek S, Rühlemann M, Heinsen FA, Umbach N, Grimbacher B, Franke A, Lieb W. Interdisciplinary approach towards a systems medicine toolbox using the example of inflammatory diseases. Briefings in bioinformatics. 2017 May 1;18(3):479-87. DOI: 10.1093/bib/bbw024