gms | German Medical Science

GMDS 2013: 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

1-5 September 2013, Lübeck

Quality control of next generation sequencing data using random forests as probability machines

Meeting Abstract

  • Silke Szymczak - Christian-Albrechts-Universität zu Kiel, Kiel, DE; National Human Genome Research Institute, NIH, Baltimore, US
  • Hua Ling - Center for Inherited Disease Research, Johns Hopkins University, Baltimore, US
  • Terri H Beaty - Department of Epidemiology, Johns Hopkins University, Baltimore, US
  • Joan E Bailey-Wilson - National Human Genome Research Institute, NIH, Baltimore, US

GMDS 2013. 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Lübeck, 01.-05.09.2013. Düsseldorf: German Medical Science GMS Publishing House; 2013. DocAbstr.312

doi: 10.3205/13gmds312, urn:nbn:de:0183-13gmds3127

Published: 27 August 2013

© 2013 Szymczak et al.
This is an open access article distributed under the terms of the Creative Commons license (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed and made publicly available, provided that the author and source are credited.


Text

Introduction: Next generation DNA sequencing technologies are promising tools for identifying rare genetic variants that control susceptibility to complex diseases. Sequencing many individuals usually identifies thousands of rare variants, some of which are sequencing or alignment artifacts. Quality control to detect low-quality variants is therefore mandatory. However, research is still needed to determine the best quality control algorithms.

Material and methods: The standard output of a variant calling pipeline is a list of identified genetic variants along with several quality characteristics. Instead of hard filtering based on specific variables and thresholds, a machine learning algorithm such as random forests can be trained on these measurements to predict, for each variant, the probability that it is of low quality. Training data are generated by comparing the genotypes of control samples in our study with external data to detect mismatches. We evaluate our new approach on a whole exome sequencing (WES) study of oral clefts and a whole genome sequencing (WGS) study in an Amish family. Both data sets include several HapMap samples as controls.
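
The idea of a forest used as a probability machine can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline used in the study: it assumes one common construction, in which regression trees are grown on 0/1 labels (1 = genotype mismatch in a control sample) and the mean labels in the terminal nodes are averaged over all trees. The two features (mapping quality and read depth), the synthetic data and all parameter values are assumptions made for the example.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean label."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(rows, labels, n_try, min_node, rng):
    """Grow a regression tree on 0/1 labels; a leaf stores the mean label."""
    if len(rows) <= min_node or len(set(labels)) == 1:
        return sum(labels) / len(labels)
    best = None
    for j in rng.sample(range(len(rows[0])), n_try):  # random feature subset
        for t in sorted({r[j] for r in rows})[:-1]:   # both sides stay non-empty
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # the sampled features were constant in this node
        return sum(labels) / len(labels)
    _, j, t = best
    lo = [i for i, r in enumerate(rows) if r[j] <= t]
    hi = [i for i, r in enumerate(rows) if r[j] > t]
    return (j, t,
            build_tree([rows[i] for i in lo], [labels[i] for i in lo], n_try, min_node, rng),
            build_tree([rows[i] for i in hi], [labels[i] for i in hi], n_try, min_node, rng))

def predict_tree(node, row):
    """Follow the splits down to a leaf and return its mean label."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node

def fit_forest(rows, labels, n_trees=25, n_try=1, min_node=10, seed=1):
    """Fit each tree on a bootstrap sample of the labelled variants."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, min_node, rng))
    return forest

def predict_proba(forest, row):
    """Estimated probability of low quality: average of the tree predictions."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)

# Synthetic training data (assumption): low mapping quality makes a mismatch likely.
data_rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    mapping_quality = data_rng.uniform(20, 60)
    depth = data_rng.uniform(5, 100)
    p_bad = 0.8 if mapping_quality < 35 else 0.05
    rows.append((mapping_quality, depth))
    labels.append(1 if data_rng.random() < p_bad else 0)

forest = fit_forest(rows, labels)
p_low_mq = predict_proba(forest, (25.0, 50.0))   # variant with low mapping quality
p_high_mq = predict_proba(forest, (55.0, 50.0))  # variant with high mapping quality
print(round(p_low_mq, 2), round(p_high_mq, 2))
```

Because each tree returns a fraction between 0 and 1, the averaged prediction can be read directly as an estimated probability rather than as a majority vote.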

Results: In the WES study, most of the variants with a high estimated probability of being of low quality are also flagged by another quality control algorithm. We observe differences in several quality characteristics for variants that are detected as problematic by only one of the two methods, e.g. our method flags variants with lower mapping quality. WGS analyses are still ongoing and results will be presented at the workshop.
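
A comparison of this kind can be sketched as a cross-tabulation of the two sets of flags, followed by a summary of one quality characteristic among the discordant variants. The records below are invented for illustration only and are not results from the study; the tuple layout and the function name `mean_mq` are likewise assumptions.

```python
from collections import Counter

# Hypothetical records:
# (variant id, flagged by forest, flagged by other method, mapping quality)
variants = [
    ("v1", True, True, 22.0),
    ("v2", True, False, 30.0),
    ("v3", True, False, 34.0),
    ("v4", False, True, 52.0),
    ("v5", False, True, 56.0),
    ("v6", False, False, 59.0),
]

# 2x2 table of agreement between the two methods.
table = Counter((rf, other) for _, rf, other, _ in variants)

def mean_mq(rf_flag, other_flag):
    """Mean mapping quality of variants with the given pair of flags."""
    values = [mq for _, rf, other, mq in variants
              if (rf, other) == (rf_flag, other_flag)]
    return sum(values) / len(values)

only_forest_mq = mean_mq(True, False)  # flagged by the forest only
only_other_mq = mean_mq(False, True)   # flagged by the other method only
print(table[(True, True)], only_forest_mq, only_other_mq)
```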

Discussion: Random forests are a promising approach for evaluating the quality of variants in next generation sequencing studies. The estimated probabilities can be used to rank variants of interest instead of applying hard filters, and a combination with other quality control algorithms might be used to identify variants of very low quality.
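
Ranking and combination can be illustrated with a short sketch. The probabilities, the flags from a second method and the cut-off of 0.9 are hypothetical values chosen for the example; the abstract does not specify any of them.

```python
# Hypothetical estimated probabilities of low quality for candidate variants.
prob_low_quality = {"v1": 0.91, "v2": 0.08, "v3": 0.47, "v4": 0.02}

# Hypothetical set of variants flagged by a second quality control method.
flagged_by_other = {"v1", "v3"}

# Rank instead of filtering: most trustworthy variants first.
ranked = sorted(prob_low_quality, key=prob_low_quality.get)

# Combine both methods to single out variants of very low quality
# (the 0.9 cut-off is an assumption for illustration).
very_low_quality = [v for v in ranked
                    if prob_low_quality[v] >= 0.9 and v in flagged_by_other]
print(ranked, very_low_quality)
```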