Article
Quality control of next generation sequencing data using random forests as probability machines
Published: 27 August 2013
Introduction: Next generation DNA sequencing technologies are a promising tool for identifying rare genetic variants controlling susceptibility to complex diseases. Sequencing many individuals typically identifies thousands of rare variants, some of which are sequencing or alignment artifacts. Quality control to detect low-quality variants is therefore mandatory. However, research is still needed to determine the best quality control algorithms.
Material and methods: The standard output of a variant calling pipeline is a list of identified genetic variants along with several quality characteristics. Instead of hard filtering based on specific variables and thresholds, a machine learning algorithm such as random forests can be trained on these per-variant measurements to predict the probability that a variant is of low quality. Training data are generated by comparing genotypes of control samples in our study with external data to detect mismatches. We evaluate our new approach on a whole exome sequencing (WES) study of oral clefts and a whole genome sequencing (WGS) study of an Amish family. Both data sets include several HapMap samples as controls.
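The training idea described above can be illustrated with a minimal sketch. It assumes that "probability machine" means a regression forest fitted to 0/1 labels, so that the averaged leaf means act as probability estimates; this is an assumption for illustration and not the authors' implementation. The helper names, the two quality features (mapping quality and read depth), and all toy data are hypothetical.

```python
import random

def label_mismatches(variants, study_calls, reference_calls):
    """1 = control genotype disagrees with the external call (treated as low quality), 0 = agrees."""
    return [int(study_calls[v] != reference_calls[v]) for v in variants]

def build_tree(rows, labels, rng, min_leaf=5, n_try=1):
    """Grow a regression tree on 0/1 labels; each leaf stores the mean label of its rows."""
    mean = sum(labels) / len(labels)
    if len(rows) < 2 * min_leaf or mean in (0.0, 1.0):
        return mean
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):  # random feature subset per split
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            pl, pr = sum(left) / len(left), sum(right) / len(right)
            # Squared error of 0/1 labels around the side mean p is n * p * (1 - p).
            sse = len(left) * pl * (1 - pl) + len(right) * pr * (1 - pr)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:
        return mean
    _, f, t = best
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in lo], [labels[i] for i in lo], rng, min_leaf, n_try),
            build_tree([rows[i] for i in hi], [labels[i] for i in hi], rng, min_leaf, n_try))

def tree_predict(node, row):
    """Follow splits until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(rows, labels, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample of the training variants."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict_proba(forest, row):
    """Average the leaf means over all trees: the estimated probability of low quality."""
    return sum(tree_predict(tree, row) for tree in forest) / len(forest)

# Toy data (hypothetical): features are (mapping quality, read depth) per variant.
variants = [f"var{i:02d}" for i in range(40)]
features = {v: (20.0 + i, 10.0 + (i * 7) % 30) for i, v in enumerate(variants)}
reference_calls = {v: "0/1" for v in variants}  # stands in for external control genotypes
study_calls = {v: ("1/1" if features[v][0] < 35 else "0/1") for v in variants}

labels = label_mismatches(variants, study_calls, reference_calls)
forest = fit_forest([features[v] for v in variants], labels)
p_bad = predict_proba(forest, (22.0, 20.0))   # low mapping quality
p_good = predict_proba(forest, (55.0, 20.0))  # high mapping quality
```

In this toy setup the mismatches are constructed to coincide with low mapping quality, so the forest should assign a higher probability to the first query than to the second. In a real study the labels would come from control samples only, and the fitted model would then be applied to all called variants.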
Results: In the WES study, most variants with a high estimated probability of being low quality are also flagged by another quality control algorithm. For variants detected as problematic by only one of the two methods, we observe differences in several quality characteristics; for example, our method flags variants with lower mapping quality. WGS analyses are still ongoing, and results will be presented at the workshop.
Discussion: Random forests are a promising approach for evaluating variant quality in next generation sequencing studies. The estimated probabilities can be used to rank interesting variants instead of hard filtering, and a combination with other quality control algorithms might be used to identify variants of very low quality.
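As a small illustration of ranking versus hard filtering, the sketch below orders variants by their estimated probability of low quality and then combines that estimate with flags from a second method. The variant names, probabilities, flags, and the 0.5 cutoff are all hypothetical, and requiring agreement between both methods is only one possible way to combine them.

```python
# Hypothetical per-variant probabilities of low quality and flags from a second QC method.
prob_low_quality = {"varA": 0.92, "varB": 0.10, "varC": 0.55, "varD": 0.03}
flagged_by_other_qc = {"varA", "varC"}

# Rank variants from most to least trustworthy instead of discarding any of them.
ranked = sorted(prob_low_quality, key=prob_low_quality.get)

# One possible combination: call a variant very low quality only if both methods agree.
threshold = 0.5  # illustrative cutoff, not taken from the study
very_low_quality = sorted(
    v for v, p in prob_low_quality.items()
    if p >= threshold and v in flagged_by_other_qc
)
```

Ranking keeps every variant available for follow-up while still pushing likely artifacts to the end of the list.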