gms | German Medical Science

GMDS 2013: 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

1-5 September 2013, Lübeck

Specific identification of small genomic structural variations using next generation sequencing data

Meeting Abstract


  • Matthias Kuhn - TU Dresden, Dresden, DE
  • Ingo Röder - TU Dresden, Dresden, DE

GMDS 2013. 58. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Lübeck, 01.-05.09.2013. Düsseldorf: German Medical Science GMS Publishing House; 2013. DocAbstr.196

doi: 10.3205/13gmds182, urn:nbn:de:0183-13gmds1825

Published: 27 August 2013

© 2013 Kuhn et al.
This article is an open access article distributed under the Creative Commons license terms (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.de). It may be reproduced, distributed and made publicly available, provided that the author and source are credited.



Text

Introduction: Next-generation sequencing (NGS) promises to unravel genetic variability with low bias and hence to advance our understanding of, e.g., tumorigenesis. Changes of single nucleotides (SNVs) and insertions/deletions of up to 50 nucleotides (indels) form the best-known source of genetic variation. Structural variations (SVs) involving larger stretches of DNA are the second major form of genomic variation [1]. Well-established methods exist that call SNVs and indels by mapping NGS reads to a reference genome and comparing them with it. Methods have also been developed to identify SVs. In contrast to SNV calling, those methods mainly use reads that could not be properly mapped to the reference genome. However, most SV calling methods are tuned to find rather large SVs. Hence, there is a grey zone of genetic variants between SNVs and SVs that may not be detected by any of these methods [2]. We therefore describe a method that specifically finds genomic variation in the range of four to 150 nucleotides in NGS data.

Material and Methods: The proposed method relies on an NGS read mapper that is able to map reads partially, generating clipped reads and multi-part alignments. We used the recently released BWA-MEM algorithm [3] and call small SVs in a two-step procedure. First, we identify candidate positions for a small SV by looking for an increase in the rate of clipped reads. At those candidate positions, we extract further features that characterize local mapping patterns, for instance read number, mapping quality, insert size, or frequency of multi-part alignments. Based on those features, a classifier then calls the small SV. The classifier was trained on simulated NGS sequence data with and without small SVs.
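The first step, flagging positions where clipped reads accumulate, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed thresholds `min_clipped` and `min_rate`, and the toy reads are assumptions made for illustration, and a simple rate threshold stands in for whatever criterion the method uses to detect an "increase" in clipping. Reads are given as pairs of the SAM `POS` field (1-based) and a CIGAR string.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string such as '30M20S' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def aligned_end(pos, ops):
    """Last reference position (1-based, inclusive) covered by the alignment."""
    return pos + sum(n for n, op in ops if op in REF_CONSUMING) - 1


def clip_breakpoints(pos, cigar):
    """Reference positions at which the read is soft- or hard-clipped."""
    ops = parse_cigar(cigar)
    points = []
    if ops and ops[0][1] in "SH":
        points.append(pos)  # clipped on the left: first aligned base
    if len(ops) > 1 and ops[-1][1] in "SH":
        points.append(aligned_end(pos, ops))  # clipped on the right: last aligned base
    return points


def candidate_positions(reads, min_clipped=3, min_rate=0.2):
    """Positions where clipped reads pile up (hypothetical thresholds)."""
    clipped, coverage = Counter(), Counter()
    for pos, cigar in reads:
        ops = parse_cigar(cigar)
        end = aligned_end(pos, ops)
        if end < pos:
            continue  # no aligned bases, nothing to count
        for p in range(pos, end + 1):
            coverage[p] += 1
        for p in clip_breakpoints(pos, cigar):
            clipped[p] += 1
    return sorted(
        p for p, n in clipped.items()
        if n >= min_clipped and n / coverage[p] >= min_rate
    )


# Toy data: six fully mapped reads plus reads clipped on either side of a breakpoint.
reads = [(100, "50M")] * 6 + [(120, "30M20S")] * 3 + [(150, "20S30M")] * 3
print(candidate_positions(reads))
```

In a real pipeline the `(POS, CIGAR)` pairs would come from the mapper's SAM/BAM output, and the features named above (read number, mapping quality, insert size, frequency of multi-part alignments) would then be collected around each candidate position for the classifier.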

Results: We simulated Illumina NGS reads from certain regions of the human genome with and without small SVs. Using clipped reads to find candidate regions for small SVs turned out to be sensitive when the coverage by mutated reads is at least 25. From the simulations we further extracted features of those reads that mapped to the candidate positions in order to train a classifier for small SVs. The model showed high sensitivity and specificity.
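For reference, sensitivity and specificity on simulated data follow from comparing the classifier's calls with the known simulation truth at each candidate position. The sketch below uses made-up labels purely to show the arithmetic; it does not reproduce any result of this study, and the function name is an assumption.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for paired boolean labels.

    truth[i] is True if candidate i really carries a small SV (known from
    the simulation); calls[i] is True if the classifier called one.
    Assumes both classes occur at least once in `truth`.
    """
    pairs = list(zip(truth, calls))
    tp = sum(t and c for t, c in pairs)
    fn = sum(t and not c for t, c in pairs)
    tn = sum(not t and not c for t, c in pairs)
    fp = sum(not t and c for t, c in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for eight candidate positions (illustration only).
truth = [True, True, True, False, False, False, False, True]
calls = [True, True, False, False, False, True, False, True]
sens, spec = sensitivity_specificity(truth, calls)
print(sens, spec)
```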

Discussion: We present a method to detect genomic variation in the grey zone between SNVs and SVs, ranging from four to 150 nucleotides. The method uses clipped reads to pinpoint small SV candidate positions and subsequently applies a classifier based on features of the local reads. This combination of clipped reads and multi-part alignments represents, to the best of our knowledge, a new approach to small SV calling. For now, the results are limited because we simulated only insertion and duplication events. Extension to further types of SVs is pending. Also, a sensitivity analysis of the effect of key parameters of sequencing (enrichment, coverage), mapping (penalty scores) and the classifier remains to be done. The method could be extended to call differential genomic variants when comparing tumor and healthy samples.


References

1. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73.
2. Chiara M, Pesole G, Horner DS. SVM2: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic Acids Res. 2012 Oct;40(18):e145.
3. Li H. BWA [Internet]. Available from: http://bio-bwa.sourceforge.net/