Article
Testing methods to analyze small sample size GWAS with different settings
Published: February 26, 2021
Text
Background: Because effect sizes are small and the number of tested single-nucleotide polymorphisms (SNPs) is huge, genome-wide association studies (GWAS) require very large sample sizes to identify genetic variants associated with a genetic disease. Often, however, only fairly small samples are available. To determine the best way to identify SNPs and indels when comparing different experimental conditions (for example, tumor vs. matched normal samples) with only a small sample size, different analytical tools were tested on simulated data. Power and false discovery rate were calculated at several threshold levels.
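As an illustration of how power and false discovery rate can be evaluated at several thresholds when the truly associated variants are known from the simulation, consider the following minimal sketch. The function name `power_and_fdr`, the SNP identifiers, and the toy p-values are assumptions made for this example and are not taken from the study. The value returned as FDR is the false discovery proportion observed in one simulated data set; an FDR estimate would average it over simulation replicates.

```python
def power_and_fdr(scores, causal, threshold, larger_is_better=False):
    """Return (power, false discovery proportion) for one threshold.

    scores: dict mapping SNP id -> p-value (or inclusion frequency)
    causal: set of SNP ids simulated as disease-associated
    larger_is_better: False for p-values, True for inclusion frequencies
    """
    if larger_is_better:
        selected = {snp for snp, s in scores.items() if s >= threshold}
    else:
        selected = {snp for snp, s in scores.items() if s <= threshold}
    true_positives = len(selected & causal)
    # Power: fraction of the simulated causal SNPs that were selected.
    power = true_positives / len(causal) if causal else 0.0
    # False discovery proportion: fraction of selected SNPs that are not causal.
    fdr = (len(selected) - true_positives) / len(selected) if selected else 0.0
    return power, fdr


# Toy example (hypothetical values, not results from the study).
p_values = {"rs1": 1e-8, "rs2": 0.03, "rs3": 0.2, "rs4": 1e-5, "rs5": 0.6}
causal = {"rs1", "rs2"}
for threshold in (5e-8, 1e-4, 0.05):
    print(threshold, power_and_fdr(p_values, causal, threshold))
```

Relaxing the threshold in this toy example raises power but also admits the non-causal SNP `rs4`, which is the trade-off that such a threshold comparison makes visible.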
Methods: Whole-genome sequencing data of germline DNA were simulated for leukemia patients and unaffected controls by resampling data from the 1000 Genomes Project using HAPGEN2 [1]. Known leukemia SNPs were simulated across a wide range of odds ratio (OR) values. Different methods, such as rvtests [2], logistic regression, and likelihood-based boosting [3], were employed to analyze the simulated data.
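To make the role of the odds ratio in such a simulation concrete, the sketch below draws genotypes for a single SNP in cases and controls under a multiplicative allelic model with Hardy-Weinberg proportions. It is a deliberately simplified stand-in and not the HAPGEN2 procedure, which resamples reference haplotypes and preserves linkage disequilibrium. All function names, the control allele frequency of 0.2, and the group size of 50 are assumptions chosen for illustration.

```python
import random


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases under a multiplicative allelic model:
    the allele odds in cases equal odds_ratio times the odds in controls."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def simulate_genotypes(n, allele_freq, rng):
    """Draw n genotypes (0, 1 or 2 risk alleles) as two independent alleles."""
    return [(rng.random() < allele_freq) + (rng.random() < allele_freq)
            for _ in range(n)]


def allelic_odds_ratio(cases, controls):
    """Estimate the allelic OR from genotype lists.
    A continuity correction of 0.5 per cell avoids division by zero."""
    a = sum(cases) + 0.5                        # risk alleles in cases
    b = 2 * len(cases) - sum(cases) + 0.5       # other alleles in cases
    c = sum(controls) + 0.5                     # risk alleles in controls
    d = 2 * len(controls) - sum(controls) + 0.5 # other alleles in controls
    return (a * d) / (b * c)


rng = random.Random(1)  # fixed seed for reproducibility
for odds_ratio in (1.0, 1.5, 3.0):
    cases = simulate_genotypes(50, case_allele_freq(0.2, odds_ratio), rng)
    controls = simulate_genotypes(50, 0.2, rng)
    print(odds_ratio, round(allelic_odds_ratio(cases, controls), 2))
```

With only 50 individuals per group, the estimated OR can deviate substantially from the simulated one, which illustrates why small samples make moderate effects hard to detect.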
Results: The ability to identify disease-causing variants was assessed at the SNP level and at the gene level in terms of power and false discovery rate. The boosting algorithm used resampling techniques, and SNPs were selected based on how often they were included in the model (inclusion frequencies). Boosting results were compared across different inclusion-frequency thresholds, and the results of the other methods across different p-value thresholds. Regardless of the selected threshold and of the OR, boosting failed to detect variants at the SNP level and also appears to be inferior to logistic regression for identification at the gene level. Results for rvtests are not yet available.
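The idea of selecting SNPs by inclusion frequency can be sketched as follows: a variable selection procedure is run on many subsamples, and each SNP is scored by the fraction of runs in which it was selected. In this sketch the selector `top_k_by_mean_difference` is a hypothetical stand-in for the boosting fit and is not likelihood-based boosting. The stratified subsampling, the subsample fraction of 0.632, the 100 resamples, and the threshold of 0.8 are likewise assumptions made for illustration.

```python
import random


def top_k_by_mean_difference(genotypes, labels, k):
    """Stand-in selector (NOT boosting): indices of the k SNPs with the
    largest absolute case-control difference in mean genotype."""
    cases = [row for row, y in zip(genotypes, labels) if y == 1]
    controls = [row for row, y in zip(genotypes, labels) if y == 0]

    def difference(j):
        case_mean = sum(row[j] for row in cases) / len(cases)
        control_mean = sum(row[j] for row in controls) / len(controls)
        return abs(case_mean - control_mean)

    return sorted(range(len(genotypes[0])), key=difference, reverse=True)[:k]


def inclusion_frequencies(genotypes, labels, select, n_resamples, fraction, rng):
    """Fraction of subsamples in which each SNP is chosen by `select`.
    Cases and controls are subsampled separately so both groups are present."""
    case_idx = [i for i, y in enumerate(labels) if y == 1]
    control_idx = [i for i, y in enumerate(labels) if y == 0]
    counts = [0] * len(genotypes[0])
    for _ in range(n_resamples):
        idx = (rng.sample(case_idx, max(1, int(fraction * len(case_idx))))
               + rng.sample(control_idx, max(1, int(fraction * len(control_idx)))))
        sub_genotypes = [genotypes[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        for j in select(sub_genotypes, sub_labels):
            counts[j] += 1
    return [count / n_resamples for count in counts]


# Toy data: 20 cases, 20 controls, 5 SNPs; SNP 0 separates the groups perfectly,
# the remaining four SNPs are random noise.
rng = random.Random(0)
labels = [1] * 20 + [0] * 20
genotypes = [[2 * y] + [rng.choice([0, 1, 2]) for _ in range(4)] for y in labels]

freqs = inclusion_frequencies(
    genotypes, labels,
    lambda g, y: top_k_by_mean_difference(g, y, 2),
    n_resamples=100, fraction=0.632, rng=rng,
)
selected = [j for j, f in enumerate(freqs) if f >= 0.8]
print(freqs, selected)
```

Raising or lowering the inclusion-frequency threshold plays the same role for this kind of procedure as the p-value threshold does for the test-based methods.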
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011 Aug 15;27(16):2304-5. DOI: 10.1093/bioinformatics/btr341
2. Zhan X, Hu Y, Li B, Abecasis GR, Liu DJ. RVTESTS: an efficient and comprehensive tool for rare variant association analysis using sequence data. Bioinformatics. 2016 May 1;32(9):1423-6. DOI: 10.1093/bioinformatics/btw079
3. Binder H, Binder MH. Package ‘GAMBoost’. 2015.