gms | German Medical Science

65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS)

06.09. - 09.09.2020, Berlin (online conference)

Testing methods to analyze small sample size GWAS with different settings

Meeting Abstract

  • Alicia Poplawski - Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI), Universitätsmedizin der Johannes-Gutenberg-Universität Mainz, Mainz, Germany
  • Konstantin Strauch - Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI), Universitätsmedizin der Johannes-Gutenberg-Universität Mainz, Mainz, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 65th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), Meeting of the Central European Network (CEN: German Region, Austro-Swiss Region and Polish Region) of the International Biometric Society (IBS). Berlin, 06.-09.09.2020. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 446

doi: 10.3205/20gmds380, urn:nbn:de:0183-20gmds3800

Published: February 26, 2021

© 2021 Poplawski et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Background: Because effect sizes are small and the number of tested single-nucleotide polymorphisms (SNPs) is huge, genome-wide association studies (GWAS) require very large sample sizes to identify genetic variants associated with a genetic disease. Often, however, only fairly small samples are available. To determine the best way to identify SNPs and indels when comparing different experimental conditions (for example, tumor vs. matched normal samples) with only a small sample size, different analytical tools were tested on simulated data. Power and false discovery rate were calculated at several selection thresholds.
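With simulated data the truly associated variants are known, so power and the false discovery rate can be computed directly for any threshold. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code; the function name, the dictionary input format, and the toy p-values are assumptions made for illustration. It reports the empirical false discovery proportion of a single run, which estimates the false discovery rate when averaged over simulation replicates.

```python
def power_and_fdr(p_values, causal, threshold):
    """Return (power, empirical FDR) for one p-value threshold.

    p_values: dict mapping SNP id -> p-value (hypothetical input format)
    causal:   set of SNP ids that were simulated as truly associated
    """
    selected = {snp for snp, p in p_values.items() if p <= threshold}
    true_positives = len(selected & causal)
    # Power: share of the simulated causal SNPs that were selected.
    power = true_positives / len(causal) if causal else 0.0
    # Empirical FDR: share of selected SNPs that are not causal
    # (taken to be 0 when nothing is selected).
    fdr = (len(selected) - true_positives) / len(selected) if selected else 0.0
    return power, fdr


# Toy values, invented for illustration only.
p_values = {"rs1": 1e-9, "rs2": 3e-6, "rs3": 0.02, "rs4": 0.4, "rs5": 2e-8}
causal = {"rs1", "rs2", "rs3"}

for threshold in (5e-8, 1e-5, 0.05):
    power, fdr = power_and_fdr(p_values, causal, threshold)
    print(f"threshold={threshold:g}  power={power:.2f}  FDR={fdr:.2f}")
```

Relaxing the threshold in this toy example raises power while the share of false discoveries changes with it, which is the trade-off that a comparison across several thresholds makes visible.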

Methods: Whole-genome sequencing data of germline DNA were simulated for leukemia patients and unaffected controls by resampling 1000 Genomes Project data using HAPGEN2 [1]. Known leukemia SNPs were simulated across a wide range of odds ratio (OR) values. Different methods, such as rvtests [2], logistic regression, and likelihood-based boosting [3], were employed to analyze the simulated data.
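To give a feel for what varying the effect size means for the simulated data, the sketch below computes the genotype distribution among cases for a single risk SNP. It is not HAPGEN2's algorithm and does not reproduce the authors' settings. It assumes Hardy-Weinberg equilibrium in the population and a multiplicative model with a per-allele relative risk, which is close to the OR only for a rare disease; the function name and the risk allele frequency of 0.2 are hypothetical.

```python
def case_genotype_frequencies(risk_allele_freq, per_allele_rr):
    """Genotype frequencies (0, 1, 2 risk alleles) among cases.

    Assumes Hardy-Weinberg equilibrium in the population and a
    multiplicative model in which each risk allele multiplies the
    disease risk by per_allele_rr.
    """
    p, r = risk_allele_freq, per_allele_rr
    population = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Weight each genotype by its relative risk, then renormalize.
    weighted = [freq * r ** copies for copies, freq in enumerate(population)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical risk allele frequency of 0.2, illustrative effect sizes.
for rr in (1.0, 1.5, 2.0, 4.0):
    freqs = case_genotype_frequencies(0.2, rr)
    case_allele_freq = freqs[1] / 2 + freqs[2]
    print(f"RR={rr:g}  genotypes={[round(f, 3) for f in freqs]}  "
          f"risk allele freq in cases={case_allele_freq:.3f}")
```

With an effect size of 1 the cases match the population, and the separation between cases and controls grows with the effect size, which is why small effects are hard to detect in small samples.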

Results: The methods were evaluated on their ability to identify disease-causing variants at the SNP level and at the gene level, in terms of power and false discovery rate. The boosting algorithm used resampling, and SNPs were selected based on how often they were included in the model (inclusion frequencies). Boosting results were compared across different inclusion-frequency thresholds, and the other methods across different p-value thresholds. Regardless of the selected threshold and of the OR, boosting failed to detect variants at the SNP level and also appears to be inferior to logistic regression for identification at the gene level. Results for rvtests are not yet available.
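Selection by inclusion frequency can be sketched as follows. This is only the counting step: it assumes that a boosting model has already been fitted on each resample and that the set of SNPs it included is available. It does not call GAMBoost, and the function names and toy selections are hypothetical.

```python
from collections import Counter


def inclusion_frequencies(selections):
    """Share of resamples in which each SNP entered the model.

    selections: list of sets, one set of selected SNP ids per resample
                (assumed to come from a model fitted on that resample)
    """
    counts = Counter(snp for selected in selections for snp in selected)
    n_resamples = len(selections)
    return {snp: count / n_resamples for snp, count in counts.items()}


def select_by_frequency(frequencies, threshold):
    """Keep SNPs whose inclusion frequency reaches the threshold."""
    return {snp for snp, freq in frequencies.items() if freq >= threshold}


# Toy selections from four resamples, invented for illustration only.
selections = [{"rs1", "rs2"}, {"rs1"}, {"rs1", "rs3"}, {"rs1", "rs2"}]
frequencies = inclusion_frequencies(selections)

for threshold in (0.25, 0.5, 0.9):
    print(threshold, sorted(select_by_frequency(frequencies, threshold)))
```

The inclusion-frequency threshold plays the same role for boosting that the p-value threshold plays for the other methods, so the two can be compared on the same power and false discovery rate scale.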

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011 Aug 15;27(16):2304-5. DOI: 10.1093/bioinformatics/btr341
2. Zhan X, Hu Y, Li B, Abecasis GR, Liu DJ. RVTESTS: an efficient and comprehensive tool for rare variant association analysis using sequence data. Bioinformatics. 2016 May 1;32(9):1423-6. DOI: 10.1093/bioinformatics/btw079
3. Binder H, Binder MH. Package ‘GAMBoost’. 2015.