gms | German Medical Science

GMDS 2014: 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

07. - 10.09.2014, Göttingen

A genetic algorithm for generating correlated binary variables – with applications to logistic regression models

Meeting Abstract

  • K. Jung - Universitätsmedizin Göttingen, Göttingen

GMDS 2014. 59. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Göttingen, 07.-10.09.2014. Düsseldorf: German Medical Science GMS Publishing House; 2014. DocAbstr. 109

doi: 10.3205/14gmds176, urn:nbn:de:0183-14gmds1765

Published: September 4, 2014

© 2014 Jung.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.


Text

Introduction: Correlated binary variables arise in many scientific studies. Typical examples are the risk predictors of a disease in logistic regression models or repeated measures in a longitudinal setting. Evaluating statistical methods for the analysis of such data requires techniques for simulating random numbers from a multivariate binary distribution. One of the first techniques proposed for this purpose draws a sample from the multivariate normal distribution and then dichotomizes these data according to the desired marginal probabilities [1]. Other algorithms are based, for example, on correlated Poisson variables [2] or on multinomially distributed variables [3]. What these approaches have in common is that they propose different ways to fully determine the joint distribution of all variables, given their marginal means and their correlation structure. However, they generally either restrict the range of the input parameters or are not feasible for larger numbers of variables. The advantages and disadvantages of existing approaches are reviewed in [4].
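The dichotomization idea attributed to [1] above can be illustrated for two variables. The following is a minimal sketch, not the published method: the function names `draw_binary_pairs` and `binary_correlation` are hypothetical, and the correlation of the latent normal variables (`latent_rho`) is taken as given, whereas a full implementation would have to solve for the latent correlation that yields the desired correlation of the binary variables.

```python
import random
from statistics import NormalDist


def draw_binary_pairs(p1, p2, latent_rho, n, seed=0):
    """Draw n pairs of correlated binary values by dichotomizing a bivariate normal.

    p1, p2      desired marginal probabilities P(X1 = 1), P(X2 = 1)
    latent_rho  correlation of the underlying normal variables (an assumption of
                this sketch; it is NOT the correlation of the binary variables)
    """
    rng = random.Random(seed)
    # Cut points chosen so that P(Z <= cut) equals the desired marginal probability.
    cut1 = NormalDist().inv_cdf(p1)
    cut2 = NormalDist().inv_cdf(p2)
    scale = (1.0 - latent_rho ** 2) ** 0.5
    pairs = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = latent_rho * e1 + scale * e2  # standard normal, correlated with z1
        pairs.append((int(z1 <= cut1), int(z2 <= cut2)))
    return pairs


def binary_correlation(pairs):
    """Pearson correlation of the two binary columns."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    return cov / (m1 * (1 - m1) * m2 * (1 - m2)) ** 0.5
```

In such a sketch the correlation of the resulting binary variables is generally smaller in absolute value than `latent_rho`, which is one reason why the input parameters of this kind of approach have to be chosen with care.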

Methods: In this talk, a new genetic algorithm is proposed that iteratively determines the joint distribution of correlated binary variables. Its iteration performance is evaluated with respect to prespecified marginal means, correlation structures and numbers of variables. The applicability of the new approach is further studied in different scenarios of sample size planning for logistic regression models.
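The abstract does not describe the operators of the proposed algorithm, so the following is only a generic illustration of how a genetic algorithm could search for a joint distribution, not the author's method. All design choices are assumptions of this sketch: a candidate is a binary data matrix whose rows define the empirical joint distribution, the fitness is the largest absolute deviation from the target correlation matrix, mutation swaps two entries within one column, and crossover takes whole columns from either parent. Both operators leave the column means unchanged. The names `evolve`, `mutate`, `crossover`, `random_candidate` and `max_deviation` are hypothetical.

```python
import random


def column_correlation(a, b):
    """Pearson correlation of two binary columns (means must lie strictly in (0, 1))."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    return cov / (ma * (1 - ma) * mb * (1 - mb)) ** 0.5


def max_deviation(columns, target):
    """Fitness: largest absolute deviation from the target correlation matrix."""
    m = len(columns)
    return max(abs(column_correlation(columns[i], columns[j]) - target[i][j])
               for i in range(m) for j in range(i + 1, m))


def random_candidate(means, n_rows, rng):
    """One shuffled column per variable, with the number of ones fixed by its mean."""
    columns = []
    for p in means:
        ones = round(p * n_rows)
        col = [1] * ones + [0] * (n_rows - ones)
        rng.shuffle(col)
        columns.append(col)
    return columns


def mutate(columns, rng):
    """Swap two entries within one randomly chosen column; the column mean is unchanged."""
    child = [col[:] for col in columns]
    col = rng.choice(child)
    i, j = rng.randrange(len(col)), rng.randrange(len(col))
    col[i], col[j] = col[j], col[i]
    return child


def crossover(a, b, rng):
    """Take each whole column from one of the two parents; column means are unchanged."""
    return [(ca if rng.random() < 0.5 else cb)[:] for ca, cb in zip(a, b)]


def evolve(means, target, n_rows=40, pop_size=30, generations=200, tol=0.02, seed=1):
    rng = random.Random(seed)
    population = [random_candidate(means, n_rows, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: max_deviation(c, target))
        if max_deviation(population[0], target) <= tol:
            break  # prespecified precision reached
        parents = population[:pop_size // 2]  # the better half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = min(population, key=lambda c: max_deviation(c, target))
    return best, max_deviation(best, target)
```

A call such as `evolve([0.5, 0.25, 0.5], [[1, 0.3, 0.2], [0.3, 1, 0.1], [0.2, 0.1, 1]])` returns the best data matrix found and its remaining deviation. Random samples with this joint distribution could then be drawn by sampling rows of the matrix with replacement.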

Results: The proposed technique can cope with a wide range of input parameters. For small numbers of variables (≤5), the joint distribution is found within an acceptable number of steps, i.e. within seconds; for larger numbers of variables, the iteration can take minutes or longer. The precision of the iteration can be prespecified in terms of the deviation from the given correlation matrix. The marginal means are usually represented exactly by this method.

Discussion: Compared with existing methods, the genetic algorithm is more flexible with regard to the input parameters, although it can lack precision for larger numbers of variables. The new technique is also useful for determining, by simulation, the required sample size for logistic regression models.
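Simulation-based sample size planning for a logistic regression model with correlated binary predictors can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the talk: the two predictors are generated by dichotomizing a bivariate normal (not by the genetic algorithm), the model is fitted by Newton-Raphson, and power is estimated for a two-sided Wald test of the first predictor at the 5% level. The names `estimated_power`, `fit_logistic` and `invert` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix (ZeroDivisionError if singular)."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def fit_logistic(rows, ys, iterations=25):
    """Newton-Raphson fit; returns the estimates and the inverse information matrix
    from the last iteration (used as the approximate covariance of the estimates)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    info[i][j] += p * (1 - p) * x[i] * x[j]
        cov = invert(info)
        step = [sum(cov[i][j] * grad[j] for j in range(k)) for i in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta, cov


def estimated_power(n, betas, means, latent_rho, replications=200, seed=0):
    """Share of simulated data sets of size n in which the Wald test rejects beta_1 = 0."""
    rng = random.Random(seed)
    cut1, cut2 = (NormalDist().inv_cdf(p) for p in means)
    scale = math.sqrt(1.0 - latent_rho ** 2)
    rejections = 0
    for _ in range(replications):
        rows, ys = [], []
        for _ in range(n):
            e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x = (1.0, float(e1 <= cut1), float(latent_rho * e1 + scale * e2 <= cut2))
            rows.append(x)
            ys.append(int(rng.random() < sigmoid(sum(b * v for b, v in zip(betas, x)))))
        try:
            beta_hat, cov = fit_logistic(rows, ys)
            z = beta_hat[1] / math.sqrt(cov[1][1])
        except (ZeroDivisionError, ValueError, OverflowError):
            continue  # degenerate data set (e.g. separation): counted as no rejection
        rejections += abs(z) > 1.96
    return rejections / replications
```

Evaluating `estimated_power` over a grid of candidate values of `n` and choosing the smallest `n` that reaches the desired power would give a simulated sample size.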


References

1. Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. Am Stat. 1991;45:302-4.
2. Park C, Park T, Shin D. A simple method for generating correlated binary variates. Am Stat. 1996;50:306-10.
3. Kang SH, Jung SH. Generating binary variables with complete specification of the joint distribution. Biom J. 2001;43:263-9.
4. Farrell P, Rogers-Stewart K. Methods for generating longitudinally correlated binary data. Int Stat Rev. 2008;76:28-38.