A genetic algorithm for generating correlated binary variables – with applications to logistic regression models
Published: 4 September 2014
Introduction: Correlated binary variables arise in many scientific studies. Typical examples are the risk predictors of a disease in logistic regression models or repeated measures in a longitudinal setting. Evaluating statistical methods for the analysis of such data requires techniques for simulating random numbers from a multivariate binary distribution. One of the first techniques proposed for this purpose simply draws a sample from the multivariate normal distribution and then dichotomizes these data according to the desired marginal probabilities [1]. Other algorithms are based, for example, on correlated Poisson variables [2] or multinomially distributed variables [3]. What these approaches have in common is that they propose different ways to fully determine the joint distribution of all variables, given their marginal means and their correlation structure. However, they generally either restrict the range of the input parameters or are not feasible for a larger number of variables. The advantages and disadvantages of existing approaches are reviewed in [4].
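To make the dichotomization idea behind [1] concrete, the following is a minimal Python sketch, not the published algorithm. It draws correlated standard normal vectors and sets a variable to 1 when its normal value falls below the quantile matching the desired marginal probability. The function names and the `latent_corr` argument are assumptions of this sketch. Note that the correlation of the resulting binary variables is generally smaller in magnitude than the latent normal correlation, so the method in [1] additionally solves for latent correlations that produce the desired binary correlations; that step is omitted here.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_dichotomized_normal(means, latent_corr, n, rng):
    """Draw n binary vectors by thresholding correlated standard normals.

    latent_corr is the correlation of the underlying normals (an assumption
    of this sketch), not the correlation of the binary output.
    """
    d = len(means)
    lower = cholesky(latent_corr)
    # Threshold z_j chosen so that P(Z_j <= z_j) = means[j] for standard normal Z_j.
    thresholds = [NormalDist().inv_cdf(p) for p in means]
    rows = []
    for _ in range(n):
        noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Multiply by the Cholesky factor to induce the latent correlation.
        z = [sum(lower[i][k] * noise[k] for k in range(i + 1)) for i in range(d)]
        rows.append([1 if z[j] <= thresholds[j] else 0 for j in range(d)])
    return rows
```

For example, `sample_dichotomized_normal([0.3, 0.6], [[1.0, 0.5], [0.5, 1.0]], 10_000, random.Random(1))` returns 10,000 rows whose column means are close to 0.3 and 0.6. The sketch fails with a `ValueError` if `latent_corr` is not positive definite, which illustrates one kind of restriction on the input parameters.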
Methods: In this talk, a new genetic algorithm for iteratively determining the joint distribution of correlated binary variables is proposed. Its iteration performance is evaluated with respect to the prespecified marginal means, the correlation structure and the number of variables. The applicability of the new approach is further studied in different scenarios of sample size planning for logistic regression models.
Results: The proposed technique can cope with a wide range of input parameters. For small numbers of variables (≤5), the joint distribution is found within an acceptable number of steps, i.e. within seconds; for larger numbers of variables, the iteration can take minutes or longer. The precision of the iteration can be prespecified in terms of the deviation from the given correlation matrix. The method usually reproduces the marginal means exactly.
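The abstract does not describe the internals of the genetic algorithm, so the following Python sketch only illustrates how such a search could be organized; every design choice in it is an assumption and not the authors' method. A candidate is a table of `n_rows` binary rows whose pattern frequencies stand for the joint distribution. Each column holds a fixed number of ones, so the marginal means stay exact (up to rounding to a multiple of `1 / n_rows`), and the search stops once the largest deviation from the target correlation matrix is at most `tol`. All names (`evolve`, `mutate`, `crossover`, `tol`, and so on) are hypothetical.

```python
import math
import random
from collections import Counter


def correlation(col_a, col_b):
    """Pearson correlation of two 0/1 columns (means must lie strictly in (0, 1))."""
    n = len(col_a)
    mean_a, mean_b = sum(col_a) / n, sum(col_b) / n
    cov = sum(a * b for a, b in zip(col_a, col_b)) / n - mean_a * mean_b
    return cov / math.sqrt(mean_a * (1 - mean_a) * mean_b * (1 - mean_b))


def deviations(columns, target_corr):
    """Empirical minus target correlation for every pair of variables."""
    d = len(columns)
    return [correlation(columns[i], columns[j]) - target_corr[i][j]
            for i in range(d) for j in range(i + 1, d)]


def loss(columns, target_corr):
    """Search criterion (assumed): sum of squared correlation deviations."""
    return sum(dev * dev for dev in deviations(columns, target_corr))


def max_deviation(columns, target_corr):
    """Stopping criterion: largest absolute correlation deviation."""
    return max(abs(dev) for dev in deviations(columns, target_corr))


def random_individual(means, n_rows, rng):
    """One candidate: per variable, a shuffled column with a fixed number of ones."""
    columns = []
    for p in means:
        ones = round(p * n_rows)
        column = [1] * ones + [0] * (n_rows - ones)
        rng.shuffle(column)
        columns.append(column)
    return columns


def mutate(columns, rng):
    """Swap two entries inside one column; the column mean is unchanged."""
    child = [column[:] for column in columns]
    column = rng.choice(child)
    i, j = rng.randrange(len(column)), rng.randrange(len(column))
    column[i], column[j] = column[j], column[i]
    return child


def crossover(parent_a, parent_b, rng):
    """Copy each whole column from one of the parents; column means are unchanged."""
    return [(a if rng.random() < 0.5 else b)[:] for a, b in zip(parent_a, parent_b)]


def evolve(means, target_corr, n_rows=100, pop_size=10, tol=0.05,
           max_generations=5000, seed=0):
    """Return (best candidate, its largest deviation, generations used)."""
    rng = random.Random(seed)
    population = sorted((random_individual(means, n_rows, rng) for _ in range(pop_size)),
                        key=lambda ind: loss(ind, target_corr))
    generation = 0
    while max_deviation(population[0], target_corr) > tol and generation < max_generations:
        survivors = population[: pop_size // 2]  # the better half is kept unchanged
        children = [mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
                    for _ in range(pop_size - len(survivors))]
        population = sorted(survivors + children, key=lambda ind: loss(ind, target_corr))
        generation += 1
    best = population[0]
    return best, max_deviation(best, target_corr), generation


def joint_distribution(columns):
    """Relative frequency of each 0/1 pattern across the rows of a candidate."""
    counts = Counter(zip(*columns))
    return {pattern: count / len(columns[0]) for pattern, count in counts.items()}
```

Calling `evolve([0.5, 0.5, 0.5], target)` with a 3×3 target matrix and then `joint_distribution(best)` yields a table of pattern probabilities. In this sketch the achievable precision is limited by `n_rows`, because correlations can only change in steps determined by the number of rows, and the search is not guaranteed to reach `tol` within `max_generations`.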
Discussion: In contrast to existing methods, the genetic algorithm is more flexible with regard to the input parameters, although it can lack precision for larger numbers of variables. The new technique is also useful for determining, by simulation, the sample size required for logistic regression models.
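The abstract gives no detail on how sample sizes are simulated. One plausible procedure, shown here purely as an assumed sketch, is to draw predictors from a given joint distribution, generate outcomes from a logistic model, fit the model, and record how often a Wald test rejects the null hypothesis for the coefficient of interest. The function names, the `joint` dictionary format (pattern tuple mapped to probability) and the grid of candidate sample sizes are all hypothetical.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular information matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def fit_logistic(rows, outcomes, max_iter=25):
    """Newton-Raphson maximum likelihood fit; returns (coefficients, standard errors)."""
    k = len(rows[0]) + 1  # intercept plus one coefficient per predictor
    # Identical (predictors, outcome) records are pooled and weighted by their count.
    cells = Counter((tuple(row), y) for row, y in zip(rows, outcomes))
    beta = [0.0] * k
    for _ in range(max_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for (row, y), count in cells.items():
            x = (1.0,) + tuple(float(v) for v in row)
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for a in range(k):
                score[a] += count * (y - mu) * x[a]
                for b in range(k):
                    info[a][b] += count * mu * (1.0 - mu) * x[a] * x[b]
        cov = invert(info)
        step = [sum(cov[a][b] * score[b] for b in range(k)) for a in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta, [math.sqrt(cov[a][a]) for a in range(k)]


def simulate_power(joint, beta, n, reps=500, alpha=0.05, seed=0):
    """Share of simulated studies in which a Wald test rejects beta[1] = 0."""
    rng = random.Random(seed)
    patterns = list(joint)
    weights = [joint[p] for p in patterns]
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        rows = rng.choices(patterns, weights=weights, k=n)
        outcomes = []
        for row in rows:
            eta = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            outcomes.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
        try:
            estimate, se = fit_logistic(rows, outcomes)
        except (ValueError, ArithmeticError):
            continue  # failed fit (e.g. an empty category); counted as no rejection
        if abs(estimate[1] / se[1]) > critical:
            rejections += 1
    return rejections / reps


def smallest_n(joint, beta, target_power=0.8, candidates=range(50, 1001, 50), **kwargs):
    """First candidate sample size whose simulated power reaches target_power."""
    for n in candidates:
        if simulate_power(joint, beta, n, **kwargs) >= target_power:
            return n
    return None
```

With `joint = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.15, (1, 1): 0.35}` and `beta = [-0.5, 1.0, 0.5]`, a call such as `smallest_n(joint, beta, reps=500)` scans the candidate grid for the first sample size with simulated power of at least 0.8. Because the power estimates are themselves random, the result depends on `reps` and `seed`.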
References
1. Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. Am Stat. 1991;45:302-4.
2. Park C, Park T, Shin D. A simple method for generating correlated binary variates. Am Stat. 1996;50:306-10.
3. Kang SH, Jung SH. Generating binary variables with complete specification of the joint distribution. Biom J. 2001;43:263-9.
4. Farrell P, Rogers-Stewart K. Methods for generating longitudinally correlated binary data. Int Stat Rev. 2008;76:28-38.