Pseudo R-squared measures for generalised linear regression models
Published: September 14, 2004
A regression model provides a rather simple image of a situation that is, in general, intrinsically complex. The importance of potential prognostic factors is assessed in order to improve the prognosis of the outcome variable of interest. The closer the model is to reality, the more of the outcome variable's variability can be explained. In the common linear regression model with a normally distributed outcome, the fraction of explained variability is quantified by the coefficient of determination, also called the R-squared measure. It provides information in addition to the parameter estimates, p-values and confidence intervals of the covariates. In the ideal case of a perfect prognosis, R-squared attains its upper bound of one; conversely, if the model can explain nothing at all, the lower bound of zero is attained.
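The coefficient of determination described above can be computed directly from the residual and total sums of squares. A minimal sketch in Python, using a simple least-squares line fit on made-up illustrative data (the function name and data are hypothetical, not from the cited papers):

```python
def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total for a least-squares line fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Ordinary least-squares slope and intercept.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```

For data lying exactly on a line the function returns one (perfect prognosis); for data unrelated to the covariate it approaches zero, matching the bounds discussed above.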
As R-squared is well known and commonly used in the linear regression model, attempts have been made to define pseudo R-squared values for generalised linear models as well [Ref. 1], [Ref. 2], [Ref. 3], [Ref. 4], although the generalisation to non-normal data is not straightforward. Recommended pseudo R-squared measures are based either on the concept of deviance or on sums of squares, and the two approaches may yield different estimates for the same data. Advantages and disadvantages of the different approaches are compared and discussed.
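To make the deviance-based concept concrete, a common form of such a measure for logistic regression is one minus the ratio of the model deviance to the null-model deviance. The sketch below illustrates that general idea on hypothetical fitted probabilities; the exact definitions and refinements in the cited papers may differ:

```python
import math

def binomial_deviance(y, p):
    """-2 x log-likelihood for binary outcomes y with fitted probabilities p."""
    return sum(-2.0 * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
               for yi, pi in zip(y, p))

def deviance_r_squared(y, p_model):
    """Deviance-based pseudo R^2: 1 - D_model / D_null."""
    # Null model: every fitted probability equals the overall event rate.
    p_null = sum(y) / len(y)
    d_null = binomial_deviance(y, [p_null] * len(y))
    d_model = binomial_deviance(y, p_model)
    return 1.0 - d_model / d_null
```

A model whose fitted probabilities equal the overall event rate explains nothing and scores zero; probabilities close to the observed outcomes push the measure towards one.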
Regression models are often used to screen for prognostic factors, even in situations where the sample size is rather small compared to the number of covariates. By definition, an R-squared measure increases monotonically as covariates are added to the model, even if they are uncorrelated with the outcome of interest. That is, unadjusted R-squared measures may be substantially inflated, jeopardizing the ability to draw valid interpretations. R-squared values of 30 percent or higher can easily be reached even when no association between independent and dependent variables exists at all. The use of bias-adjusted R-squared measures, which also take the number of fitted parameters into account, is well established in linear regression models. For generalised linear models, a shrinkage-based adjustment of the deviance-based pseudo R-squared measure is proposed, so that the expectation of the adjusted pseudo R-squared measure corresponds to the underlying population value [Ref. 3], [Ref. 4], [Ref. 5], [Ref. 6]. Furthermore, we show that the resulting adjustment coincides with the adjusted R-squared measure in linear regression. The adjustment can also be generalised to the case of over- and under-dispersed Poisson regression models [Ref. 7].
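The abstract states that the proposed GLM adjustment coincides with the classical adjusted R-squared of linear regression; that classical form is sketched below for illustration. The shrinkage adjustment actually derived in the cited papers may take a different algebraic route, so this is only the linear-model benchmark, not their method:

```python
def adjusted_r_squared(r2, n, k):
    """Classical linear-model adjustment of R^2.

    n: sample size, k: number of fitted covariates (intercept excluded).
    Penalises the monotone increase of R^2 with added covariates.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```

With a small sample and many covariates the penalty is severe, which is exactly the screening situation described above: a nominal R-squared of 30 percent can shrink to near zero, or even below it, once the number of fitted parameters is accounted for.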
In summary, correctly adjusted R-squared values give essential information in addition to the usual modelling results, since they allow quantification of the current knowledge (or nescience) about the outcome variable of interest.
References
- 1.
- Mittlböck M, Schemper M. Explained variation for logistic regression. Statistics in Medicine 1996; 15: 1987-1997.
- 2.
- Mittlböck M, Schemper M. Computing measures of explained variation for logistic regression models. Computer Methods and Programs in Biomedicine 1999; 58: 17-24.
- 3.
- Mittlböck M, Heinzl H. Measures of explained variation in Gamma regression model. Communications in Statistics - Simulation and Computation 2002; 31: 61-73.
- 4.
- Heinzl H, Mittlböck M. R-squared measures for the inverse Gaussian regression model. Computational Statistics 2002; 17: 525-544.
- 5.
- Mittlböck M, Waldhör T. Adjustments for R²-Measures for Poisson regression models. Computational Statistics & Data Analysis 2000; 34: 461-472.
- 6.
- Mittlböck M. Calculating Adjusted R² Measures for Poisson Regression Models. Computer Methods and Programs in Biomedicine 2002; 68: 205-214.
- 7.
- Heinzl H, Mittlböck M. Pseudo R-squared measures for Poisson regression models with over- or under-dispersion. Computational Statistics and Data Analysis 2003; 44: 253-271.