Screening for (multi)collinearity in a regression model

I hope this question has not already been asked and answered ... here goes: (multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it ... well, sometimes you don't need to "cure" collinearity, since it does not affect the regression model itself, but rather the interpretation of the effects of the individual predictors.

One way to detect collinearity is to take each predictor in turn as the dependent variable, with the other predictors as independent variables, and determine its R^2; if it is greater than 0.9 (or 0.95), we can consider that predictor redundant. This is one "method" ... what about other approaches? Some of them take a lot of time, for example excluding predictors from the model one at a time and observing the changes in the b-coefficients - they should change markedly if collinearity is present.
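As a rough sketch of that screening step (made-up data; the variable names and the 0.9 cut-off here are only for illustration), each predictor is regressed on the others and its R^2 compared with the threshold:

 > set.seed(1)
 > d <- data.frame(a = rnorm(50), b = rnorm(50))
 > d$c <- d$a + d$b + rnorm(50, sd = 0.01)   # 'c' is nearly a linear combination of 'a' and 'b'
 > r2 <- sapply(names(d), function(v)
 +   summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared)
 > r2 > 0.9                                  # TRUE flags a (near-)redundant predictor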

Of course, we always have to take into account the specific context / purpose of the analysis ... Sometimes the only remedy is to repeat the study, but right now I am interested in the various ways to screen for redundant predictors when (a lot of) collinearity occurs in a regression model.

Tags: r, statistics, regression
5 answers

The kappa() function may help. Here is a simulation example:

 > set.seed(42)
 > x1 <- rnorm(100)
 > x2 <- rnorm(100)
 > x3 <- x1 + 2*x2 + rnorm(100)*0.0001   # so x3 is approx. a linear comb. of x1 and x2
 > mm12 <- model.matrix(~ x1 + x2)       # normal model, two indep. regressors
 > mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
 > kappa(mm12)                           # a 'low' kappa is good
 [1] 1.166029
 > kappa(mm123)                          # a 'high' kappa indicates trouble
 [1] 121530.7

and we go further, making the third regressor more and more collinear:

 > x4 <- x1 + 2*x2 + rnorm(100)*0.000001 # even more collinear
 > mm124 <- model.matrix(~ x1 + x2 + x4)
 > kappa(mm124)
 [1] 13955982
 > x5 <- x1 + 2*x2                       # now x5 is an exact linear comb. of x1, x2
 > mm125 <- model.matrix(~ x1 + x2 + x5)
 > kappa(mm125)
 [1] 1.067568e+16

This uses approximations; see help(kappa) for details.


To add to what Dirk said about the condition number method: a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, besides the condition number, include:

1) The determinant of the covariance matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity):

 # using Dirk's example
 > det(cov(mm12[,-1]))
 [1] 0.8856818
 > det(cov(mm123[,-1]))
 [1] 8.916092e-09

2) Using the fact that the determinant of a matrix is the product of its eigenvalues: the presence of one or more small eigenvalues indicates collinearity.

 > eigen(cov(mm12[,-1]))$values
 [1] 1.0876357 0.8143184
 > eigen(cov(mm123[,-1]))$values
 [1] 5.388022e+00 9.862794e-01 1.677819e-09

3) The variance inflation factor (VIF). The VIF for predictor i is 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of predictor i on the remaining predictors. Collinearity is present when the VIF for at least one independent variable is large. Rule of thumb: VIF > 10 is cause for concern. For an implementation in R see here. I would also add that using R^2 to detect collinearity should go hand in hand with visual inspection of the scatterplots, because a single outlier can "cause" collinearity where it does not exist, or can hide collinearity where it does exist.
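As a quick sketch of that formula, reusing Dirk's simulated x1, x2, x3 from the first answer (assumed still in the workspace), one VIF can be computed by hand:

 > r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared   # R^2 from regressing x3 on the other predictors
 > 1 / (1 - r2_x3)                                # VIF for x3; enormous here, flagging trouble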


You may like Vito Ricci's reference card "R Functions for Regression Analysis": http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf

It briefly lists many useful regression-related functions in R, including diagnostic functions. In particular, it lists the vif function from the car package, which can assess multicollinearity: http://en.wikipedia.org/wiki/Variance_inflation_factor

Consideration of multicollinearity often goes hand in hand with issues of variable importance assessment. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/


See also section 9.4 in this book: Practical Regression and Anova Using R [Faraway 2002] .

Collinearity can be detected in several ways:

  • Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.

  • Regressing x_i on all the other predictors gives R^2_i. Repeat for all predictors. An R^2_i close to one indicates a problem - the offending linear combination can then be identified.

  • Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The condition number in the 2-norm can be shown to be the ratio of the largest to the smallest nonzero singular value of the matrix ($\kappa = \sqrt{\lambda_1 / \lambda_p}$; see ?kappa); $\kappa \ge 30$ is considered large. (A short sketch follows this list.)
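A minimal sketch of these checks, reusing the near-collinear model matrix mm123 from the first answer (assumed available in the workspace):

 > cor(mm123[, -1])                      # pairwise correlations of the predictors (first bullet)
 > e <- eigen(t(mm123) %*% mm123)$values
 > e                                     # one or more tiny eigenvalues signal a problem
 > sqrt(max(e) / min(e))                 # 2-norm condition number; compare with kappa(mm123)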


Since the VIF deserves its own answer, I will add mine. A variance inflation factor > 10 usually indicates serious redundancy between predictor variables. The VIF tells you the factor by which the variance of a predictor's estimated coefficient is inflated because of its correlation with the other variables.

vif() is available in the car package and is applied to an object of class lm. It returns the VIF for each of x1, x2, ..., xn in the lm() object. It is a good idea to exclude variables with a VIF > 10, or to apply transformations to the variables with a VIF > 10.
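A minimal sketch of that usage, assuming the car package is installed and reusing Dirk's simulated predictors from the first answer (the response y is constructed here purely for illustration):

 > library(car)
 > y <- x1 + x2 + rnorm(100)     # arbitrary illustrative response
 > fit <- lm(y ~ x1 + x2 + x3)
 > vif(fit)                      # one VIF per predictor; values > 10 suggest exclusion or transformation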



