lm coefficients with different orders of variables in the formula

I am trying to analyze the results of a linear model in R. In particular, I am interested in the p-values reported for the independent variables in the summary of the lm object (I know there are more sophisticated ways of comparing the relevance of variables, but some comparisons in the past convinced me that these p-values will do for a preliminary analysis). I was convinced that these p-values do not depend on the order in which the variables appear in the formula (unlike with anova, for example, where the order does matter), so I am puzzled by some results I get on fake data:

> x <- rnorm(100)
> y <- 2*x
> xJ <- jitter(x)
> lm1 <- lm(y~x)
> lm2 <- lm(y~x+xJ)
> lm3 <- lm(y~xJ+x)
> summary(lm1)$coefficients
                 Estimate   Std. Error       t value  Pr(>|t|)
(Intercept) -2.220446e-17 4.064501e-17 -5.463023e-01 0.5860998
x            2.000000e+00 4.037817e-17  4.953172e+16 0.0000000
> summary(lm2)$coefficients
                Estimate   Std. Error      t value  Pr(>|t|)
(Intercept) 0.000000e+00 4.271540e-17 0.000000e+00 1.0000000
x           2.000000e+00 3.534137e-13 5.659091e+12 0.0000000
xJ          4.147502e-13 3.534140e-13 1.173553e+00 0.2434475
> summary(lm3)$coefficients
                 Estimate   Std. Error       t value      Pr(>|t|)
(Intercept) -1.594538e-18 5.512644e-21 -2.892511e+02 3.147977e-144
xJ          -3.531641e-16 4.560990e-17 -7.743146e+00  9.391428e-12
x            2.000000e+00 4.560986e-17  4.385017e+16  0.000000e+00

Where is my mistake?

thanks

1 answer

After thinking a little more about it, I think that, apart from any strange floating-point problems, the cause of the instability in the coefficients is multicollinearity: x and xJ are almost perfectly correlated. A quick look at the variance inflation factors (VIFs):

 library(car)
 vif(lm2)
         x        xJ
 103233533 103233533

VIFs exceeding 5 are generally considered worth a closer look, so with values around 10^8, as here, it is not surprising that the coefficients move around.
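
To see where such a number comes from, here is a minimal base-R sketch (no car needed). It uses the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other. The seed, the extra predictor z, the helper vif_by_hand, and the noise added to y are my own assumptions for illustration and are not part of the question's setup. The noise is there only so that the residuals are not at machine precision.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
x  <- rnorm(100)
xJ <- jitter(x)              # nearly identical to x, as in the question
z  <- rnorm(100)             # hypothetical unrelated predictor, for contrast
y  <- 2 * x + rnorm(100)     # noise added (an assumption) so the fit is not exact

# VIF of `target` in a two-predictor model: 1 / (1 - R^2),
# where R^2 comes from regressing `target` on the other predictor.
vif_by_hand <- function(target, other) {
  r2 <- summary(lm(target ~ other))$r.squared
  1 / (1 - r2)
}

vif_collinear   <- vif_by_hand(x, xJ)  # very large: xJ is almost a copy of x
vif_independent <- vif_by_hand(x, z)   # close to 1

# With well-conditioned predictors, changing the order of the terms
# only changes the order of the rows in the coefficient table.
fit_xz <- summary(lm(y ~ x + z))$coefficients
fit_zx <- summary(lm(y ~ z + x))$coefficients
same_table <- all.equal(fit_xz[c("x", "z"), ], fit_zx[c("x", "z"), ])

print(c(collinear = vif_collinear, independent = vif_independent))
print(same_table)
```

With only two predictors, both have the same VIF, which matches the identical values reported by vif(lm2). When the predictors are not nearly collinear, the estimates, standard errors and p-values should agree for either ordering up to rounding error, so the order dependence in the question looks like a symptom of the ill-conditioned design rather than a property of lm itself.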
