Linear regression using lm() - surprised by the result

I fitted a linear regression to my data using the lm function. Everything runs (there is no error message), but the result surprises me: it looks as if R "skips" a group of points, i.e. the intercept and slope do not fit them. I mean, for example, the group of points around x = 15-25, y = 0-20.

My questions:

  • Is there a function for comparing "expected" coefficients against the coefficients that lm calculates?
  • Did I make a silly coding mistake that forces lm to do this?

Following some answers: more info on x and y

x and y are visual estimates of disease symptoms. They have the same uncertainty.

[Figure: plot of the data with the fitted regression line and the line I expected]

Data and code are here:

    x1 = c(24.0, 23.9, 23.6, 21.6, 21.0, 20.8, 22.4, 22.6,
           21.6, 21.2, 19.0, 19.4, 21.1, 21.5, 21.5, 20.1, 20.1,
           20.1, 17.2, 18.6, 21.5, 18.2, 23.2, 20.4, 19.2, 22.4,
           18.8, 17.9, 19.1, 17.9, 19.6, 18.1, 17.6, 17.4, 17.5,
           17.5, 25.2, 24.4, 25.6, 24.3, 24.6, 24.3, 29.4, 29.4,
           29.1, 28.5, 27.2, 27.9, 31.5, 31.5, 31.5, 27.8, 31.2,
           27.4, 28.8, 27.9, 27.6, 26.9, 28.0, 28.0, 33.0, 32.0,
           34.2, 34.0, 32.6, 30.8)
    y1 = c(100.0, 95.5, 93.5, 100.0, 98.5, 99.5, 34.8,
           45.8, 47.5, 17.4, 42.6, 63.0, 6.9, 12.1, 30.5,
           10.5, 14.3, 41.1, 2.2, 20.0, 9.8, 3.5, 0.5, 3.5, 5.7,
           3.1, 19.2, 6.4, 1.2, 4.5, 5.7, 3.1, 19.2, 6.4, 1.2,
           4.5, 81.5, 70.5, 91.5, 75.0, 59.5, 73.3, 66.5,
           47.0, 60.5, 47.5, 33.0, 62.5, 87.0, 86.0, 77.0,
           86.0, 83.0, 78.5, 83.0, 83.5, 73.0, 69.5, 82.5, 78.5,
           84.0, 93.5, 83.5, 96.5, 96.0, 97.5)

    ## x11()
    plot(x1, y1, xlim = c(0, 35), ylim = c(0, 100))

    # linear regression
    reg_lin = lm(y1 ~ x1)
    abline(reg_lin, lty = "solid", col = "royalblue")
    text(12.5, 25, labels = "R result", col = "royalblue", cex = 0.85)
    text(12.5, 20, labels = bquote(y == .(5.26) * x - .(76)), col = "royalblue", cex = 0.85)

    # result I would have imagined
    abline(a = -150, b = 8, lty = "dashed", col = "red")
    text(27.5, 25, labels = "What I think is better", col = "red", cex = 0.85)
    text(27.5, 20, labels = bquote(y == .(8) * x - .(150)), col = "red", cex = 0.85)
Tags: r, linear-regression, lm, least-squares, orthogonal
2 answers

Try the following:

    reg_lin_int <- reg_lin$coefficients[1]
    reg_lin_slp <- reg_lin$coefficients[2]

    sum((y1 - (reg_lin_int + reg_lin_slp * x1))^2)
    # [1] 39486.33
    sum((y1 - (-150 + 8 * x1))^2)
    # [1] 55583.18

Note that the sum of squared residuals is lower for the lm line. This is to be expected, since reg_lin_int and reg_lin_slp are guaranteed to produce the minimum sum of squared errors.
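If you want to verify this independently, the OLS coefficients have a simple closed form, so you can recompute them by hand (a quick sketch using base R only):

    # Closed-form OLS: slope = cov(x, y) / var(x), intercept from the means
    slope_ols <- cov(x1, y1) / var(x1)
    int_ols <- mean(y1) - slope_ols * mean(x1)
    c(int_ols, slope_ols)  # matches coef(reg_lin)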

Intuitively, we know that estimators based on a squared loss function are sensitive to outliers. The lm line "misses" the group at the bottom because it is pulled toward the group in the upper left corner, which lies much farther from the bulk of the data - and squaring the distances gives those points much more weight.
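You can see how much pull that upper-left group has by comparing its share of the total loss under squared versus absolute loss (a rough sketch; the selector y1 > 90 & x1 < 25 for that group is my assumption, not from the original):

    # Residuals from the lm fit, and the six points in the upper left
    res <- residuals(reg_lin)
    grp <- y1 > 90 & x1 < 25  # assumed selector for the upper-left group

    # Share of the total loss contributed by that group:
    sum(res[grp]^2) / sum(res^2)        # under squared loss
    sum(abs(res[grp])) / sum(abs(res))  # under absolute loss

Squaring inflates the share of these far-away points, which is exactly why they dominate the OLS fit.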

In fact, if we use least absolute deviations regression (i.e., specify an absolute loss function instead of a squared one), the result is much closer to your guess:

    library(quantreg)
    lad_reg <- rq(y1 ~ x1)
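To reproduce the plot below, you can overlay the fitted LAD line on the existing plot (the line style and color here are my own choice, not from the original):

    abline(lad_reg, lty = "dashed", lwd = 2, col = "darkorange")  # add the LAD line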

[Figure: data with the LAD regression line added]

(Tip: use lwd to make the lines in your plots easier to see.)

Something that gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your example:

    # TLS via PCA: the first principal component of (x1, y1) points along the
    # line that minimizes the summed squared orthogonal distances
    v <- prcomp(cbind(x1, y1))$rotation
    bbeta <- v[2, 1] / v[1, 1]             # slope: y-component over x-component of PC1
    inter <- mean(y1) - bbeta * mean(x1)   # the TLS line passes through the centroid
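Unlike OLS, which measures errors vertically, TLS measures them perpendicular to the fitted line, so it treats x and y symmetrically. To add the TLS line to the plot (again, line style and color are my own choice):

    abline(inter, bbeta, lty = "dotdash", lwd = 2, col = "purple")  # add the TLS line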

[Figure: data with the TLS line added]


You already have a good answer, but maybe this is also useful:

As you know, OLS minimizes the sum of squared errors in the y direction. This is appropriate when the uncertainty of your x values is negligible, which is often the case. But perhaps that is not true for your data. If we assume that the uncertainties in x and y are equal and use Deming regression, we get a fit much more like what you expected.

    library(MethComp)
    dem_reg <- Deming(x1, y1)
    abline(dem_reg[1:2], col = "green")
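If the two uncertainties are not actually equal, Deming() also accepts an assumed ratio of the error variances via its vr argument (a sketch; vr defaults to 1, and ?Deming gives the exact definition of the ratio):

    # Same fit, but assuming one error variance is twice the other (illustrative value)
    dem_reg2 <- Deming(x1, y1, vr = 2)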

[Figure: final plot with the Deming regression line]

You did not provide detailed information about your data, so this may or may not be appropriate.

