Standard errors for multivariate regression coefficients

I did multivariate regression using sklearn.linear_model.LinearRegression and got regression coefficients:

    import numpy as np
    from sklearn import linear_model

    clf = linear_model.LinearRegression()
    TST = np.vstack([x1, x2, x3, x4])
    TST = TST.transpose()
    clf.fit(TST, y)
    clf.coef_

Now I need the standard errors for these same coefficients. How can I do this? Thank you very much.

5 answers

Based on this stats question and Wikipedia, my best guess is:

    MSE = np.mean((y - clf.predict(TST).T)**2)
    var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T, TST)))
    SE_est = np.sqrt(var_est)

However, my linear algebra and statistics are pretty rusty, so I may be missing something important. Another option might be to bootstrap the variance estimate.
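A minimal sketch of that bootstrap alternative, using synthetic data as a stand-in for the original x1..x4 and y (the data and the helper name are assumptions, not from the original post):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
TST = rng.randn(100, 4)                       # stand-in for the original predictors
y = TST @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.randn(100)

def bootstrap_se(X, y, n_boot=500, seed=0):
    """Estimate coefficient standard errors by refitting on bootstrap resamples."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.randint(0, n, size=n)       # sample rows with replacement
        clf = linear_model.LinearRegression().fit(X[idx], y[idx])
        coefs[b] = clf.coef_
    # the spread of the refit coefficients approximates their standard error
    return coefs.std(axis=0, ddof=1)

SE_boot = bootstrap_se(TST, y)
```

The bootstrap makes no normality assumption about the errors, at the cost of refitting the model many times.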

    MSE = np.mean((y - clf.predict(TST).T)**2)
    var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T, TST)))
    SE_est = np.sqrt(var_est)

I think this answer is not entirely correct. In particular, if I'm not mistaken, sklearn adds a constant term (the intercept) by default when fitting the coefficients in your code.

You therefore need to include a column of ones in your TST matrix. With that change the code is correct and will give you an array with all the standard errors.
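A sketch of that fix on synthetic data (x1..x4 and y here are assumptions standing in for the original arrays; the formula is the one from the answer above, with the ones column added):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
x1, x2, x3, x4 = rng.randn(4, 50)
y = 2.0 + x1 - x2 + 0.5 * x3 + rng.randn(50) * 0.1

TST = np.vstack([x1, x2, x3, x4]).T
clf = linear_model.LinearRegression().fit(TST, y)

# prepend a column of ones so the design matrix matches the fitted intercept
X = np.column_stack([np.ones(TST.shape[0]), TST])
MSE = np.mean((y - clf.predict(TST)) ** 2)
SE_est = np.sqrt(MSE * np.diag(np.linalg.pinv(X.T @ X)))  # SEs for intercept + 4 slopes
```

SE_est now has one entry per column of X: the intercept's standard error first, then one per predictor.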


This code has been tested against real data and the results are correct.

Build the matrix X for the data set; n is the number of observations, m is the number of variables:

    X, n, m = arrays(data)
    y = ***.reshape((n, 1))
    linear = linear_model.LinearRegression()
    linear.fit(X, y, n_jobs=-1)  # delete n_jobs=-1 if there is only one variable

Residual sum of squares, divided by the error degrees of freedom:

    s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)

Standard errors: the square root of the diagonal of the variance-covariance matrix (computed via the pseudoinverse):

    sd_alpha = np.sqrt(s * np.diag(np.linalg.pinv(np.dot(X.T, X))))

t-statistic (for a single variable, use linear.intercept_):

    t_stat_alpha = linear.intercept_[0] / sd_alpha[0]  # use linear.intercept_ for one variable
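The same idea extends to t-statistics and two-sided p-values for every coefficient. A sketch on synthetic data (the data, and the use of scipy's t distribution, are assumptions not in the original answer):

```python
import numpy as np
from scipy import stats
from sklearn import linear_model

rng = np.random.RandomState(2)
n, m = 50, 3                                        # observations, predictors
X = np.column_stack([np.ones(n), rng.randn(n, m)])  # design matrix with intercept column
y = X @ np.array([1.0, 2.0, 0.0, -1.5]) + rng.randn(n) * 0.5

# fit_intercept=False because the intercept is already a column of X
linear = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

resid = y - linear.predict(X)
df = n - m - 1                                      # residual degrees of freedom
s = np.sum(resid ** 2) / df
sd = np.sqrt(s * np.diag(np.linalg.pinv(X.T @ X)))  # one SE per coefficient

t_stat = linear.coef_ / sd                          # t-statistic per coefficient
p_val = 2 * stats.t.sf(np.abs(t_stat), df)          # two-sided p-values
```

Folding the intercept into X keeps the coefficient, SE, t, and p arrays aligned index-for-index.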

I found that the accepted answer has a few bugs that would require edits beyond what the recommended etiquette allows for changing other people's posts. So here is a solution for computing the standard error estimates of the coefficients obtained with the linear model (using the unbiased estimator proposed here):

    # preparation
    X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
    y_hat = clf.predict(TST).T
    m, n = X.shape

    # computation
    MSE = np.sum((y_hat - y)**2) / (m - n)
    coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
    coef_SE_est = np.sqrt(coef_var_est)

Note that we must add a column of ones to TST, because the original post used linear_model.LinearRegression, which fits an intercept term. We also need to compute the mean squared error (MSE) as in ANOVA: that is, divide the sum of squared errors (SSE) by the error degrees of freedom, df_error = df_observations - df_features.
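For the single-predictor case, this can be sanity-checked against scipy.stats.linregress, which reports the slope's standard error directly (the synthetic data and the use of scipy are assumptions, not part of the original post):

```python
import numpy as np
from scipy import stats
from sklearn import linear_model

rng = np.random.RandomState(1)
x = rng.randn(60)
y = 1.0 + 2.0 * x + rng.randn(60) * 0.3

TST = x.reshape(-1, 1)
clf = linear_model.LinearRegression().fit(TST, y)

# same recipe as above: ones column, MSE with df correction, diag of pinv(X'X)
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
m, n = X.shape
MSE = np.sum((clf.predict(TST) - y) ** 2) / (m - n)
coef_SE_est = np.sqrt(MSE * np.diag(np.linalg.pinv(np.dot(X.T, X))))

res = stats.linregress(x, y)
# coef_SE_est[1] agrees with res.stderr, the slope SE reported by scipy
```

Both use the same n - 2 degrees of freedom, so the two slope standard errors match to numerical precision.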

The resulting coef_SE_est array contains the standard error estimates of the intercept and of all the other coefficients, in coef_SE_est[0] and coef_SE_est[1:] respectively. To print them you can use:

    print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
    for i, coef in enumerate(clf.coef_[0, :]):
        print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))

An example from the documentation shows how to get the mean squared error and the explained variance score:

    # Create linear regression object
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)

    # The coefficients
    print('Coefficients: \n', regr.coef_)

    # The mean squared error
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

Does it cover what you need?

