Standard errors for multivariate regression coefficients

I did multivariate regression using sklearn.linear_model.LinearRegression and got regression coefficients:

    import numpy as np
    from sklearn import linear_model

    clf = linear_model.LinearRegression()
    TST = np.vstack([x1, x2, x3, x4])
    TST = TST.transpose()
    clf.fit(TST, y)
    clf.coef_

Now I need the standard errors for these same coefficients. How can I do this? Thank you very much.

5 answers

Based on this stats question and Wikipedia, my best guess is:

    MSE = np.mean((y - clf.predict(TST).T)**2)
    var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T, TST)))
    SE_est = np.sqrt(var_est)

However, my linear algebra and statistics are pretty rusty, so I may be missing something important. Another option might be to bootstrap the variance estimate.
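A minimal sketch of that bootstrap alternative, using synthetic data as a stand-in for the original x1..x4 and y (the data and the helper name are assumptions, not from the original post):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
TST = rng.randn(100, 4)                       # stand-in for the original predictors
y = TST @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.randn(100)

def bootstrap_se(X, y, n_boot=500, seed=0):
    """Estimate coefficient standard errors by refitting on bootstrap resamples."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.randint(0, n, size=n)       # sample rows with replacement
        clf = linear_model.LinearRegression().fit(X[idx], y[idx])
        coefs[b] = clf.coef_
    # the spread of the refit coefficients approximates their standard error
    return coefs.std(axis=0, ddof=1)

SE_boot = bootstrap_se(TST, y)
```

The bootstrap makes no normality assumption about the errors, at the cost of refitting the model many times.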

    MSE = np.mean((y - clf.predict(TST).T)**2)
    var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T, TST)))
    SE_est = np.sqrt(var_est)

I think this answer is not entirely correct. In particular, if I'm not mistaken, sklearn adds a constant term (the intercept) by default when fitting the coefficients in your code.

You therefore need to include a column of ones in your TST matrix. With that change the code is correct and will give you an array with all the standard errors.
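A sketch of that fix on synthetic data (x1..x4 and y here are assumptions standing in for the original arrays; the formula is the one from the answer above, with the ones column added):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
x1, x2, x3, x4 = rng.randn(4, 50)
y = 2.0 + x1 - x2 + 0.5 * x3 + rng.randn(50) * 0.1

TST = np.vstack([x1, x2, x3, x4]).T
clf = linear_model.LinearRegression().fit(TST, y)

# prepend a column of ones so the design matrix matches the fitted intercept
X = np.column_stack([np.ones(TST.shape[0]), TST])
MSE = np.mean((y - clf.predict(TST)) ** 2)
SE_est = np.sqrt(MSE * np.diag(np.linalg.pinv(X.T @ X)))  # SEs for intercept + 4 slopes
```

SE_est now has one entry per column of X: the intercept's standard error first, then one per predictor.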


This code has been tested against real data and the results are correct.

Build the matrix X for the data set; n is the number of observations, m is the number of variables:

    X, n, m = arrays(data)
    y = ***.reshape((n, 1))
    linear = linear_model.LinearRegression()
    linear.fit(X, y, n_jobs=-1)  # delete n_jobs=-1 if there is only one variable

Residual sum of squares, divided by the error degrees of freedom:

    s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)

Standard errors: the square root of the diagonal of the variance-covariance matrix (computed via the pseudoinverse):

    sd_alpha = np.sqrt(s * np.diag(np.linalg.pinv(np.dot(X.T, X))))

t-statistic (for a single variable, use linear.intercept_):

    t_stat_alpha = linear.intercept_[0] / sd_alpha[0]  # use linear.intercept_ for one variable
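The same idea extends to t-statistics and two-sided p-values for every coefficient. A sketch on synthetic data (the data, and the use of scipy's t distribution, are assumptions not in the original answer):

```python
import numpy as np
from scipy import stats
from sklearn import linear_model

rng = np.random.RandomState(2)
n, m = 50, 3                                        # observations, predictors
X = np.column_stack([np.ones(n), rng.randn(n, m)])  # design matrix with intercept column
y = X @ np.array([1.0, 2.0, 0.0, -1.5]) + rng.randn(n) * 0.5

# fit_intercept=False because the intercept is already a column of X
linear = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

resid = y - linear.predict(X)
df = n - m - 1                                      # residual degrees of freedom
s = np.sum(resid ** 2) / df
sd = np.sqrt(s * np.diag(np.linalg.pinv(X.T @ X)))  # one SE per coefficient

t_stat = linear.coef_ / sd                          # t-statistic per coefficient
p_val = 2 * stats.t.sf(np.abs(t_stat), df)          # two-sided p-values
```

Folding the intercept into X keeps the coefficient, SE, t, and p arrays aligned index-for-index.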

I found that the accepted answer has a few bugs that would require edits beyond what the recommended etiquette allows for changing other people's posts. So here is a solution for computing the standard error estimates of the coefficients obtained with the linear model (using the unbiased estimator proposed here):

    # preparation
    X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
    y_hat = clf.predict(TST).T
    m, n = X.shape

    # computation
    MSE = np.sum((y_hat - y)**2) / (m - n)
    coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
    coef_SE_est = np.sqrt(coef_var_est)

Note that we must add a column of ones to TST, because the original post used linear_model.LinearRegression, which fits an intercept term. We also need to compute the mean squared error (MSE) as in ANOVA: that is, divide the sum of squared errors (SSE) by the error degrees of freedom, df_error = df_observations - df_features.
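For the single-predictor case, this can be sanity-checked against scipy.stats.linregress, which reports the slope's standard error directly (the synthetic data and the use of scipy are assumptions, not part of the original post):

```python
import numpy as np
from scipy import stats
from sklearn import linear_model

rng = np.random.RandomState(1)
x = rng.randn(60)
y = 1.0 + 2.0 * x + rng.randn(60) * 0.3

TST = x.reshape(-1, 1)
clf = linear_model.LinearRegression().fit(TST, y)

# same recipe as above: ones column, MSE with df correction, diag of pinv(X'X)
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
m, n = X.shape
MSE = np.sum((clf.predict(TST) - y) ** 2) / (m - n)
coef_SE_est = np.sqrt(MSE * np.diag(np.linalg.pinv(np.dot(X.T, X))))

res = stats.linregress(x, y)
# coef_SE_est[1] agrees with res.stderr, the slope SE reported by scipy
```

Both use the same n - 2 degrees of freedom, so the two slope standard errors match to numerical precision.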

The resulting coef_SE_est array contains the standard error estimates of the intercept and of all the other coefficients, in coef_SE_est[0] and coef_SE_est[1:] respectively. To print them you can use:

    print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
    for i, coef in enumerate(clf.coef_[0, :]):
        print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))

An example from the documentation shows how to get the mean squared error and the explained variance score:

    # Create linear regression object
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)

    # The coefficients
    print('Coefficients: \n', regr.coef_)

    # The mean squared error
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

Does it cover what you need?

