How do I calculate r-squared using Python and Numpy?

I am using Python and Numpy to calculate a best-fit polynomial of arbitrary degree. I pass in a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This works well, but I also want to calculate r (the correlation coefficient) and r-squared (the coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. From this, I know I am calculating r-squared correctly for the linear best fit (degree equal to 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here is my function:

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        correlation = numpy.corrcoef(x, y)[0, 1]

        # r
        results['correlation'] = correlation
        # r-squared
        results['determination'] = correlation**2

        return results
+82
python math numpy statistics curve-fitting
May 21 '09 at 15:55
8 answers

From the numpy.polyfit documentation, it is fitting a linear regression. Specifically, numpy.polyfit with degree "d" fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x**(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R-squared for that fit. The wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

    SST = Sum(i=1..n) (y_i - y_bar)^2
    SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
    Rsquared = SSReg/SST

Where I use "y_bar" for the mean of the y's, and "y_ihat" for the fitted value of each point.

I am not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct.

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        # r-squared
        p = numpy.poly1d(coeffs)

        # fit values, and mean
        yhat = p(x)                          # or [p(z) for z in x]
        ybar = numpy.sum(y)/len(y)           # or sum(y)/len(y)
        ssreg = numpy.sum((yhat-ybar)**2)    # or sum([ (yihat - ybar)**2 for yihat in yhat])
        sstot = numpy.sum((y - ybar)**2)     # or sum([ (yi - ybar)**2 for yi in y])
        results['determination'] = ssreg / sstot

        return results
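
For instance, a quick sanity check of this function might look like the following (a hypothetical usage sketch; the quadratic and the noise level below are invented):

    # Hypothetical usage sketch: fit a noisy quadratic and inspect the returned dict
    x = numpy.linspace(0, 10, 50)
    y = 3 * x**2 - 2 * x + 1 + numpy.random.normal(0, 5, 50)
    print(polyfit(x, y, 2))
    # expect something like {'polynomial': [...], 'determination': 0.99...}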
+54
May 21 '09 at 20:48

A very late answer, but in case someone needs a ready-made function for this:

scipy.stats.linregress

i.e.

 slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y) 

as in Adam Marple's answer.
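
For example, a minimal sketch on invented sample data (squaring r_value gives the r-squared of the linear fit):

    import scipy.stats

    # Hypothetical sample data, purely for illustration
    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 8.0, 9.8]

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    print(r_value**2)  # r-squared of the linear fit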

+108
04 Oct '09 at 21:15

From yanl (yet-another-library), sklearn.metrics provides an r2_score function:

    from sklearn.metrics import r2_score

    coefficient_of_determination = r2_score(y, p(x))
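
The p here is assumed to be an already fitted polynomial (e.g. a numpy.poly1d object); a minimal self-contained sketch, with invented sample data, might look like:

    import numpy as np
    from sklearn.metrics import r2_score

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])

    coeffs = np.polyfit(x, y, 2)   # quadratic fit
    p = np.poly1d(coeffs)          # callable polynomial
    print(r2_score(y, p(x)))       # coefficient of determination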
+44
Jun 12 '15 at 13:41

I have been using this successfully, where x and y are array-like.

    import scipy.stats

    def rsquared(x, y):
        """Return R^2 where x and y are array-like."""
        slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
        return r_value**2
+20
Dec 17 '12 at 16:37

I originally posted the benchmarks below to recommend numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher-order polynomial fits. I have added an actual solution to the polynomial r-squared question using statsmodels, and I have left the original benchmarks, which, while off-topic, are potentially useful to someone.




statsmodels has the ability to calculate the r^2 of a polynomial fit directly; here are 2 methods...

    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Construct the columns for the different powers of x
    def get_r2_statsmodels(x, y, k=1):
        xpoly = np.column_stack([x**i for i in range(k+1)])
        return sm.OLS(y, xpoly).fit().rsquared

    # Use the formula API and construct a formula describing the polynomial
    def get_r2_statsmodels_formula(x, y, k=1):
        formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
        data = {'x': x, 'y': y}
        return smf.ols(formula, data).fit().rsquared  # or rsquared_adj
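
A possible usage sketch of the two helpers above (the quadratic data are invented for illustration):

    # Hypothetical usage sketch
    x = np.linspace(0, 5, 100)
    y = 2 * x**2 + x + np.random.normal(0, 1, 100)
    print(get_r2_statsmodels(x, y, k=2))
    print(get_r2_statsmodels_formula(x, y, k=2))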

To further take advantage of statsmodels, you should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in a Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

    model = sm.OLS(y, xpoly)
    results = model.fit()
    results.summary()



Below is my original answer, where I benchmarked different methods of calculating r^2 for linear regression...

The corrcoef function used in the question calculates the correlation coefficient, r, only for a single linear regression, so it does not address the question of r^2 for higher-order polynomial fits. However, for what it's worth, I have come to find that for linear regression it is indeed the fastest and most direct method of calculating r.

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2

These were my results from comparing several methods for 1000 random (x, y) points:

  • Pure Python (direct calculation of r)
    • 1000 loops, best of 3: 1.59 ms per loop
  • Numpy polyfit (applicable to n-th degree polynomial fits)
    • 1000 loops, best of 3: 326 μs per loop
  • Numpy Manual (direct calculation of r)
    • 10,000 loops, best of 3: 62.1 μs per loop
  • Numpy corrcoef (direct calculation of r)
    • 10,000 loops, best of 3: 56.6 μs per loop
  • Scipy (linear regression with r as an output)
    • 1000 loops, best of 3: 676 μs per loop
  • Statsmodels (can do n-th degree polynomial and many other fits)
    • 1000 loops, best of 3: 422 μs per loop

The corrcoef method narrowly beats calculating r^2 manually with numpy methods. It is more than 5x faster than the polyfit method and about 12x faster than scipy.linregress. Just to reinforce what numpy is doing for you, it is 28x faster than pure Python. I am not well versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here is my benchmarking code. I copy-pasted it from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke along the way. The %timeit magic command requires IPython.

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    import math

    n = 1000
    x = np.random.rand(1000)*10
    x.sort()
    y = 10 * x + (5+np.random.randn(1000)*10-5)

    x_list = list(x)
    y_list = list(y)

    def get_r2_numpy(x, y):
        slope, intercept = np.polyfit(x, y, 1)
        r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
        return r_squared

    def get_r2_scipy(x, y):
        _, _, r_value, _, _ = stats.linregress(x, y)
        return r_value**2

    def get_r2_statsmodels(x, y):
        return sm.OLS(y, sm.add_constant(x)).fit().rsquared

    def get_r2_python(x_list, y_list):
        n = len(x)
        x_bar = sum(x_list)/n
        y_bar = sum(y_list)/n
        x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
        y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
        zx = [(xi-x_bar)/x_std for xi in x_list]
        zy = [(yi-y_bar)/y_std for yi in y_list]
        r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
        return r**2

    def get_r2_numpy_manual(x, y):
        zx = (x-np.mean(x))/np.std(x, ddof=1)
        zy = (y-np.mean(y))/np.std(y, ddof=1)
        r = np.sum(zx*zy)/(len(x)-1)
        return r**2

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2

    print('Python')
    %timeit get_r2_python(x_list, y_list)
    print('Numpy polyfit')
    %timeit get_r2_numpy(x, y)
    print('Numpy Manual')
    %timeit get_r2_numpy_manual(x, y)
    print('Numpy corrcoef')
    %timeit get_r2_numpy_corrcoef(x, y)
    print('Scipy')
    %timeit get_r2_scipy(x, y)
    print('Statsmodels')
    %timeit get_r2_statsmodels(x, y)
+16
Jan 05 '16 at 17:21

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.

+5
Aug 25 '09 at 6:06

Here is a function to compute a weighted r-square with Python and Numpy (most of the code comes from sklearn):

    from __future__ import division
    import numpy as np

    def compute_r2_weighted(y_true, y_pred, weight):
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

Example:

    from __future__ import print_function, division
    import numpy as np
    import sklearn.metrics

    def compute_r2_weighted(y_true, y_pred, weight):
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

    def compute_r2(y_true, y_predicted):
        sse = sum((y_true - y_predicted)**2)
        tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

    def main():
        '''
        Demonstrate the use of compute_r2_weighted() and check the results against sklearn
        '''
        y_true = [3, -0.5, 2, 7]
        y_pred = [2.5, 0.0, 2, 8]
        weight = [1, 5, 1, 2]
        r2_score = sklearn.metrics.r2_score(y_true, y_pred)
        print('r2_score: {0}'.format(r2_score))
        r2_score, _, _ = compute_r2(np.array(y_true), np.array(y_pred))
        print('r2_score: {0}'.format(r2_score))
        r2_score = sklearn.metrics.r2_score(y_true, y_pred, weight)
        print('r2_score weighted: {0}'.format(r2_score))
        r2_score, _, _ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
        print('r2_score weighted: {0}'.format(r2_score))

    if __name__ == "__main__":
        main()
        # cProfile.run('main()')  # if you want to do some profiling

outputs:

    r2_score: 0.9486081370449679
    r2_score: 0.9486081370449679
    r2_score weighted: 0.9573170731707317
    r2_score weighted: 0.9573170731707317

This corresponds to the formula ( mirror ):

r^2 = 1 - SSE / SST,   where   SSE = \sum_i w_i (y_i - f_i)^2   and   SST = \sum_i w_i (y_i - y_av)^2

where f_i is the predicted value from the fit, y_av is the mean of the observed data, y_i is the observed data value, and w_i is the weighting applied to each data point (usually w_i = 1). SSE is the sum of squares due to error and SST is the total sum of squares.




If you're interested, the code in R is: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f ( mirror )

+5
Aug 07 '17 at 0:55

R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

\sum_{i} (y_{i} - y_bar)^2

where y_bar is the average value of y.

Then you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean:

\sum_{i} (yHat_{i} - y_bar)^2

and find the ratio of the two.

Now, all you need to do for a polynomial fit is plug in the y_hat's from that model, but it is not accurate to call that r-squared.
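
As a rough sketch of that ratio in code (assuming numpy arrays of observed and fitted values; the helper name is mine, not from any library):

    import numpy as np

    def explained_variance_ratio(y, yhat):
        # SSReg / SST, using the fitted values from whatever polynomial was fit;
        # interpret with care for non-linear models, as noted above
        ybar = np.mean(y)
        ssreg = np.sum((yhat - ybar)**2)
        sstot = np.sum((y - ybar)**2)
        return ssreg / sstot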

Here is a link I found that speaks to it a little.

+4
May 21 '09 at 16:54


