I originally posted the benchmarks below to recommend numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about r-squared for higher-order polynomial fits. I've added the actual solution to the polynomial r-squared question using statsmodels, and left the original benchmarks, which, while off-topic, may be useful to someone.
statsmodels has the ability to calculate the r^2 of a polynomial fit directly; here are 2 methods...
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
```
To take full advantage of statsmodels, you should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in a Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.
```python
model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()
```
Below is my original answer, where I compared various linear regression methods of computing r^2...
The corrcoef function used in the question calculates the correlation coefficient r only for a single linear regression, so it doesn't address the question of r^2 for higher-order polynomial fits. However, for what it's worth, I've found that for linear regression it is indeed the fastest and most direct method of calculating r.
```python
def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2
```
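For higher-order polynomials, where corrcoef does not apply, the usual alternative is the residual-based definition of r^2 (1 - SS_res / SS_tot). A sketch using numpy.polyfit, not part of the original benchmark:

```python
import numpy as np

def polyfit_r2(x, y, degree):
    """r^2 of an n-th degree polynomial fit, via 1 - SS_res / SS_tot."""
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)           # model predictions
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```

For degree 1 this agrees exactly with corrcoef squared, which is a handy sanity check.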
These were my results from comparing several methods for 1000 random (x, y) points:
- Pure Python (direct calculation of r) - 1000 loops, best of 3: 1.59 ms per loop
- Numpy polyfit (applicable to n-th degree polynomial fits) - 1000 loops, best of 3: 326 μs per loop
- Numpy Manual (direct calculation of r) - 10,000 loops, best of 3: 62.1 μs per loop
- Numpy corrcoef (direct calculation of r) - 10,000 loops, best of 3: 56.6 μs per loop
- Scipy (linear regression with r as an output) - 1000 loops, best of 3: 676 μs per loop
- Statsmodels (can do n-th degree polynomial and many other fits) - 1000 loops, best of 3: 422 μs per loop
The corrcoef method narrowly beats calculating r^2 manually with numpy. It is 5x faster than the polyfit method and 12x faster than scipy.linregress. Just to reinforce what numpy is doing for you, it is 28x faster than pure Python. I'm not well versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing that corrcoef is the best tool for calculating r for a simple linear regression.
Here is my benchmarking code. I copy-pasted it from a Jupyter notebook (hard not to call it an IPython notebook...), so I apologize if anything broke along the way. The %timeit magic command requires IPython.
```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
import math

n = 1000
x = np.random.rand(1000) * 10
x.sort()
y = 10 * x + (5 + np.random.randn(1000) * 10 - 5)
x_list = list(x)
y_list = list(y)

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = 1 - (sum((y - (slope * x + intercept))**2)
                     / ((len(y) - 1) * np.var(y, ddof=1)))
    return r_squared

def get_r2_scipy(x, y):
    _, _, r_value, _, _ = stats.linregress(x, y)
    return r_value**2

def get_r2_statsmodels(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared

def get_r2_python(x_list, y_list):
    n = len(x_list)  # was len(x), which silently used the global x
    x_bar = sum(x_list) / n
    y_bar = sum(y_list) / n
    x_std = math.sqrt(sum([(xi - x_bar)**2 for xi in x_list]) / (n - 1))
    y_std = math.sqrt(sum([(yi - y_bar)**2 for yi in y_list]) / (n - 1))
    zx = [(xi - x_bar) / x_std for xi in x_list]
    zy = [(yi - y_bar) / y_std for yi in y_list]
    r = sum(zxi * zyi for zxi, zyi in zip(zx, zy)) / (n - 1)
    return r**2

def get_r2_numpy_manual(x, y):
    zx = (x - np.mean(x)) / np.std(x, ddof=1)
    zy = (y - np.mean(y)) / np.std(y, ddof=1)
    r = np.sum(zx * zy) / (len(x) - 1)
    return r**2

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)
```
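If you want to reproduce the timings outside IPython, the standard-library timeit module can stand in for %timeit. A sketch using the corrcoef variant as an example (loop counts are mine and can be adjusted):

```python
import timeit
import numpy as np

x = np.random.rand(1000) * 10
y = 10 * x + np.random.randn(1000) * 10

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

# timeit.repeat mimics %timeit: take the best of 3 runs,
# each timing n_loops calls, then report the per-call time.
n_loops = 1000
best = min(timeit.repeat(lambda: get_r2_numpy_corrcoef(x, y),
                         repeat=3, number=n_loops)) / n_loops
print('{:.1f} us per loop'.format(best * 1e6))
```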