Python Multiple Simple Linear Regression

Please note that this is not a multiple-regression question: it is about repeating a simple (single-variable) regression many times in Python / NumPy (2.7).

I have two m×n arrays, x and y. The rows correspond to each other, and each row pair is one set of (x, y) measurement points. That is, plt.plot(x.T, y.T, '.') will display each of the m datasets/rows.

I am wondering what the best way is to perform the m linear regressions. Currently I loop over the rows and call scipy.stats.linregress() on each. (Assume I do not want solutions based on matrix linear algebra, but instead want to work with this function or an equivalent black-box function.) I could try np.vectorize, but the docs point out that it also just loops.

While experimenting, I also found a way to use map() over lists of rows and get the correct results. I have put both solutions below. IPython's %%timeit reports the following on the small dataset (the one commented out in the code):

    (loop) 1000 loops, best of 3: 642 µs per loop
    (map)  1000 loops, best of 3: 634 µs per loop

To stress this further, I built a much larger random dataset (trials × trials in size):

    (loop, trials = 1000)   1 loops, best of 3: 299 ms per loop
    (loop, trials = 10000)  1 loops, best of 3: 5.64 s per loop
    (map, trials = 1000)    1 loops, best of 3: 256 ms per loop
    (map, trials = 10000)   1 loops, best of 3: 2.37 s per loop

That is a decent speedup on a really large set, but I expected a bit more. Is there a better way?

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats

    np.random.seed(42)
    #y = np.array(((0,1,2,3),(1,2,3,4),(2,4,6,8)))
    #x = np.tile(np.arange(4), (3,1))
    trials = 1000
    y = np.random.rand(trials, trials)
    x = np.tile(np.arange(trials), (trials, 1))
    num_rows = y.shape[0]   # was shape(y)[0], which is undefined without a star import
    slope = np.zeros(num_rows)
    inter = np.zeros(num_rows)

    # Method 1: explicit loop over rows
    for k, xrow in enumerate(x):
        yrow = y[k, :]
        slope[k], inter[k], t1, t2, t3 = stats.linregress(xrow, yrow)

    #plt.plot(x.T, y.T, '.')
    #plt.plot(x.T, x.T*slope + inter)

    # Method 2: can the loop be removed with map()?
    tempx = [x[k, :] for k in range(num_rows)]
    tempy = [y[k, :] for k in range(num_rows)]
    results = np.array(map(stats.linregress, tempx, tempy))
    slope_vec = results[:, 0]
    inter_vec = results[:, 1]

    #plt.plot(x.T, y.T, '.')
    #plt.plot(x.T, x.T*slope_vec + inter_vec)

    print "Slopes equal by both methods?: ", np.allclose(slope, slope_vec)
    print "Inters equal by both methods?: ", np.allclose(inter, inter_vec)
1 answer

Linear regression with a single variable is simple enough to vectorize manually: for each row, the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means:

    def multiple_linregress(x, y):
        x_mean = np.mean(x, axis=1, keepdims=True)
        x_norm = x - x_mean
        y_mean = np.mean(y, axis=1, keepdims=True)
        y_norm = y - y_mean
        slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
                 np.einsum('ij,ij->i', x_norm, x_norm))
        intercept = y_mean[:, 0] - slope * x_mean[:, 0]
        return np.column_stack((slope, intercept))
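As a quick sanity check, here is a self-contained sketch (restating the vectorized formula above, with small made-up random data) that confirms the result matches per-row stats.linregress:

```python
import numpy as np
import scipy.stats as stats

def multiple_linregress(x, y):
    # Vectorized least squares per row: slope = cov(x, y) / var(x)
    x_mean = np.mean(x, axis=1, keepdims=True)
    x_norm = x - x_mean
    y_mean = np.mean(y, axis=1, keepdims=True)
    y_norm = y - y_mean
    slope = (np.einsum('ij,ij->i', x_norm, y_norm) /
             np.einsum('ij,ij->i', x_norm, x_norm))
    intercept = y_mean[:, 0] - slope * x_mean[:, 0]
    return np.column_stack((slope, intercept))

np.random.seed(0)
x = np.random.rand(5, 20)
y = np.random.rand(5, 20)
res = multiple_linregress(x, y)       # shape (5, 2): slope, intercept per row

# Compare against scipy row by row
for k in range(5):
    s, i = stats.linregress(x[k], y[k])[:2]
    assert np.allclose(res[k], [s, i])
print("vectorized results match stats.linregress")
```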

With some random data:

    m = 1000
    n = 1000
    x = np.random.rand(m, n)
    y = np.random.rand(m, n)

it beats your looping options by a fair margin:

    %timeit multiple_linregress(x, y)
    100 loops, best of 3: 14.1 ms per loop
