Scikit-learn: parallel stochastic gradient descent

I have a fairly large training matrix (more than 1 billion rows, two features per row). There are two classes (0 and 1). This is too large for a single machine, but fortunately I have about 200 MPI hosts at my disposal. Each is a modest dual-core workstation.

Feature generation has already been successfully distributed.

The answers to Multiprocessing scikit-learn indicate that the work of an SGDClassifier can be spread out:

You can distribute the data sets across the cores, do partial_fit, get the weight vectors, average them, distribute them back to the estimators, and do partial_fit again.
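As a rough single-machine illustration of that recipe (a minimal sketch only; the synthetic data, the number of chunks, and the chunk loop are assumptions for the example, not part of the original answer, and the real MPI version follows below):

```python
import numpy as np
from sklearn import linear_model

# Synthetic stand-in for the real data: 2 features, binary target.
rng = np.random.RandomState(0)
X = rng.randn(10000, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_chunks = 4
estimators = [linear_model.SGDClassifier() for _ in range(n_chunks)]

# One partial_fit per chunk, as if each chunk lived on a different worker.
for est, X_chunk, y_chunk in zip(estimators,
                                 np.array_split(X, n_chunks),
                                 np.array_split(y, n_chunks)):
    est.partial_fit(X_chunk, y_chunk, classes=[0, 1])

# Average the weight vectors and intercepts across "workers".
avg_coef = np.mean([est.coef_ for est in estimators], axis=0)
avg_intercept = np.mean([est.intercept_ for est in estimators], axis=0)

# Push the averaged parameters back into each estimator, then partial_fit again.
for est, X_chunk, y_chunk in zip(estimators,
                                 np.array_split(X, n_chunks),
                                 np.array_split(y, n_chunks)):
    est.coef_ = avg_coef.copy()
    est.intercept_ = avg_intercept.copy()
    est.partial_fit(X_chunk, y_chunk)
```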

Once I have run partial_fit a second time on each estimator, where do I go from there to get a final aggregate estimator?

My best guess was to average the coefficients and intercepts again and build an estimator with those values. The resulting estimator gives different results than an estimator built with fit() on all of the data.

More details

Each host generates a local matrix and a local vector. These are n rows of the training set and the corresponding n target values.

Each host uses its local matrix and local vector to make an SGDClassifier and do a partial fit. Each then sends its coef vector and intercept to root. Root averages these values and sends them back to the hosts. The hosts do another partial_fit and send the coef vector and intercept back to root.

Root constructs a new estimator with these values.

import numpy as np
from mpi4py import MPI
from sklearn import linear_model

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each host builds its local slice of the training data.
local_matrix = get_local_matrix()
local_vector = get_local_vector()

# First local pass of SGD.
estimator = linear_model.SGDClassifier()
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

average_coefs = None
avg_intercept = None

# Send every host's coefficients and intercept to root, which averages them.
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])

# Broadcast the averaged parameters back to every host.
estimator.coef_ = comm.bcast(average_coefs, root=0)
estimator.intercept_ = np.array([comm.bcast(avg_intercept, root=0)])

# Second local pass of SGD, starting from the averaged parameters.
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

# Gather and average once more; root keeps the final aggregate estimator.
if rank > 0:
    comm.send((estimator.coef_, estimator.intercept_), dest=0, tag=rank)
else:
    pairs = [comm.recv(source=r, tag=r) for r in range(1, size)]
    pairs.append((estimator.coef_, estimator.intercept_))
    average_coefs = np.average([a[0] for a in pairs], axis=0)
    avg_intercept = np.average([a[1][0] for a in pairs])
    estimator.coef_ = average_coefs
    estimator.intercept_ = np.array([avg_intercept])
    print("The estimator at rank 0 should now be working")
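As a side note on the communication pattern: the same averaging can be done without explicit send/recv loops by letting MPI reduce the arrays directly. A minimal sketch of that alternative, assuming the same comm, size, and already-fitted estimator as above (this is not part of the original question's code):

```python
import numpy as np
from mpi4py import MPI

# Element-wise sum of every rank's parameters, delivered to all ranks at once.
coef_sum = np.zeros_like(estimator.coef_)
intercept_sum = np.zeros_like(estimator.intercept_)
comm.Allreduce(estimator.coef_, coef_sum, op=MPI.SUM)
comm.Allreduce(estimator.intercept_, intercept_sum, op=MPI.SUM)

# Every rank now holds the same averaged parameters; no separate bcast needed.
estimator.coef_ = coef_sum / size
estimator.intercept_ = intercept_sum / size
```

This avoids pickling Python tuples and keeps all ranks in sync with a single collective call per array.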

Thanks!

2 answers

Training a linear model on a dataset with 1e9 samples and two features is very likely to underfit, or to waste CPU/IO time if the data happens to be linearly separable. Don't waste time thinking about parallelizing such a problem with a linear model:

  • either switch to a more complex class of models (for example, train random forests on smaller partitions of the data that fit in memory and aggregate them),

  • or take random subsamples of your dataset of increasing size and train linear models on them. Measure prediction accuracy on a held-out test set and stop when you see diminishing returns (probably after a couple of tens of thousands of samples of the minority class); a sketch of this subsampling approach follows the list.
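A minimal sketch of the subsampling idea from the second bullet (the synthetic data, the train/test split, and the particular sample sizes are placeholders, not part of the original answer; in practice X and y would be drawn from the full 1e9-row dataset):

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Placeholder data: 2 features, binary target, with a held-out test set.
rng = np.random.RandomState(0)
X = rng.randn(200000, 2)
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(int)
X_test, y_test = X[-50000:], y[-50000:]
X, y = X[:-50000], y[:-50000]

# Train on subsamples of increasing size and watch for diminishing returns.
for n in [1000, 5000, 20000, 50000, 100000]:
    idx = rng.choice(len(X), size=n, replace=False)
    clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)
    clf.fit(X[idx], y[idx])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"n={n:7d}  test accuracy={acc:.4f}")
```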


What you are experiencing is normal and expected. First, the fact that you are using SGD means you will never get an exact result. You converge quickly toward the optimal solution (since this is a convex problem) and then hover around that region for the rest of training. Different runs, even on the entire dataset, should give slightly different results each time.

"Where can I go from there to get a final aggregate estimator?"

In theory, you would just keep doing this over and over until you are happy with the convergence. That is totally unnecessary for what you are doing. Other systems switch to more sophisticated methods (for example, L-BFGS) to converge to a final solution once they have a good "warm start". However, that won't buy you any dramatic gains in accuracy (maybe a whole percentage point if you are lucky), so don't treat it as make-or-break. Treat it for what it is: fine-tuning.
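For illustration, the "keep doing this over and over" loop could look like the sketch below. It reuses the get_local_matrix/get_local_vector names from the question's code; the fixed number of rounds, the averaging helper, and the use of Allreduce are assumptions for the example, not anything prescribed by the answer:

```python
import numpy as np
from mpi4py import MPI
from sklearn import linear_model

def average_params(comm, size, estimator):
    """Average coef_ and intercept_ across all MPI ranks (sketch)."""
    coef_sum = np.zeros_like(estimator.coef_)
    intercept_sum = np.zeros_like(estimator.intercept_)
    comm.Allreduce(estimator.coef_, coef_sum, op=MPI.SUM)
    comm.Allreduce(estimator.intercept_, intercept_sum, op=MPI.SUM)
    return coef_sum / size, intercept_sum / size

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Local data, as in the question's code.
local_matrix = get_local_matrix()
local_vector = get_local_vector()

estimator = linear_model.SGDClassifier()
estimator.partial_fit(local_matrix, local_vector, classes=[0, 1])

# Alternate local SGD passes with parameter averaging for a fixed number
# of rounds (an assumption; in practice you would monitor convergence).
n_rounds = 10
for _ in range(n_rounds):
    avg_coef, avg_intercept = average_params(comm, size, estimator)
    estimator.coef_ = avg_coef
    estimator.intercept_ = avg_intercept
    estimator.partial_fit(local_matrix, local_vector)

# Average once more so every rank holds the final aggregate parameters.
estimator.coef_, estimator.intercept_ = average_params(comm, size, estimator)
```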

Second, linear models do not parallelize well. Despite the claims of vowpalwabbit and some other libraries, you are not going to get linear scaling from training a linear model in parallel. Simply averaging intermediate results is a poor way to parallelize such a system, and unfortunately that is about as good as it gets for training linear models in parallel.

The point is that you only have 2 features. You should be able to easily train far more complex models using only a small fraction of your data. 1 billion rows is overkill for just 2 features.

