Speeding up sklearn LogisticRegression

I have a model that I am trying to build using LogisticRegression in sklearn, with a couple of thousand features and approximately 60,000 samples. Fitting the model takes about 10 minutes. The machine I'm running on has many gigabytes of RAM and several cores at its disposal, and I was wondering if there is a way to speed up the process.

EDIT: The machine has 24 cores. Here is the output of top, to give an idea of the memory:

 Processes: 94 total, 8 running, 3 stuck, 83 sleeping, 583 threads   20:10:19
 Load Avg: 1.49, 1.25, 1.19
 CPU usage: 4.34% user, 0.68% sys, 94.96% idle
 SharedLibs: 1552K resident, 0B data, 0B linkedit.
 MemRegions: 51959 total, 53G resident, 46M private, 676M shared.
 PhysMem: 3804M wired, 57G active, 1042M inactive, 62G used, 34G free.
 VM: 350G vsize, 1092M framework vsize, 52556024(0) pageins, 85585722(0) pageouts
 Networks: packets: 172806918/25G in, 27748484/7668M out.
 Disks: 14763149/306G read, 26390627/1017G written.

I train the model as follows:

 from sklearn.linear_model import LogisticRegression

 classifier = LogisticRegression(C=1.0, class_weight='auto')
 classifier.fit(train, response)

train has rows roughly 3,000 entries long (all floating point values), and each entry in response is either 0 or 1. I have about 50,000 observations.
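For anyone who wants to reproduce the timing, here is a minimal sketch with synthetic data matching the shapes described above (the random data is purely hypothetical):

 import numpy as np
 from sklearn.linear_model import LogisticRegression

 # Hypothetical stand-in data matching the shapes in the question
 train = np.random.rand(50000, 3000)         # ~50,000 rows of ~3,000 floats
 response = np.random.randint(0, 2, 50000)   # binary 0/1 labels

 # note: class_weight='auto' from the question was renamed to
 # 'balanced' in later scikit-learn versions
 classifier = LogisticRegression(C=1.0)
 classifier.fit(train, response)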

+6
4 answers

UPDATE - 2017:

In current versions of scikit-learn, LogisticRegression() now has an n_jobs parameter for using multiple cores.

However, the actual text of the user guide suggests that multiple cores are still only used for the second half of the computation. As of this update, the revised user guide for LogisticRegression says that n_jobs selects "the number of CPU cores used during the cross-validation loop", while the other two items cited in the original answer, RandomForestClassifier() and RandomForestRegressor(), both state that n_jobs specifies "the number of jobs to run in parallel for both fit and predict". In other words, the deliberate contrast in wording here seems to indicate that the n_jobs parameter in LogisticRegression(), while now implemented, is not actually implemented as completely, or in the same way, as in the other two cases.

So, while it is now possible to speed up LogisticRegression() somewhat by using multiple cores, my guess is that the speedup probably won't be anywhere near linear in the number of cores used, since it sounds like the initial "fit" step (the first half of the algorithm) may not lend itself well to parallelization.
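For what it's worth, enabling it is just a constructor argument. A minimal sketch, assuming train and response exist as in the question:

 from sklearn.linear_model import LogisticRegression

 # n_jobs=-1 asks scikit-learn to use all available cores; how much of
 # the fit is actually parallelized depends on the scikit-learn version
 classifier = LogisticRegression(C=1.0, n_jobs=-1)
 classifier.fit(train, response)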


Original answer:

In my opinion, the main problem here is not memory, but the fact that you are only using one core. According to your top output, you are loading the system at 4.34%. If your logistic regression process is monopolizing 1 core out of 24, that works out to 100/24 = 4.167%. Presumably the remaining 0.17% accounts for whatever other processes you are also running on the machine, which are allowed that extra 0.17% because they are scheduled by the system to run in parallel on a second, different core.

If you look at the scikit-learn API, you will see that some of the ensemble methods, such as RandomForestClassifier() or RandomForestRegressor(), have an input parameter n_jobs which directly controls the number of cores on which the package will attempt to run in parallel. The class you are using, LogisticRegression(), does not define this input. The scikit-learn designers seem to have created an interface that is generally quite consistent between classes, so if a particular input parameter is not defined for a given class, it probably means that the developers simply could not find a way to implement it meaningfully for that class. It may be that the logistic regression algorithm simply doesn't lend itself well to parallelization; i.e., the potential speedup that could have been achieved wasn't good enough to justify implementing it with a parallel architecture.

Assuming that is the case, then no, there is not much you can do to speed up your code. 24 cores won't help you if the underlying library functions simply weren't designed to take advantage of them.
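If parallelism matters more to you than the specific model, one option (my suggestion, not something from the question) is to try an ensemble method that does expose n_jobs. A minimal sketch, again assuming train and response as in the question:

 from sklearn.ensemble import RandomForestClassifier

 # n_jobs=-1 builds the individual trees in parallel on all available cores
 forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
 forest.fit(train, response)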

+9

Try reducing the size of the data set and loosening the tolerance parameter. For example, you can try classifier = LogisticRegression(tol=0.1)
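Putting both suggestions together, a minimal sketch (the 50% subsample is an arbitrary choice, and train/response are assumed to be NumPy arrays):

 import numpy as np
 from sklearn.linear_model import LogisticRegression

 # Fit on a random half of the rows with a looser convergence tolerance;
 # a larger tol lets the solver stop earlier, trading accuracy for speed
 idx = np.random.choice(len(train), size=len(train) // 2, replace=False)
 classifier = LogisticRegression(tol=0.1)
 classifier.fit(train[idx], response[idx])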

+6

It is worth noting that LogisticRegression() now accepts n_jobs as input, defaulting to 1.

I would have commented on the accepted answer, but I don't have enough points.

+3

The default solver for LogisticRegression in sklearn is liblinear, which is a fine solver for ordinary datasets. For large datasets, use a stochastic gradient solver such as sag (stochastic average gradient):

 model = LogisticRegression(solver='sag') 
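One caveat worth adding (my note, not part of the original answer): sag converges quickly only when the features are on roughly the same scale, so it usually pays to standardize first, e.g.:

 from sklearn.linear_model import LogisticRegression
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler

 # Standardizing the features helps the sag solver converge in fewer passes;
 # max_iter is raised because the default can be too low for large datasets
 model = make_pipeline(StandardScaler(),
                       LogisticRegression(solver='sag', max_iter=1000))
 model.fit(train, response)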
0
