Scikit-learn: general question about parallel computing

I would like to use sklearn.grid_search.GridSearchCV() with multiple processors in parallel. This is my first time doing this, and my initial tests show that it works.

I am trying to understand this part of the documentation:

n_jobs : int, default 1

Number of jobs to run in parallel.

pre_dispatch : int or string, optional

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

Can someone break this down for me? I find it hard to understand the difference between n_jobs and pre_dispatch. If I set n_jobs=5 and pre_dispatch=2, how does this differ from setting n_jobs=2?

2 answers

Suppose you are running KNN and must choose among k = [1, 2, 3, 4, 5, ... 1000]. Even if you set n_jobs=2, GridSearchCV will first create 1000 jobs, each with one choice of your k, and also make 1000 copies of your data (which may blow up your memory if your data is large), and only then feed those 1000 jobs to 2 processors (most jobs will, of course, have to wait). GridSearchCV does not just spawn 2 jobs for the two processors, because spawning processes on demand is expensive. Instead it immediately creates as many jobs as there are parameter combinations (1000 in this case). In this sense, the name n_jobs can be misleading. With pre_dispatch you can set how many of these jobs are pre-dispatched (created up front).
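A minimal sketch of the idea on a small grid (using the modern sklearn.model_selection import path; the iris dataset and the four candidate values of k are just illustrative choices, not from the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 4 candidate values of k, cv=3 -> 12 fit jobs in total
param_grid = {"n_neighbors": [1, 3, 5, 7]}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=3,
    n_jobs=2,        # at most 2 worker processes run at any one time
    pre_dispatch=2,  # only 2 jobs (and data copies) are queued at a time
)
search.fit(X, y)
print(search.best_params_)
```

Here n_jobs bounds how many jobs run concurrently, while pre_dispatch bounds how many jobs (and therefore data copies) exist at any moment, queued and waiting for a free worker.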


If n_jobs is set to a value higher than one, the data is copied for each parameter setting (not just n_jobs times). This is done for efficiency when individual jobs take very little time, but it can raise errors if the dataset is large and there is not enough memory. The workaround in that case is to set pre_dispatch: then the data is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.
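GridSearchCV delegates its parallelism to joblib, whose Parallel helper exposes the same pre_dispatch knob, so the mechanism can be seen in isolation (square is a toy stand-in for one model fit, not anything from the question):

```python
from joblib import Parallel, delayed

def square(i):
    # stand-in for fitting one parameter setting
    return i * i

# n_jobs=2 workers; with pre_dispatch="2*n_jobs", at most 4 tasks are
# materialized ahead of the workers instead of all 10 at once
results = Parallel(n_jobs=2, pre_dispatch="2*n_jobs")(
    delayed(square)(i) for i in range(10)
)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because the tasks come from a generator, limiting pre-dispatch means later tasks (and any data they capture) are not even created until the workers are close to needing them.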

