I am trying to use GridSearchCV from sklearn in Python to tune the hyperparameters of an SVM classifier. The training data has shape (15750, 65536) (15750 samples, 65536 features per sample).
Everything works fine with the default settings. However, as soon as I set the parallel-processing parameter n_jobs, I run into the following problem: the data is loaded into memory (on a machine with 48 GB of RAM it takes about 14% of the total), but the grid search / training never starts. In top (htop), the process state is S (sleeping), i.e. it is essentially stalled. It keeps holding the memory but never starts working (CPU usage stays at zero).
I tried different values for n_jobs, e.g. 2, 3, and 5 (the machine has 8 cores), but no luck. According to the documentation, with big data the pre_dispatch option of GridSearchCV can be used to limit the number of data copies and so avoid memory problems. So I even tried n_jobs=2 together with pre_dispatch=1, and still nothing works.
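For reference, pre_dispatch accepts either an integer or, per the sklearn docs, a string expression in terms of n_jobs. A minimal sketch of both forms (the bare SVC() and the tiny parameter grid here are just placeholders, not my real setup):

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # import path in sklearn 0.12

# Integer form: only this many jobs (and hence data copies) are
# pre-dispatched at a time.
clf = GridSearchCV(SVC(), {'C': [1, 10]}, n_jobs=2, pre_dispatch=1)

# String form: an expression in n_jobs, evaluated when fit() runs.
clf = GridSearchCV(SVC(), {'C': [1, 10]}, n_jobs=2, pre_dispatch='2*n_jobs')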
I should also mention that I tried the same code with far fewer samples, e.g. 1000 of them, and everything was fine again. So the question arises: given that a single copy of the data occupies only about 15% of the machine's memory, why can't it run on at least two cores with pre_dispatch=2? That should occupy roughly 30% of the machine's memory. Why does the process simply stall, without even raising a memory error? And is there a way around this?
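As a rough sanity check on those percentages (assuming, and this is only an assumption, that the features are stored as a dense float64 array):

import numpy as np

# Back-of-the-envelope footprint of one copy of the training matrix,
# assuming dense float64 storage.
n_samples, n_features = 15750, 65536
bytes_per_copy = n_samples * n_features * np.dtype(np.float64).itemsize

print(bytes_per_copy / (1024.0 ** 3))                 # ~7.7 GiB per copy
print(100.0 * bytes_per_copy / (48 * 1024.0 ** 3))    # ~16% of 48 GB of RAM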
Here is the code snippet for the task (adapted mainly from the sklearn documentation):
sklearn version: 0.12.1 and python version: 2.7.3
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # import path in sklearn 0.12

tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]

# cv is a constructor argument of GridSearchCV, not of fit()
clf = GridSearchCV(SVC(C=1), tuned_parameters,
                   cv=3, n_jobs=2, verbose=3, pre_dispatch=1)
clf.fit(tr, tt)  # tr: training data, tt: target labels