Using sklearn GridSearchCV with a pipeline, preprocessing just once

I'm using scikit-learn to tune a model's hyperparameters. I use a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)

    _ = grid.fit(X=np.random.rand(10, 3), y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScaler() in the toy example) is time-consuming, and I am not tuning any of its parameters.

So, when I run this example, StandardScaler is executed 12 times: 2 fit/predict * 2 cv * 3 parameters. But every time StandardScaler is executed for a different value of the parameter C, it returns the same output, so it would be much more efficient to compute it once and then run only the estimator part of the pipeline.
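To see the repeated work directly, here is a minimal sketch that counts how often the scaler is fit during the search. CountingScaler is a hypothetical helper written for this illustration (it is not part of scikit-learn), and the labels are fixed rather than random so that both classes appear in every fold. The transform count is not asserted, since it depends on the scikit-learn version and on whether training scores are computed.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(BaseEstimator, TransformerMixin):
    """Hypothetical helper: behaves like StandardScaler but counts its calls."""

    # Class-level counter, so it is shared by every clone GridSearchCV creates.
    calls = {"fit": 0, "transform": 0}

    def fit(self, X, y=None):
        CountingScaler.calls["fit"] += 1
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        CountingScaler.calls["transform"] += 1
        return self.scaler_.transform(X)


rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = np.tile([0, 1], 5)  # fixed labels: both classes present in each fold

grid = GridSearchCV(
    make_pipeline(CountingScaler(), LogisticRegression()),
    param_grid={"logisticregression__C": [0.1, 10.0]},
    cv=2,
    refit=False,
)
grid.fit(X, y)

# The scaler is refit once per (fold, C value) pair: 2 folds * 2 values = 4,
# even though its output for a given fold does not depend on C.
print(CountingScaler.calls)
```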

I could manually split the pipeline into the preprocessing (no hyperparameters tuned) and the estimator. But the preprocessing must be fit on the training set only. So I would have to implement the splits manually, and not use GridSearchCV at all.

Is there a simple / standard way to avoid repetitive preprocessing when using GridSearchCV?

+15
python numpy scikit-learn machine-learning grid-search
3 answers

Essentially, GridSearchCV is itself an estimator: it implements the fit() and predict() methods that the pipeline uses.

So, instead of:

    grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)

Do this:

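    # The grid search now wraps LogisticRegression directly, so the
    # parameter name needs no 'logisticregression__' step prefix.
    clf = make_pipeline(StandardScaler(),
                        GridSearchCV(LogisticRegression(),
                                     param_grid={'C': [0.1, 10.]},
                                     cv=2,
                                     refit=True))

    clf.fit(X, y)     # X, y: your training data
    clf.predict(X)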

This calls StandardScaler() only once per call to clf.fit(), instead of the multiple calls you described.

Edit:

Changed refit to True, since GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit: boolean, default=True. Refit the best estimator with the entire dataset. If "False", it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() will have no effect, because the GridSearchCV object inside the pipeline will be reinitialized after fit(). When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on all the data passed to fit().

So refit=False is appropriate only if you build the pipeline just to inspect the grid search results. If you want to call clf.predict(), you must use refit=True; otherwise a not-fitted error will be raised.
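The refit behaviour can be checked with a small sketch like the one below. The data, the build() helper and the variable names are assumptions made for illustration. One caveat worth keeping in mind with this layout: the scaler is fit on everything passed to clf.fit(), including the rows that the inner cross-validation later holds out, so the scaling statistics are not computed on the training folds alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)  # fixed labels so every fold contains both classes


def build(refit):
    """Hypothetical helper: scaler followed by a grid search over C."""
    search = GridSearchCV(
        LogisticRegression(),
        param_grid={"C": [0.1, 10.0]},  # no step prefix inside the search
        cv=2,
        refit=refit,
    )
    return make_pipeline(StandardScaler(), search)


# refit=True: the best estimator is refitted, so predict() works.
with_refit = build(refit=True).fit(X, y)
search = with_refit.named_steps["gridsearchcv"]  # name given by make_pipeline
best_C = search.best_params_["C"]
predictions = with_refit.predict(X)

# refit=False: the search results exist, but there is nothing to predict with.
without_refit = build(refit=False).fit(X, y)
try:
    without_refit.predict(X)
    predict_failed = False
except (AttributeError, ValueError):
    # The exact exception type varies between scikit-learn versions.
    predict_failed = True

print(best_C, predictions.shape, predict_failed)
```

The grid search results remain available through named_steps["gridsearchcv"] in both cases; only prediction requires the refit.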

+18

This cannot be done in the current version of scikit-learn (0.18.1). A fix has been proposed in the GitHub project:

https://github.com/scikit-learn/scikit-learn/issues/8830

https://github.com/scikit-learn/scikit-learn/pull/8322
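For readers on a later release than the one this answer refers to: Pipeline and make_pipeline accept a memory argument that caches fitted transformers on disk. The sketch below assumes a scikit-learn version that supports it, and it is not a claim about what the issue and pull request linked above propose. With the cache, a transformer fitted on the same fold with the same parameters is loaded instead of refitted when only C changes; transforming the validation data still runs each time. The temporary cache directory and the toy data are assumptions for illustration.

```python
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)  # fixed labels so every fold contains both classes

with tempfile.TemporaryDirectory() as cache_dir:
    # memory=<directory>: fitted transformers are cached there and reused
    # when the same step is fitted again on identical data and parameters.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(), memory=cache_dir)

    grid = GridSearchCV(
        pipe,
        param_grid={"logisticregression__C": [0.1, 10.0]},
        cv=2,
        refit=False,
    )
    grid.fit(X, y)
    best_params = grid.best_params_

print(best_params)
```

This keeps the preprocessing inside the cross-validation loop, so it is still fit on the training folds only. Caching pays off when the transformer is expensive; for something as cheap as StandardScaler the disk round trip may cost more than refitting.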

+2

For those who stumble upon a slightly different problem, which I also had:

Suppose you have this pipeline:

    classifier = Pipeline([
        ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
        ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

Then, when specifying the parameters, you need to prefix each one with "clf__" (the name you gave your estimator step, followed by two underscores). Thus, the parameter grid will be:

    params = {'clf__max_features': [0.3, 0.5, 0.7],
              'clf__min_samples_leaf': [1, 2, 3],
              'clf__max_depth': [None]}
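One way to check such names before launching a search is to compare them with the keys returned by get_params() on the pipeline. The sketch below rebuilds the pipeline from this answer; the value of SEED is an assumption, since the answer does not give it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

SEED = 0  # assumed value; the answer does not specify it

classifier = Pipeline([
    ("vectorizer", CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1)),
])

params = {
    "clf__max_features": [0.3, 0.5, 0.7],
    "clf__min_samples_leaf": [1, 2, 3],
    "clf__max_depth": [None],
}

# Every tunable name is listed as '<step name>__<parameter name>'.
valid_names = set(classifier.get_params().keys())
unknown = [name for name in params if name not in valid_names]
print(unknown)  # an empty list means every grid key is recognised

# Without the step prefix, the pipeline rejects the name.
try:
    classifier.set_params(max_depth=3)
    unprefixed_rejected = False
except ValueError:
    unprefixed_rejected = True
print(unprefixed_rejected)
```

Note that clf__max_features targets the random forest, while the vectorizer's own max_features would be addressed as vectorizer__max_features.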
0
