I use scikit-learn to tune model hyperparameters, and I use a Pipeline to chain a preprocessing step with an estimator. A simple version of my problem looks like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))
In my case, the preprocessing (StandardScaler() in the toy example) takes a lot of time, and I am not tuning any of its parameters.
So when I run this example, StandardScaler is fitted once for every combination of parameter value and CV fold (2 values of C × 2 folds = 4 fits here). But StandardScaler produces exactly the same result regardless of the value of C, so it would be much more efficient to compute the preprocessing once and then only re-run the estimator part of the pipeline.
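To make the repetition visible, here is a small variant of the toy example (my own instrumentation, not part of the original code) that wraps StandardScaler in a transformer counting how many times fit() is called; the labels are made balanced so the stratified 2-fold split always succeeds:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class CountingScaler(BaseEstimator, TransformerMixin):
    """StandardScaler wrapper that counts how often fit() is called."""
    n_fits = 0  # class-level counter, shared across the clones GridSearchCV makes

    def fit(self, X, y=None):
        CountingScaler.n_fits += 1
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        return self.scaler_.transform(X)

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
_ = grid.fit(X=np.random.rand(10, 3),
             y=np.tile([0, 1], 5))  # balanced labels for StratifiedKFold
print(CountingScaler.n_fits)  # 2 parameter values * 2 folds = 4
```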
I could manually split the pipeline into preprocessing (with no tuned hyperparameters) and estimation. But to fit the preprocessing correctly, I must fit it on the training portion of each split only. That means I would have to implement the cross-validation splits myself and not use GridSearchCV at all.
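For concreteness, a hand-rolled sketch of that workaround (my own illustration, not a recommendation): fit the expensive preprocessing once per fold, then loop over the parameter values on the already-transformed data instead of calling GridSearchCV:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(10, 3)
y = np.tile([0, 1], 5)  # balanced labels so a stratified 2-fold split works
Cs = [0.1, 10.]

scores = {C: [] for C in Cs}
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    # Fit the expensive preprocessing once per fold...
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])
    # ...and reuse the transformed data for every value of C.
    for C in Cs:
        clf = LogisticRegression(C=C).fit(X_train, y[train_idx])
        scores[C].append(clf.score(X_test, y[test_idx]))

best_C = max(Cs, key=lambda C: np.mean(scores[C]))
```

This runs StandardScaler only once per fold, but it reimplements splitting and scoring by hand, which is exactly what I am trying to avoid.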
Is there a simple, standard way to avoid this repeated preprocessing when using GridSearchCV?
python numpy scikit-learn machine-learning grid-search
Marc Garcia