How to use scikit-learn preprocessing / normalization together with cross-validation?

As an example of cross-validation without preprocessing, I can do something like this:

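    from sklearn.linear_model import SGDClassifier
    # GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module was removed
    from sklearn.model_selection import GridSearchCV

    tuned_params = [{"penalty": ["l2", "l1"]}]
    sgd = SGDClassifier()
    # pass the classifier and parameter grid defined above
    clf = GridSearchCV(sgd, tuned_params, verbose=5)
    clf.fit(x_train, y_train)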

I would like to pre-process my data using something like

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

But it would be a bad idea to do this before setting up cross-validation, because then the training and test sets would be normalized together. How can I configure cross-validation so that, for each run, the scaling is learned from the training split only and then applied to the corresponding test split?

python scikit-learn
1 answer

According to the documentation, if you use a Pipeline, this is done for you. From the docs, just above section 3.1.1.1 (emphasis mine):

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations should likewise be learned from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation.

More information about pipelines is available here.
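As a rough illustration of the Pipeline approach applied to the question's setup, here is a minimal sketch. It assumes a reasonably recent scikit-learn, uses StandardScaler as the scaling step (the estimator counterpart of `preprocessing.scale`), and generates synthetic data in place of the asker's `x_train` and `y_train`. The step names `"scale"` and `"sgd"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption: not from the original post).
rng = np.random.RandomState(0)
x_train = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 1.0, 1.0])
y_train = (x_train[:, 0] + x_train[:, 1] / 10.0 > 0).astype(int)

# Chain the scaler and the classifier. During cross-validation the whole
# pipeline is refit on each training split, so the scaler's mean and variance
# are learned from that split only and then applied to the held-out split.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)
```

After fitting, `clf.predict(x_new)` pushes new data through the same fitted scaler before classification, so no separate call to `preprocessing.scale` is needed.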
