Scikit-learn sample weight broken in cross validation

Question

I am trying to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass the sample weights directly to the classifier, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:

To pass the weights when using grid search, the usage is:

 grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1, fit_params={"sample_weight": weights})

The problem is that the cross-validator is not aware of the sample weights and therefore does not subsample them together with the actual data, so the call to grid_search.fit(X, y) fails: the cross-validator creates subsets of X and y, sub_X and sub_y, and eventually a classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but the weights have not been subsampled to match, so an exception is thrown.

For now I have worked around the problem by oversampling high-weight samples before training the classifier, but this is only a temporary workaround. Any suggestions on how to proceed?

python scikit-learn machine-learning

Roee shenberg Feb 19 '14 at 18:00

3 answers

xenocyon · Answer 1 · 2014-12-03T03:05:08+0000

Edit: The scores I get from the code below do not look quite right. Perhaps this is because, as mentioned above, even when the weights are used for fitting, they may not be used in scoring.

This seems to be fixed now. I am running sklearn version 0.15.2. My code looks something like this:

 model = SGDRegressor()
 parameters = {'alpha': [0.01, 0.001, 0.0001]}
 cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
 cv.fit(X, y)

Hope this helps (you and others who see this post).

lejlot · Answer 2 · 2014-02-19T18:43:19+0000

I would suggest writing your own cross-validated parameter search, as it is only 10-15 lines of Python code (especially if you use the KFold object from scikit-learn), while oversampling is probably a big bottleneck.
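A minimal sketch of such a hand-rolled search, assuming a scikit-learn release that provides sklearn.model_selection (0.18 or later; older releases kept KFold in sklearn.cross_validation). The helper name weighted_grid_search and the toy data are hypothetical, made up for illustration. The point is that the weights are sliced with the same fold indices as X and y, for both fitting and scoring:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, ParameterGrid


def weighted_grid_search(estimator, param_grid, X, y, weights, n_splits=3):
    """Hypothetical helper: return (best_params, best_score) over param_grid."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_params, best_score = None, -np.inf
    for params in ParameterGrid(param_grid):
        fold_scores = []
        for train_idx, test_idx in cv.split(X):
            model = clone(estimator).set_params(**params)
            # Slice the weights with the same indices as X and y.
            model.fit(X[train_idx], y[train_idx],
                      sample_weight=weights[train_idx])
            # Weighted accuracy on the held-out fold.
            fold_scores.append(
                model.score(X[test_idx], y[test_idx],
                            sample_weight=weights[test_idx]))
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)
weights = rng.rand(60)

best_params, best_score = weighted_grid_search(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10]},
    X, y, weights,
)
```

Scoring with the held-out weights is a design choice here, not something the built-in grid search is guaranteed to do; drop the sample_weight argument to score() if unweighted accuracy is what you want to compare.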

milonimrod · Answer 3 · 2017-02-01T08:40:32+0000

I have too little reputation, so I cannot comment on @xenocyon's answer. I am using sklearn 0.18.1, and I am also using a pipeline in my code. The solution that worked for me was:

fit_params={'classifier__sample_weight': w} where w is the weight vector and classifier is the name of the step in the pipeline.
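For context, here is a sketch of the same idea in a complete script. It assumes a newer scikit-learn release in which fit parameters are passed to GridSearchCV.fit() rather than to the constructor (in 0.18.1, as above, the dictionary goes to the fit_params constructor argument instead). The step names scaler and classifier and the toy data are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] > 0.5).astype(int)
w = rng.rand(60)  # one weight per sample

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Hyperparameters and fit parameters both use the <step>__<name> prefix.
param_grid = {"classifier__n_estimators": [5, 10]}
search = GridSearchCV(pipe, param_grid, cv=3)

# Assumption: fit parameters go to fit() in this release; the search
# slices w per fold alongside X and y before calling the pipeline's fit.
search.fit(X, y, classifier__sample_weight=w)
```

The prefix routes the weights to the named step only, so the scaler in this sketch is fitted without them.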
