Manual train/test split compared to scikit-learn grid search

I am confused because I get what appear to be very different results from a "manual" split of the data into training and test sets versus using scikit-learn's grid search. I use the evaluation function from the Kaggle contest for both runs, and the grid search is over a single parameter value (the same value used in the manual split). The resulting Gini values are so different that there must be an error somewhere, but I can't see it. Is there an oversight in how I am making the comparison?

Running the first block of code gives me only "Validation Sample Score: 0.0033997889 (normalized gini)."

The second block (using scikit-learn's grid search) gives significantly higher values:

    Fitting 2 folds for each of 1 candidates, totalling 2 fits
    0.334467621189
    0.339421569449
    [Parallel(n_jobs=-1)]: Done 3 out of 2 | elapsed: 9.9min remaining: -198.0s
    [Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 9.9min finished
    {'n_estimators': 1000}
    0.336944643888
    [mean: 0.33694, std: 0.00248, params: {'n_estimators': 1000}]

Eval Function:

    def gini(solution, submission):
        df = zip(solution, submission)
        df = sorted(df, key=lambda x: (x[1], x[0]), reverse=True)
        rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
        totalPos = float(sum([x[0] for x in df]))
        cumPosFound = [df[0][0]]
        for i in range(1, len(df)):
            cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
        Lorentz = [float(x) / totalPos for x in cumPosFound]
        Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
        return sum(Gini)

    def normalized_gini(solution, submission):
        normalized_gini = gini(solution, submission) / gini(solution, solution)
        print normalized_gini
        return normalized_gini

    gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

Block 1:

    if __name__ == '__main__':
        dat = pd.read_table('train.csv', sep=",")
        y = dat[['Hazard']].values.ravel()
        dat = dat.drop(['Hazard', 'Id'], axis=1)

        # sample out 30% for validation
        folds = train_test_split(range(len(y)), test_size=0.3)  # 30% test
        train_X = dat.iloc[folds[0], :]
        train_y = y[folds[0]]
        test_X = dat.iloc[folds[1], :]
        test_y = y[folds[1]]

        # assume no leakage by OH whole data
        dat_dict = train_X.T.to_dict().values()
        vectorizer = DV(sparse=False)
        vectorizer.fit(dat_dict)
        train_X = vectorizer.transform(dat_dict)
        del dat_dict

        dat_dict = test_X.T.to_dict().values()
        test_X = vectorizer.transform(dat_dict)
        del dat_dict

        rf = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
        rf.fit(train_X, train_y)
        y_submission = rf.predict(test_X)
        print "Validation Sample Score: %.10f (normalized gini)." % normalized_gini(test_y, y_submission)

Block 2:

    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    X = vectorizer.transform(dat_dict)

    parameters = {'n_estimators': [1000]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters, cv=2,
                               verbose=1, scoring=gini_scorer, n_jobs=-1)
    grid_search.fit(X, y)
    print grid_search.best_params_
    print grid_search.best_score_
    print grid_search.grid_scores_

EDIT

Here is a self-contained example in which I get the same kind of difference.

    from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split
    from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn import metrics
    from sklearn.preprocessing import StandardScaler
    from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint, uniform
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import load_boston

    if __name__ == '__main__':

        b = load_boston()
        X = pd.DataFrame(b.data)
        y = b.target

        # sample out 50% for validation
        folds = train_test_split(range(len(y)), test_size=0.5)  # 50% test
        train_X = X.iloc[folds[0], :]
        train_y = y[folds[0]]
        test_X = X.iloc[folds[1], :]
        test_y = y[folds[1]]

        rf = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
        rf.fit(train_X, train_y)
        y_submission = rf.predict(test_X)
        print "Validation Sample Score: %.10f (mean squared)." % mean_squared_error(test_y, y_submission)

        parameters = {'n_estimators': [1000]}
        grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters, cv=2,
                                   verbose=1, scoring='mean_squared_error', n_jobs=-1)
        grid_search.fit(X, y)
        print grid_search.best_params_
        print grid_search.best_score_
        print grid_search.grid_scores_
+7
python scikit-learn machine-learning
4 answers

Not sure if I can provide you with a complete solution, but here are a few pointers:

  1. Use the random_state parameter of scikit-learn objects when debugging this kind of problem, so that your results are actually reproducible. The following will always return exactly the same number:

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    mean_squared_error(test_y, y_submission)

This seeds the random number generator so that you always get "the same randomness." You should use it for train_test_split and for the estimator inside GridSearchCV too.

  2. The results you get with your self-contained example are normal. I typically get:

    Validation Sample Score: 9.8136434847 (mean squared).
    [mean: -22.38918, std: 11.56372, params: {'n_estimators': 1000}]

First, note that the mean squared error returned by GridSearchCV is negated. I think this is by design, to keep the convention that, for a scorer, greater is better.
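You can see the sign convention directly with cross_val_score. This is only a sketch, assuming the same older scikit-learn version used in this thread (sklearn.cross_validation and the 'mean_squared_error' scoring string):

    from sklearn.cross_validation import cross_val_score
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import load_boston

    b = load_boston()
    # The per-fold scores come back negated, so "greater is better" still holds.
    print cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                          b.data, b.target, cv=2, scoring='mean_squared_error')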

That leaves 9.81 versus 22.38. However, the standard deviation here is HUGE, which may explain why the scores look so different. If you want to check that GridSearchCV is not doing anything dubious, you can force it to use a single split, identical to your manual one:

    from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
    from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn import metrics
    from sklearn.preprocessing import StandardScaler
    from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint, uniform
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import load_boston

    if __name__ == '__main__':

        b = load_boston()
        X = pd.DataFrame(b.data)
        y = b.target

        folds = train_test_split(range(len(y)), test_size=0.5, random_state=15)  # 50% test

        folds_split = np.ones_like(y)
        folds_split[folds[0]] = -1
        ps = PredefinedSplit(folds_split)

        for tr, te in ps:
            train_X = X.iloc[tr, :]
            train_y = y[tr]
            test_X = X.iloc[te, :]
            test_y = y[te]

        rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X, train_y)
        y_submission = rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (mean squared).".format(mean_squared_error(test_y, y_submission)))

        parameters = {'n_estimators': [1000], 'n_jobs': [1], 'random_state': [15]}
        grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters, cv=ps,
                                   verbose=2, scoring='mean_squared_error', n_jobs=1)
        grid_search.fit(X, y)
        print("best_params: ", grid_search.best_params_)
        print("best_score", grid_search.best_score_)
        print("grid_scores", grid_search.grid_scores_)

Hope this helps a bit.

Sorry, I can't tell what is going on with your Gini scorer. I would say that 0.0033xxx looks like a very low value (almost no model at all?) for a normalized Gini score.
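To put that number in perspective, here is a minimal sketch on a hypothetical target, using a compact numpy re-implementation of the question's gini()/normalized_gini() (the variable names and the Poisson target are just for illustration):

    import numpy as np

    def gini(solution, submission):
        # numpy version of the question's gini(): sort by prediction descending
        # (ties broken by actual descending), then compare the cumulative share
        # of actuals against the diagonal
        sol = np.asarray(solution, dtype=float)
        sub = np.asarray(submission, dtype=float)
        order = np.lexsort((-sol, -sub))
        sol = sol[order]
        n = len(sol)
        cum_pos = np.cumsum(sol) / sol.sum()
        rand = np.arange(1, n + 1, dtype=float) / n
        return float(np.sum(cum_pos - rand))

    def normalized_gini(solution, submission):
        return gini(solution, submission) / gini(solution, solution)

    np.random.seed(0)
    y_true = np.random.poisson(3, size=10000).astype(float)  # hypothetical Hazard-like target
    print normalized_gini(y_true, y_true)                     # 1.0 by construction
    print normalized_gini(y_true, np.random.rand(10000))      # close to 0: essentially no model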

+5

Following your minimal example and the answers from user3914041 and Andreus, this works as intended. Indeed, I get:

    Validation Sample Score: 10.176958 (mean squared).
    Fitting 1 folds for each of 1 candidates, totalling 1 fits
    mean: 10.19074, std: 0.00000, params: {'n_estimators': 1000}

In this case, both methodologies give the same result (up to some rounding). Here is the code that reproduces the same scores:

    from sklearn.cross_validation import train_test_split, PredefinedSplit
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    from sklearn import metrics
    from sklearn.grid_search import GridSearchCV
    from sklearn.metrics import mean_squared_error, make_scorer
    from sklearn.datasets import load_boston

    b = load_boston()
    X = b.data
    y = b.target

    folds = train_test_split(range(len(y)), test_size=0.5, random_state=10)
    train_X = X[folds[0], :]
    train_y = y[folds[0]]
    test_X = X[folds[1], :]
    test_y = y[folds[1]]

    folds_split = np.zeros_like(y)
    folds_split[folds[0]] = -1
    ps = PredefinedSplit(folds_split)

    rf = RandomForestRegressor(n_estimators=1000, random_state=42)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print "Validation Sample Score: %f (mean squared)." % mean_squared_error(test_y, y_submission)

    mse_scorer = make_scorer(mean_squared_error)
    parameters = {'n_estimators': [1000]}
    grid_search = GridSearchCV(RandomForestRegressor(random_state=42), cv=ps,
                               param_grid=parameters, verbose=1, scoring=mse_scorer)
    grid_search.fit(X, y)
    print grid_search.grid_scores_[0]

In your first example, try removing greater_is_better=True. Indeed, the Gini coefficient should be minimized, not maximized.

Check whether this solves the problem. You can also set a random seed to make sure your split is done in exactly the same way in both runs.
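A minimal sketch of those two changes, reusing normalized_gini and y from the question (treating the Gini as something to minimize is this answer's premise, not an established fact):

    from sklearn import metrics
    from sklearn.cross_validation import train_test_split

    # Score the normalized Gini as a loss, and seed the split so the manual run
    # and GridSearchCV see exactly the same data.
    gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=False)
    folds = train_test_split(range(len(y)), test_size=0.3, random_state=0)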

+5

There is one difference between the two blocks of code that I can see. With cv=2, you split the data into two 50% folds, and the resulting Gini value is averaged over them.
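You can look at those two per-fold scores on their own with cross_val_score. This is just a sketch that assumes X, y, gini_scorer, and RandomForestRegressor from the question are already defined:

    from sklearn.cross_validation import cross_val_score

    # Two 50% folds scored separately; GridSearchCV reports their mean.
    scores = cross_val_score(RandomForestRegressor(n_estimators=1000, n_jobs=-1),
                             X, y, cv=2, scoring=gini_scorer)
    print scores         # roughly the two per-fold values in the question's output
    print scores.mean()  # roughly the mean that GridSearchCV reports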

As a side note, are you sure you want greater_is_better=True in your scorer? From your post, it sounds like you want this score to be minimized. Be extremely careful here, because GridSearchCV maximizes the score.

From the GridSearchCV documentation:

The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.
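For reference, a minimal sketch of how make_scorer handles the sign on the older scikit-learn used in this thread (LinearRegression is just a stand-in estimator here):

    from sklearn.metrics import make_scorer, mean_squared_error
    from sklearn.linear_model import LinearRegression
    from sklearn.datasets import load_boston

    b = load_boston()
    est = LinearRegression().fit(b.data, b.target)

    # With greater_is_better=False, make_scorer negates the metric, so the score
    # GridSearchCV maximizes corresponds to minimizing the MSE itself.
    mse_loss = make_scorer(mean_squared_error, greater_is_better=False)
    print mse_loss(est, b.data, b.target)                    # negated MSE
    print mean_squared_error(b.target, est.predict(b.data))  # same value, positive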

+4

This thread is pretty old, so I assume you have figured it all out by now, but for completeness: there were at least three problems in the original two blocks that caused the different results. In short, not setting a couple of random seeds, and not using PredefinedSplit on the folds returned by train_test_split (iterating over the split can reorder the partitions). Here is standalone code illustrating this, using another Gini implementation:

    import sys
    import numpy as np
    import pandas as pd
    from sklearn.cross_validation import train_test_split, PredefinedSplit
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn.grid_search import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor
    from sklearn import metrics

    def gini(expected, predicted):
        assert expected.shape[0] == predicted.shape[0], 'unequal number of rows: [ %d vs %d ]' \
            % (expected.shape[0], predicted.shape[0])

        _all = np.asarray(np.c_[expected, predicted, np.arange(expected.shape[0])], dtype=np.float)

        _EXPECTED = 0
        _PREDICTED = 1
        _INDEX = 2

        # sort by predicted descending, then by index ascending
        sort_order = np.lexsort((_all[:, _INDEX], -1 * _all[:, _PREDICTED]))
        _all = _all[sort_order]

        total_losses = _all[:, _EXPECTED].sum()
        gini_sum = _all[:, _EXPECTED].cumsum().sum() / total_losses
        gini_sum -= (expected.shape[0] + 1.0) / 2.0
        return gini_sum / expected.shape[0]

    def gini_normalized(solution, submission, gini=gini):
        solution = np.array(solution)
        submission = np.array(submission)
        return gini(solution, submission) / gini(solution, solution)

    gini_scorer = metrics.make_scorer(gini_normalized, greater_is_better=True)

    dat = pd.read_table('train.csv', sep=',')
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)

    # 1. set seed for train_test_split()
    folds = train_test_split(range(len(y)), test_size=0.7, random_state=15)  # 70% test

    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    dat = vectorizer.transform(dat_dict)
    dat = pd.DataFrame(dat)

    # 2. instead of using the raw folds returned by train_test_split,
    #    use the PredefinedSplit iterator, just like GridSearchCV does
    if 0:
        train_X = dat.iloc[folds[0]]
        train_y = y[folds[0]]
        test_X = dat.iloc[folds[1]]
        test_y = y[folds[1]]
    else:
        folds_split = np.zeros_like(y)
        folds_split[folds[0]] = -1
        ps = PredefinedSplit(folds_split)
        # in this example, there is only one iteration here
        for train_index, test_index in ps:
            train_X, test_X = dat.iloc[train_index], dat.iloc[test_index]
            train_y, test_y = y[train_index], y[test_index]

    n_estimators = [100, 200]

    # 3. also set seed for RFR
    rfr_params = {'n_jobs': 7, 'random_state': 15}

    ######################################################################
    # manual grid search (block 1)

    for n_est in n_estimators:
        print 'n_estimators = %d:' % n_est
        sys.stdout.flush()
        rfr = RandomForestRegressor(n_estimators=n_est, **rfr_params)
        rfr.fit(train_X, train_y)
        y_pred = rfr.predict(test_X)
        gscore = gini_normalized(test_y, y_pred)
        print '    validation score: %.5f (normalized gini)' % gscore

    ######################################################################
    # GridSearchCV grid search (block 2)

    ps = PredefinedSplit(folds_split)
    rfr = RandomForestRegressor(**rfr_params)
    grid_params = {'n_estimators': n_estimators}
    gcv = GridSearchCV(rfr, grid_params, scoring=gini_scorer, cv=ps)
    gcv.fit(dat, y)
    print gcv.grid_scores_
0
