I am confused because I get, it seems, very different results depending on whether I rely on a "manual" split of the data into training and test sets or use scikit-learn's grid search function. I use the evaluation function from the Kaggle contest for both runs, and the grid search is over a single parameter value (the same value I use with the manual split). The resulting gini values are so different that there must be an error somewhere, but I cannot see it. Is there some oversight in how I am making the comparison?
Running the first block of code gives me a gini of only "Validation Sample Score: 0.0033997889 (normalized gini)."
The second block (using scikit-learn's grid search) gives significantly higher values:
Fitting 2 folds for each of 1 candidates, totalling 2 fits
0.334467621189
0.339421569449
[Parallel(n_jobs=-1)]: Done 3 out of 2 | elapsed: 9.9min remaining: -198.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 9.9min finished
{'n_estimators': 1000}
0.336944643888
[mean: 0.33694, std: 0.00248, params: {'n_estimators': 1000}]
Eval Function:
def gini(solution, submission):
    # sort by predicted value (ties broken by the true value), descending
    df = zip(solution, submission)
    df = sorted(df, key=lambda x: (x[1], x[0]), reverse=True)
    rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
    Lorentz = [float(x) / totalPos for x in cumPosFound]
    Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    # scale by the gini of a perfect submission
    normalized_gini = gini(solution, submission) / gini(solution, solution)
    print normalized_gini
    return normalized_gini

gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)
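Just as a sanity check on the metric itself (these are toy numbers I made up, not contest data): with the functions above, a submission that ranks the targets in exactly the right order scores 1.0, and a fully reversed ranking scores -1.0.

# toy sanity check for the metric above (made-up numbers)
solution = [1, 2, 3, 4, 5]            # "true" targets
good_sub = [1.1, 2.2, 2.9, 4.3, 5.5]  # same ranking as the targets
bad_sub = [5.0, 4.0, 3.0, 2.0, 1.0]   # exactly reversed ranking

normalized_gini(solution, good_sub)   # prints 1.0
normalized_gini(solution, bad_sub)    # prints -1.0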
Block 1:
if __name__ == '__main__':
    dat = pd.read_table('train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)
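The scoring step on top of this manual split looks roughly like the sketch below. It is not the verbatim code that produced the 0.0034 score: the vectorization, the split fraction and the random_state are only placeholders.

    # sketch only -- vectorization, split fraction and random_state are
    # placeholders, not necessarily the exact settings behind the 0.0034 score
    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    X = vectorizer.fit_transform(dat_dict)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)

    rf = RandomForestRegressor(n_estimators=1000)
    rf.fit(X_train, y_train)
    # normalized_gini prints the validation score itself
    normalized_gini(y_test, rf.predict(X_test))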
Block 2:
# one-hot encode the categorical columns
dat_dict = dat.T.to_dict().values()
vectorizer = DV(sparse=False)
vectorizer.fit(dat_dict)
X = vectorizer.transform(dat_dict)

# grid with a single candidate, scored with the gini scorer over 2 CV folds
parameters = {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters, cv=2,
                           verbose=1, scoring=gini_scorer, n_jobs=-1)
grid_search.fit(X, y)

print grid_search.best_params_
print grid_search.best_score_
print grid_search.grid_scores_
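To be clear about what the grid search is scoring: as far as I understand make_scorer, GridSearchCV calls gini_scorer(estimator, X_val, y_val) on each fold, which in turn evaluates normalized_gini(y_val, estimator.predict(X_val)), i.e. the true values go in as solution and the predictions as submission. On a single arbitrary split (train_test_split from sklearn.cross_validation, my own random_state), the two calls below should therefore print the same number:

# quick check of the scorer's argument order (arbitrary 50/50 split)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=100).fit(X_tr, y_tr)

gini_scorer(rf, X_va, y_va)              # what GridSearchCV computes on a fold
normalized_gini(y_va, rf.predict(X_va))  # the equivalent direct call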
EDIT
Here is an example in which I get the same difference.
from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

if __name__ == '__main__':
    b = load_boston()
    X = pd.DataFrame(b.data)
    y = b.target
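The rest of the example repeats the same two-way comparison on the Boston data. What follows is only a sketch of it (continuing inside the __main__ block and reusing normalized_gini and gini_scorer from the eval function above; the 50/50 split, random_state and n_estimators=500 are arbitrary illustration values, not my exact settings):

    # manual split, scored directly
    X_tr, X_te, y_tr, y_te = train_test_split(
        X.values, y, test_size=0.5, random_state=0)
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(X_tr, y_tr)
    print "manual split score:"
    normalized_gini(y_te, rf.predict(X_te))

    # the same single candidate through GridSearchCV with the gini scorer
    grid = GridSearchCV(RandomForestRegressor(random_state=0),
                        param_grid={'n_estimators': [500]},
                        cv=2, scoring=gini_scorer, verbose=1)
    grid.fit(X.values, y)
    print "grid search best score:", grid.best_score_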