Scikit-Learn One-hot-encode before or after split train / test

I am considering two scenarios for building a model in scikit-learn, and I cannot understand why one of them returns a result so fundamentally different from the other. The only difference between the two cases (as far as I know) is that in the first case I one-hot encode the categorical variables all at once (on the whole dataset) and then split into training and test sets, whereas in the second case I split into training and test sets first and then one-hot encode both sets based on the training data.

The latter case is technically sounder for estimating the generalization error of the procedure, yet it returns a normalized gini that is very different (and bad: essentially no model) compared to the first case. I know that the gini from the first case (~0.33) is consistent with a model built on this data.

Why does the second case return such a different gini? FYI, the dataset contains a mix of numeric and categorical variables.

Method 1 (one-hot encode the whole dataset once, then split). Returns: Validation Sample Score: 0.3454355044 (normalized gini).

    from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
    from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn import metrics
    from sklearn.preprocessing import StandardScaler
    from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint, uniform
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import load_boston


    def gini(solution, submission):
        df = zip(solution, submission, range(len(solution)))
        df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
        rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
        totalPos = float(sum([x[0] for x in df]))
        cumPosFound = [df[0][0]]
        for i in range(1, len(df)):
            cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
        Lorentz = [float(x) / totalPos for x in cumPosFound]
        Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
        return sum(Gini)


    def normalized_gini(solution, submission):
        normalized_gini = gini(solution, submission) / gini(solution, solution)
        return normalized_gini


    # Normalized Gini Scorer
    gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

    if __name__ == '__main__':
        dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
        y = dat[['Hazard']].values.ravel()
        dat = dat.drop(['Hazard', 'Id'], axis=1)

        folds = train_test_split(range(len(y)), test_size=0.30, random_state=15)  # 30% test

        # First one hot and make a pandas df
        dat_dict = dat.T.to_dict().values()
        vectorizer = DV(sparse=False)
        vectorizer.fit(dat_dict)
        dat = vectorizer.transform(dat_dict)
        dat = pd.DataFrame(dat)

        train_X = dat.iloc[folds[0], :]
        train_y = y[folds[0]]
        test_X = dat.iloc[folds[1], :]
        test_y = y[folds[1]]

        rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X, train_y)
        y_submission = rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))

Method 2 (split first, then one-hot encode based on the training data). Returns: Validation Sample Score: 0.0055124452 (normalized gini).

    from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
    from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn import metrics
    from sklearn.preprocessing import StandardScaler
    from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint, uniform
    from sklearn.metrics import mean_squared_error
    from sklearn.datasets import load_boston


    def gini(solution, submission):
        df = zip(solution, submission, range(len(solution)))
        df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
        rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
        totalPos = float(sum([x[0] for x in df]))
        cumPosFound = [df[0][0]]
        for i in range(1, len(df)):
            cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
        Lorentz = [float(x) / totalPos for x in cumPosFound]
        Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
        return sum(Gini)


    def normalized_gini(solution, submission):
        normalized_gini = gini(solution, submission) / gini(solution, solution)
        return normalized_gini


    # Normalized Gini Scorer
    gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

    if __name__ == '__main__':
        dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
        y = dat[['Hazard']].values.ravel()
        dat = dat.drop(['Hazard', 'Id'], axis=1)

        folds = train_test_split(range(len(y)), test_size=0.3, random_state=15)  # 30% test

        # First split
        train_X = dat.iloc[folds[0], :]
        train_y = y[folds[0]]
        test_X = dat.iloc[folds[1], :]
        test_y = y[folds[1]]

        # One hot encode the training X and transform the test X
        dat_dict = train_X.T.to_dict().values()
        vectorizer = DV(sparse=False)
        vectorizer.fit(dat_dict)
        train_X = vectorizer.transform(dat_dict)
        train_X = pd.DataFrame(train_X)

        dat_dict = test_X.T.to_dict().values()
        test_X = vectorizer.transform(dat_dict)
        test_X = pd.DataFrame(test_X)

        rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X, train_y)
        y_submission = rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))
2 answers

While the previous comments correctly suggest that it is best to map over your entire feature space, in your case both the train and the test data contain all of the feature values in all of the columns.

If you compare vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in the mapping. Hence, this cannot be the cause of the problem.
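For readers who want to reproduce that check, here is a minimal self-contained sketch on toy data (not the question's dataset): as long as the subset used for fitting contains every categorical level, a DictVectorizer fitted on the subset learns exactly the same vocabulary as one fitted on all rows.

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    # Toy frame with one categorical and one numeric column (illustration only).
    toy = pd.DataFrame({'cat': ['A', 'B', 'A', 'B', 'A', 'B'],
                        'num': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

    full_vec = DictVectorizer(sparse=False).fit(toy.to_dict(orient='records'))
    part_vec = DictVectorizer(sparse=False).fit(toy.iloc[:4].to_dict(orient='records'))

    # Both 'A' and 'B' appear in the first four rows, so the learned
    # feature-to-column mapping is identical.
    print(full_vec.vocabulary_ == part_vec.vocabulary_)  # True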

The reason Method 2 fails is that dat_dict gets re-sorted by the original index when this command is executed:

    dat_dict=train_X.T.to_dict().values()

In other words, train_X goes into this line of code with a shuffled index. When you turn it into a dict, the dict ordering is re-sorted into the numerical order of the original index. This causes your train and test data to become completely decorrelated from y.
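To see what that re-sorting looks like, here is a minimal sketch on toy data (hypothetical values, not the question's dataset). On the Python 2 / older pandas stack this question uses (note sklearn.cross_validation), plain dicts do not guarantee insertion order, so the rows of the dict tend to come back in the sorted order of the original index while train_y stays in the shuffled order; on Python 3.7+ insertion order is preserved and the mismatch may not reproduce.

    import pandas as pd

    # Stand-in for train_X: one feature column and a deliberately shuffled index.
    train_X = pd.DataFrame({'T1_V1': [10, 20, 30, 40]}, index=[3, 0, 2, 1])

    dat_dict = train_X.T.to_dict()   # keys of this dict are the original row labels
    print(list(train_X.index))       # [3, 0, 2, 1] -- the shuffled row order
    print(list(dat_dict.keys()))     # on Python 2 this typically comes back as
                                     # [0, 1, 2, 3]; on Python 3.7+ the shuffled
                                     # order is kept

    # If the two orders differ, vectorizer.transform(dat_dict.values()) produces
    # rows in one order while train_y is still in the other, so the features end
    # up decorrelated from the target.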

Method 1 does not suffer from this problem, because you shuffle the data after the mapping.

You can fix the problem by adding .reset_index(drop=True) in both places where you assign dat_dict in Method 2, for example:

    dat_dict=train_X.reset_index(drop=True).T.to_dict().values()

This ensures that the row order is preserved when converting to a dict.
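The same change applies to the test-side assignment; a sketch mirroring the line above:

    dat_dict=test_X.reset_index(drop=True).T.to_dict().values()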

When I add this bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)
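As an alternative to reset_index (not what the answer above uses, just a sketch under the same variable names as Method 2), you can sidestep the transpose-and-dict route entirely: DataFrame.to_dict(orient='records') returns a list of per-row dicts in the frame's own row order, so there is no index ordering to lose.

    # Sketch only: assumes train_X, test_X, pd and DV (DictVectorizer) exist
    # exactly as defined in Method 2 before the encoding step.
    vectorizer = DV(sparse=False)

    # 'records' yields one dict per row, in row order, so nothing gets re-sorted.
    train_X = pd.DataFrame(vectorizer.fit_transform(train_X.to_dict(orient='records')))
    test_X = pd.DataFrame(vectorizer.transform(test_X.to_dict(orient='records')))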


I can't get your code to run, but my guess is that in the test dataset either:

  • you do not see all the levels of some categorical variables, so if you compute your dummy variables from that data alone you actually end up with different columns (see the sketch after this list), or
  • you end up with the same columns, but in a different order.
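To make the first point concrete, here is a minimal self-contained sketch on toy data (illustration only, not the question's dataset): encoding train and test separately with pd.get_dummies can produce different column sets, whereas a DictVectorizer fitted on the training rows keeps the test matrix aligned to the training columns and simply drops unseen levels.

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    train = pd.DataFrame({'cat': ['A', 'B', 'A']})
    test = pd.DataFrame({'cat': ['A', 'C']})       # level 'C' never seen in train

    # Encoding each set independently yields different columns.
    print(pd.get_dummies(train).columns.tolist())  # ['cat_A', 'cat_B']
    print(pd.get_dummies(test).columns.tolist())   # ['cat_A', 'cat_C']

    # Fitting once on train and reusing the vectorizer keeps the columns aligned;
    # the unseen level 'C' simply becomes an all-zero one-hot block.
    vec = DictVectorizer(sparse=False)
    X_train = vec.fit_transform(train.to_dict(orient='records'))
    X_test = vec.transform(test.to_dict(orient='records'))
    print(sorted(vec.vocabulary_))                 # ['cat=A', 'cat=B']
    print(X_test)                                  # [[1. 0.] [0. 0.]]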
