I am comparing two ways of building a model with scikit-learn, and I cannot understand why one of them returns a result so fundamentally different from the other. The only difference between the two cases (that I know of) is that in the first I one-hot encode the categorical variables once, on the whole data set, and then split into training and test. In the second, I split into training and test first, and then one-hot encode both sets based on the training data.
The second approach is technically better for estimating the generalization error of the process, but it returns a normalized gini that is very different (and bad: essentially no model) compared to the first. I know that the first case's gini (~0.33) is in line with a model built on this data.
Why does the second case return such a different gini? FYI, the data set contains a mix of numeric and categorical variables.
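To make the two cases concrete, here is a minimal sketch on hypothetical toy data. The frame `df`, the helper names `encode_then_split` and `split_then_encode`, and the use of `DictVectorizer` for the one-hot step are assumptions for illustration, not the actual pipeline.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    "num": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "cat": ["a", "b", "a", "c", "b", "a"],
})


def encode_then_split(frame, seed=0):
    """Case 1: one-hot encode using all rows, then split."""
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(frame.to_dict(orient="records"))
    X_train, X_test = train_test_split(X, test_size=0.33, random_state=seed)
    return X_train, X_test, list(vec.feature_names_)


def split_then_encode(frame, seed=0):
    """Case 2: split first, then fit the encoder on the training rows only."""
    train, test = train_test_split(frame, test_size=0.33, random_state=seed)
    vec = DictVectorizer(sparse=False)
    X_train = vec.fit_transform(train.to_dict(orient="records"))
    # transform() reuses the training vocabulary for the test rows.
    X_test = vec.transform(test.to_dict(orient="records"))
    return X_train, X_test, list(vec.feature_names_)


X1_train, X1_test, cols1 = encode_then_split(df)
X2_train, X2_test, cols2 = split_then_encode(df)
print(cols1, X1_train.shape, X1_test.shape)
print(cols2, X2_train.shape, X2_test.shape)
```

In both cases the training and test matrices end up with the same columns. In the second case, a category that appears only in the test rows gets no column of its own, because `DictVectorizer.transform` ignores features it did not see during `fit`.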
Method 1 (one-hot encode the whole data set once, then split). This returns: Validation Sample Score: 0.3454355044 (normalized gini).
    import numpy as np
    import pandas as pd
    from scipy.stats import randint, uniform
    from sklearn import metrics
    from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                                  RandomForestRegressor)
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import mean_squared_error
    # sklearn.model_selection replaces the older sklearn.cross_validation
    # and sklearn.grid_search modules.
    from sklearn.model_selection import (GridSearchCV, KFold, PredefinedSplit,
                                         RandomizedSearchCV, ShuffleSplit,
                                         StratifiedKFold, train_test_split)
    from sklearn.preprocessing import StandardScaler


    def gini(solution, submission):
        # Pair each actual value with its prediction and its original position.
        df = zip(solution, submission, range(len(solution)))
        # Sort by prediction, highest first; ties keep their original order.
        df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
        # Cumulative share expected under a random ordering.
        rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
        totalPos = float(sum([x[0] for x in df]))
        # Running total of the actual values in predicted order.
        cumPosFound = [df[0][0]]
        for i in range(1, len(df)):
            cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
        Lorentz = [float(x) / totalPos for x in cumPosFound]
        Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
        return sum(Gini)


    def normalized_gini(solution, submission):
        # Scale by the gini of a perfect ranking, so a perfect model scores 1.
        return gini(solution, submission) / gini(solution, solution)
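As a sanity check on the metric itself, the following is a NumPy restatement of the `gini` / `normalized_gini` helpers from the listing. The names `gini_np` and `normalized_gini_np` and the toy vectors are hypothetical; this is a sketch meant to show how the score behaves, not a replacement for the code above.

```python
import numpy as np


def gini_np(solution, submission):
    solution = np.asarray(solution, dtype=float)
    submission = np.asarray(submission, dtype=float)
    n = len(solution)
    # Order rows by prediction, highest first; a stable sort keeps ties
    # in their original order, as in the list-based helper.
    order = np.argsort(-submission, kind="stable")
    # Cumulative share of the actual values captured in that order.
    lorentz = np.cumsum(solution[order]) / solution.sum()
    # Cumulative share expected under a random ordering.
    rand = np.arange(1, n + 1) / n
    return float(np.sum(lorentz - rand))


def normalized_gini_np(solution, submission):
    return gini_np(solution, submission) / gini_np(solution, solution)


y = [1.0, 2.0, 3.0, 4.0]
print(normalized_gini_np(y, y))                     # perfect ranking
print(normalized_gini_np(y, [-v for v in y]))       # fully reversed ranking
print(normalized_gini_np(y, [0.1, 0.2, 0.3, 0.4]))  # same ranking, other scale
```

The score depends only on the ordering of the predictions, not on their scale: a perfect ranking gives 1, and on this toy vector a fully reversed ranking gives about -1. A value near 0, as in the second case, therefore suggests the predictions order the validation rows little better than an arbitrary ordering would.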
Method 2 (split first, then one-hot encode). This returns: Validation Sample Score: 0.0055124452 (normalized gini).
    import numpy as np
    import pandas as pd
    from scipy.stats import randint, uniform
    from sklearn import metrics
    from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                                  RandomForestRegressor)
    from sklearn.feature_extraction import DictVectorizer as DV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import mean_squared_error
    # sklearn.model_selection replaces the older sklearn.cross_validation
    # and sklearn.grid_search modules.
    from sklearn.model_selection import (GridSearchCV, KFold, PredefinedSplit,
                                         RandomizedSearchCV, ShuffleSplit,
                                         StratifiedKFold, train_test_split)
    from sklearn.preprocessing import StandardScaler


    def gini(solution, submission):
        # Pair each actual value with its prediction and its original position.
        df = zip(solution, submission, range(len(solution)))
        # Sort by prediction, highest first; ties keep their original order.
        df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
        # Cumulative share expected under a random ordering.
        rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
        totalPos = float(sum([x[0] for x in df]))
        # Running total of the actual values in predicted order.
        cumPosFound = [df[0][0]]
        for i in range(1, len(df)):
            cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
        Lorentz = [float(x) / totalPos for x in cumPosFound]
        Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
        return sum(Gini)


    def normalized_gini(solution, submission):
        # Scale by the gini of a perfect ranking, so a perfect model scores 1.
        return gini(solution, submission) / gini(solution, solution)
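One thing that may be worth ruling out in a split-then-encode workflow is row misalignment when the encoded columns are joined back to the numeric ones. This is an assumption about a possible pitfall, not a confirmed cause of the score above. The sketch below uses a hypothetical toy frame, with `df.iloc[[3, 1]]` standing in for a shuffled training split.

```python
import pandas as pd

# Hypothetical toy frame; rows 3 and 1 stand in for a shuffled training split.
df = pd.DataFrame({
    "num": [10.0, 20.0, 30.0, 40.0],
    "cat": ["a", "b", "a", "b"],
})
train = df.iloc[[3, 1]]  # keeps the original row labels 3 and 1

# Encoded column rebuilt from a bare array, so it gets a fresh 0..n-1 index.
encoded = pd.DataFrame(
    {"cat=a": (train["cat"] == "a").to_numpy().astype(float)}
)

# Joining on mismatched indices yields extra rows padded with NaN.
misaligned = pd.concat([train[["num"]], encoded], axis=1)

# Resetting the index first keeps each encoded row beside its numeric row.
aligned = pd.concat([train[["num"]].reset_index(drop=True), encoded], axis=1)

print(misaligned)
print(aligned)
```

If the joined training matrix has more rows than the training split, or contains NaN values that were not in the raw data, the features and targets are probably no longer lined up, which by itself could push a ranking metric toward zero.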