Problems with one-hot (one-of-K) encoding in Python

One-hot encoding (also known as one-of-K or dummy encoding) creates one binary column for each distinct value of a categorical variable. For example, if you have a color column (a categorical variable) that takes the values red, blue, yellow, and unknown, then one-hot encoding replaces the color column with the binary columns "color=red", "color=blue", "color=yellow", and "color=unknown". I start with the data in a pandas DataFrame and I want to use it to train a model with scikit-learn. I know two ways to do one-hot encoding, but neither of them is satisfactory to me.
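For reference, a minimal sketch of the encoding described above, using pd.get_dummies (the column and value names follow the color example; the variable names are mine):

    import pandas as pd

    # Toy frame mirroring the color example above.
    df = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"]})

    # get_dummies replaces the column with one binary column per observed value.
    dummies = pd.get_dummies(df, columns=["color"])
    print(list(dummies.columns))
    # ['color_blue', 'color_red', 'color_unknown', 'color_yellow']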

  • pandas and get_dummies on the categorical columns of a DataFrame. This approach works well as long as the original DataFrame contains all the available data, i.e. you one-hot encode before splitting your data into training, validation, and test sets. However, if the data has already been split into separate sets, this method does not work well. Why? Because one of the sets (say, the test set) may contain fewer values of a given variable. For example, it may happen that the training set contains the values red, blue, yellow, and unknown for the color variable, while the test set contains only red and blue. The test set will then end up with fewer columns than the training set. (I also don't know how the new columns are ordered, so even if both sets ended up with the same columns, they might be in a different order in each set.)

  • sklearn and DictVectorizer. This solves the previous problem, since we can make sure that the same transformation is applied to the test set. However, the result of the transformation is a numpy array instead of a pandas DataFrame. If we want to get the result back as a DataFrame, we need to (or at least I do): 1) build pandas.DataFrame(data=the DictVectorizer output, index=the index of the original DataFrame, columns=DictVectorizer().get_feature_names()), and 2) join the resulting DataFrame by index with the original one containing the numeric columns. It works, but it is somewhat cumbersome (a sketch of this workflow is shown right after this list).
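To make the second bullet concrete, here is a rough sketch of the workflow it describes. The column names (color, price) and variable names are hypothetical, and it assumes an older scikit-learn where DictVectorizer still has get_feature_names(); newer versions rename it to get_feature_names_out().

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    # Hypothetical train/test frames with one categorical and one numeric column.
    train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"],
                          "price": [1.0, 2.0, 3.0, 4.0]})
    test = pd.DataFrame({"color": ["red", "blue"],
                         "price": [5.0, 6.0]}, index=[10, 11])
    cat_cols = ["color"]

    # Fit on the training set only, then apply the same mapping to the test set.
    vec = DictVectorizer(sparse=False)
    train_arr = vec.fit_transform(train[cat_cols].to_dict(orient="records"))
    test_arr = vec.transform(test[cat_cols].to_dict(orient="records"))

    # The cumbersome part: wrap the arrays back into DataFrames, restore the
    # original index, and rejoin the numeric columns.
    names = vec.get_feature_names()  # get_feature_names_out() in newer scikit-learn
    train_encoded = (pd.DataFrame(train_arr, index=train.index, columns=names)
                     .join(train.drop(cat_cols, axis=1)))
    test_encoded = (pd.DataFrame(test_arr, index=test.index, columns=names)
                    .join(test.drop(cat_cols, axis=1)))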

Is there a better way to do one-hot encoding in a pandas DataFrame when the data is already split into a training and a test set?

+7
python pandas scikit-learn categorical-data
2 answers

If your columns are in the same order in both frames, you can concatenate the DataFrames, apply get_dummies, and then split them again, e.g.:

    import pandas as pd

    # train and test are the existing DataFrames.
    # Encode once on the combined data so both splits get the same dummy columns,
    # then split back by row position.
    encoded = pd.get_dummies(pd.concat([train, test], axis=0))
    train_rows = train.shape[0]
    train_encoded = encoded.iloc[:train_rows, :]
    test_encoded = encoded.iloc[train_rows:, :]

If your columns are not in the same order across the frames, then you will have problems no matter which method you use.
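When the columns do line up, here is a quick toy illustration (hypothetical column and value names) of why encoding after concatenation helps: encoding the two sets separately produces mismatched dummy columns, while the combined encoding gives both splits the same columns, with zeros where the test set never sees a value.

    import pandas as pd

    # Toy data: the test set lacks the values "yellow" and "unknown".
    train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"]})
    test = pd.DataFrame({"color": ["red", "blue"]})

    # Encoding the sets separately yields different column sets.
    print(pd.get_dummies(train).shape[1])  # 4 dummy columns
    print(pd.get_dummies(test).shape[1])   # 2 dummy columns

    # Concatenate, encode once, then split by row count as in the answer above.
    encoded = pd.get_dummies(pd.concat([train, test], axis=0))
    train_encoded = encoded.iloc[:train.shape[0], :]
    test_encoded = encoded.iloc[train.shape[0]:, :]
    print(list(test_encoded.columns))
    # ['color_blue', 'color_red', 'color_unknown', 'color_yellow'] -- same in both splits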

+9

You can make the data type of the column categorical, specifying the full set of categories; get_dummies then creates a column for every category, even ones that do not appear in the frame:

    In [5]: df_train = pd.DataFrame({"car": Series(["seat", "bmw"]).astype('category', categories=['seat', 'bmw', 'mercedes']),
       ...:                          "color": ["red", "green"]})

    In [6]: df_train
    Out[6]:
        car  color
    0  seat    red
    1   bmw  green

    In [7]: pd.get_dummies(df_train)
    Out[7]:
       car_seat  car_bmw  car_mercedes  color_green  color_red
    0         1        0             0            0          1
    1         0        1             0            1          0

See this pandas question.
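One caveat, as an aside: in newer pandas versions the categories argument to astype shown above no longer works; as far as I know, the equivalent (assuming pandas 0.21 or later) is to build a CategoricalDtype explicitly. Once the dtype lists every possible category, get_dummies still produces a column for each category even if it never occurs in the frame:

    import pandas as pd

    # Declare the full set of allowed categories up front (hypothetical values).
    car_type = pd.CategoricalDtype(categories=["seat", "bmw", "mercedes"])

    df_test = pd.DataFrame({"car": pd.Series(["seat"]).astype(car_type),
                            "color": ["red"]})

    print(list(pd.get_dummies(df_test).columns))
    # ['car_seat', 'car_bmw', 'car_mercedes', 'color_red']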

0
