Store the same dummy variable in training and testing data

Question


I am building a prediction model in Python with two separate datasets for training and testing. The training data contains numeric categorical variables, for example a zip code [91521, 23151, 12355, ...], as well as string categorical variables, for example a city [Chicago, New York, Los Angeles, ...].

To train the model, I first use 'pd.get_dummies' to create dummy variables for these columns, and then fit the model on the converted training data.

I apply the same conversion to my test data and predict the result using the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, because it contains fewer distinct "city" and "zip code" values.

How can I solve this problem? For example, 'OneHotEncoder' will only encode categorical variables of a numeric type, and 'DictVectorizer()' will only encode categorical variables of a string type. I searched online and found several similar questions, but none of them really addresses mine.
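A minimal sketch of how the mismatch arises, using hypothetical toy data (the column names `zipcode` and `city` and all values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data: the test set contains fewer distinct categories
train_raw = pd.DataFrame({
    "zipcode": [91521, 23151, 12355],
    "city": ["Chicago", "New York", "Los Angeles"],
})
test_raw = pd.DataFrame({
    "zipcode": [91521, 23151],
    "city": ["Chicago", "Chicago"],
})

# Passing columns= makes get_dummies encode the numeric zip code as well
cat_cols = ["zipcode", "city"]
train = pd.get_dummies(train_raw, columns=cat_cols)
test = pd.get_dummies(test_raw, columns=cat_cols)

# Each frame gets one column per category it happens to contain,
# so the two feature counts differ
print(train.shape[1], test.shape[1])
```

A model fitted on `train` would then reject `test` because the number of columns differs.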

Handling categorical features with scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python


python scikit-learn dataframe dummy-variable prediction

nimning Dec 26 '16 at 19:54


3 answers

Thibault clement · Answer 1 · 2017-07-28T04:59:05+0000

You can also just get the missing columns and add them to the test dataset:

    # Find columns present in the training set but missing from the test set
    missing_cols = set(train.columns) - set(test.columns)
    # Add each missing column to the test set with a default value of 0
    for c in missing_cols:
        test[c] = 0
    # Put the test columns in the same order as in the training set
    test = test[train.columns]

This code also ensures that any column created from a category that appears in the test dataset but not in the training dataset is dropped.
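As a rough end-to-end sketch of the approach above, assuming hypothetical toy data with a single `city` column (all values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: "Boston" appears only in test,
# "New York" and "Los Angeles" appear only in train
train_raw = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test_raw = pd.DataFrame({"city": ["Chicago", "Boston"]})

# dtype=int keeps the dummy columns numeric, matching the 0 fill below
train = pd.get_dummies(train_raw, dtype=int)
test = pd.get_dummies(test_raw, dtype=int)

# Add columns seen only in training, filled with 0
missing_cols = set(train.columns) - set(test.columns)
for c in missing_cols:
    test[c] = 0

# Reorder to match training; this also drops the test-only city_Boston column
test = test[train.columns]

# A possible one-line alternative (not from the answer above): reindex
test_alt = pd.get_dummies(test_raw, dtype=int).reindex(
    columns=train.columns, fill_value=0
)

print(list(test.columns))
```

Both `test` and `test_alt` end up with exactly the training columns, in the training order.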

Eduard Ilyasov · Answer 2 · 2016-12-27T04:34:50+0000

Assuming the train and test datasets have identical feature names, you can concatenate them into a single dataset, get the dummies from the concatenated dataset, and then split it back into train and test.

You can do it as follows:

    import pandas as pd

    train = pd.DataFrame(data=[['a', 123, 'ab'], ['b', 234, 'bc']],
                         columns=['col1', 'col2', 'col3'])
    test = pd.DataFrame(data=[['c', 345, 'ab'], ['b', 456, 'ab']],
                        columns=['col1', 'col2', 'col3'])

    # Remember how many rows belong to the training set
    train_objs_num = len(train)
    # Stack train on top of test and encode them together
    dataset = pd.concat(objs=[train, test], axis=0)
    dataset_preprocessed = pd.get_dummies(dataset)
    # Split the encoded data back into train and test
    train_preprocessed = dataset_preprocessed[:train_objs_num]
    test_preprocessed = dataset_preprocessed[train_objs_num:]

As a result, the train and test datasets have the same number of features.

user1482030 · Answer 3 · 2017-11-11T16:50:05+0000

    train2, test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. The fill_value argument sets the value used for missing columns.
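A small sketch of what `align` does here, assuming two hypothetical, already-encoded frames (the column names are made up). It also shows `join='left'`, a variation not in the answer above, which keeps only the training columns:

```python
import pandas as pd

# Hypothetical dummy-encoded frames with partly different columns
train = pd.DataFrame({"city_Chicago": [1, 0], "city_New York": [0, 1]})
test = pd.DataFrame({"city_Boston": [1, 0], "city_Chicago": [0, 1]})

# join='outer' keeps the union of columns in both frames;
# a column missing from one frame is filled with 0 there
train_outer, test_outer = train.align(test, join="outer", axis=1, fill_value=0)

# join='left' keeps only the columns of the left frame (train),
# so the test-only city_Boston column is dropped from the test frame
train_left, test_left = train.align(test, join="left", axis=1, fill_value=0)

print(list(train_outer.columns))
print(list(test_left.columns))
```

With `join='outer'`, categories that appear only in the test data become extra all-zero columns in the training frame, so the alignment has to happen before the model is fitted. With `join='left'`, the training columns are left unchanged.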
