Keep the same dummy variables in training and testing data

I am building a prediction model in Python with two separate datasets for training and testing. The training data contains numerical categorical variables, for example a zip code [91521, 23151, 12355, ...], as well as string categorical variables, for example a city [Chicago, New York, Los Angeles, ...].

To train the model, I first use "pd.get_dummies" to create dummy variables for these columns, and then fit the model on the transformed training data.

I apply the same transformation to my test data and predict the result using the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, because it contains fewer distinct "city" and "zip code" values.
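The mismatch can be reproduced on a small scale. The sketch below uses made-up data (the column names `zipcode` and `city` and all values are assumptions for illustration) and encodes each set independently, which is what leads to differing column counts:

```python
import pandas as pd

# Hypothetical toy data: the test set lacks "Los Angeles" and zip code 12355.
train = pd.DataFrame({"zipcode": [91521, 23151, 12355],
                      "city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"zipcode": [91521, 23151],
                     "city": ["Chicago", "New York"]})

# Encode each set on its own, treating zipcode as categorical as well.
train_dummies = pd.get_dummies(train, columns=["zipcode", "city"])
test_dummies = pd.get_dummies(test, columns=["zipcode", "city"])

# Number of dummy columns in each set, and the ones the test set lacks.
print(train_dummies.shape[1], test_dummies.shape[1])
missing = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(missing)
```

A model fitted on `train_dummies` expects more features than `test_dummies` provides, which is the situation the error message describes.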

How can I solve this problem? For example, "OneHotEncoder" will only encode numerical categorical variables, and 'DictVectorizer()' will only encode string categorical variables. I searched online and found several similar questions, but none of them really addresses my question.

Handling categorical features with scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python

+8
python scikit-learn dataframe dummy-variable prediction
3 answers

You can also just get the missing columns and add them to the test dataset:

    # Columns present in the training set but missing from the test set
    missing_cols = set(train.columns) - set(test.columns)
    # Add each missing column to the test set with a default value of 0
    for c in missing_cols:
        test[c] = 0
    # Put the test columns in the same order as in the training set
    test = test[train.columns]

This code also ensures that any column derived from a category that appears in the test set but not in the training set is dropped.
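The same alignment can probably be written in a single step with `DataFrame.reindex`. This is a minimal sketch, not part of the original answer; the helper name `match_columns` and the toy data are made up for illustration:

```python
import pandas as pd

def match_columns(train_dummies, test_dummies):
    """Return test_dummies with exactly the columns of train_dummies, in the same order."""
    # reindex adds train-only columns (filled with 0) and drops test-only columns
    return test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# Hypothetical data: "Boston" appears only in test, "New York" only in train.
train = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York"]}))
test = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}))

test_aligned = match_columns(train, test)
print(list(test_aligned.columns))
```

Unlike adding columns in a loop, this returns a new frame and leaves the original test frame unchanged.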

+12

Assume you have identical feature names in the train and test datasets. You can concatenate the train and test sets, get dummies from the concatenated dataset, and then split it back into train and test.

You can do it as follows:

    import pandas as pd

    train = pd.DataFrame(data=[['a', 123, 'ab'], ['b', 234, 'bc']],
                         columns=['col1', 'col2', 'col3'])
    test = pd.DataFrame(data=[['c', 345, 'ab'], ['b', 456, 'ab']],
                        columns=['col1', 'col2', 'col3'])

    train_objs_num = len(train)
    dataset = pd.concat(objs=[train, test], axis=0)
    dataset_preprocessed = pd.get_dummies(dataset)
    train_preprocessed = dataset_preprocessed[:train_objs_num]
    test_preprocessed = dataset_preprocessed[train_objs_num:]

As a result, the training and test sets have the same number of features.
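One caveat for numeric categories such as zip codes: by default `pd.get_dummies` only encodes object, string, and category columns, so a numeric column passes through unchanged unless it is listed in `columns=`. The following is a sketch of the concatenation approach with that adjustment; the column names `zipcode` and `city` and the data are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: zip code 12355 appears only in the test set.
train = pd.DataFrame({"zipcode": [91521, 23151], "city": ["Chicago", "New York"]})
test = pd.DataFrame({"zipcode": [12355, 91521], "city": ["Chicago", "Chicago"]})

n_train = len(train)
combined = pd.concat([train, test], axis=0, ignore_index=True)

# Name the columns explicitly so the numeric zipcode is encoded as well.
encoded = pd.get_dummies(combined, columns=["zipcode", "city"])

# Split back into the original training and test rows.
train_encoded = encoded.iloc[:n_train]
test_encoded = encoded.iloc[n_train:]
```

Note that this approach requires the test set to be available when the training features are built, since the training columns then depend on categories seen in the test data.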

+11
    train2, test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. fill_value specifies the value used to fill columns that were missing from one of the frames.
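The choice of `join` matters here. As a rough illustration with made-up data, `join='outer'` gives both frames the union of the columns (so the training frame also gains test-only columns), while `join='left'` keeps only the columns of the frame that `align` is called on:

```python
import pandas as pd

# Hypothetical data: "Boston" appears only in test, "New York" only in train.
train = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "New York"]}))
test = pd.get_dummies(pd.DataFrame({"city": ["Chicago", "Boston"]}))

# join='outer': both frames get the union of the columns
train_outer, test_outer = train.align(test, join="outer", axis=1, fill_value=0)

# join='left': both frames get only the columns of train
train_left, test_left = train.align(test, join="left", axis=1, fill_value=0)

print(list(train_outer.columns))
print(list(test_left.columns))
```

If the model has already been fitted on the training columns, `join='left'` is likely the safer choice, because it does not introduce columns the model has never seen.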

+1
