I am building a prediction model in Python with two separate data sets for training and testing. The training data contains categorical variables of numerical type, for example a zip code [91521, 23151, 12355, ...], as well as categorical variables of string type, for example a city [Chicago, New York, Los Angeles, ...].
To train the model, I first use "pd.get_dummies" to create dummy variables for these columns, and then fit the model on the converted training data.
I apply the same conversion to my test data and predict with the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, because it contains fewer distinct "city" and "zip code" values.
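To make the mismatch concrete, here is a minimal sketch with toy data. The column names "zipcode" and "city" and all values are made up for illustration; the real data has far more categories.

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration only.
train = pd.DataFrame({
    "zipcode": [91521, 23151, 12355],
    "city": ["Chicago", "New York", "Los Angeles"],
})
test = pd.DataFrame({
    "zipcode": [91521, 23151],
    "city": ["Chicago", "Chicago"],
})

# Passing columns= makes get_dummies treat the numeric zip code as categorical too.
X_train = pd.get_dummies(train, columns=["zipcode", "city"])
X_test = pd.get_dummies(test, columns=["zipcode", "city"])

# Each frame only gets dummy columns for the categories it actually contains.
print(X_train.shape)  # (3, 6)
print(X_test.shape)   # (2, 3)
```

A model fitted on X_train expects 6 features, so predicting on X_test with 3 features fails in the same way as the error above.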
How can I solve this problem? For example, "OneHotEncoder" will encode only categorical variables of numerical type, and "DictVectorizer()" will encode only categorical variables of string type. I searched online and found several similar questions, but none of them really addresses my problem.
Handling categorical features with scikit-learn
https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do
https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python
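For context, one workaround that might apply here (it is not taken from the links above) is to reindex the test dummies to the training columns. This is only a sketch under assumptions: the toy data and the column names "zipcode" and "city" are hypothetical, and it assumes categories unseen during training can simply be dropped.

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration only.
train = pd.DataFrame({
    "zipcode": [91521, 23151, 12355],
    "city": ["Chicago", "New York", "Los Angeles"],
})
test = pd.DataFrame({
    "zipcode": [91521, 99999],      # 99999 never appears in training
    "city": ["Chicago", "Boston"],  # Boston never appears in training
})

cat_cols = ["zipcode", "city"]
X_train = pd.get_dummies(train, columns=cat_cols, dtype=int)
X_test = pd.get_dummies(test, columns=cat_cols, dtype=int)

# Align the test frame to the training columns:
# columns missing from the test data are added and filled with 0,
# columns for categories unseen in training are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_test.shape)  # (2, 6)
```

After the reindex, X_test has the same columns in the same order as X_train, so the feature count matches what the fitted model expects. Rows with unseen categories end up with all zeros for that variable, which may or may not be acceptable for a given model.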
python scikit-learn dataframe dummy-variable prediction
nimning