I am trying to use Patsy (with sklearn and pandas) to build a simple regression model. The R-style formula interface is the main attraction for me.
My data contains a "ship_city" field, which can be any city in India. Since I split the data into train and test sets, some cities appear in only one of the two sets. Here is a snippet of the code:
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line raises the following error:
patsy.PatsyError: Error converting data to categorical: observation with value 'Calcutta' does not match any of the expected levels
I believe this is a very common use case: the training data will not always contain every level of every categorical field. Sklearn's DictVectorizer handles this quite well.
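To make the comparison concrete, here is a minimal sketch of the DictVectorizer behaviour I mean. The city values and the tiny train/test lists are made up for illustration and are not my real data; the point is only that a category unseen during fitting is silently ignored at transform time instead of raising an error.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: "Calcutta" appears only in the test set.
train = [{"ship_city": "Mumbai"}, {"ship_city": "Delhi"}]
test = [{"ship_city": "Delhi"}, {"ship_city": "Calcutta"}]

vec = DictVectorizer(sparse=False)

# Fitting learns one indicator column per city seen in the training data.
X_train = vec.fit_transform(train)

# Transforming the test set reuses those columns; the unseen city
# simply yields a row of zeros rather than an error.
X_test = vec.transform(test)

print(vec.feature_names_)
print(X_test)
```

The row for the unseen city comes out as all zeros, which is the behaviour I would like to reproduce on the Patsy side.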
Is there any way to do this with Patsy?
python scikit-learn patsy
DaSarfyCode