Patsy: new levels in categorical fields in test data

I am trying to use Patsy (with sklearn and pandas) to create a simple regression model. Being able to specify the model with an R-style formula is a key requirement for me.

My data contains a "ship_city" field, which can be any city in India. Since I split the data into train and test sets, there are several cities that appear in only one of the two sets. Here is a snippet of the code:

    df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
    df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
    df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder],
                                                 df_test, return_type='dataframe')

The following error appears on the last line:

patsy.PatsyError: Error converting data to categorical: observation with value 'Calcutta' does not match any of the expected levels

I believe this is a very common use case: the training data will not always contain all the levels of every categorical field. Sklearn's DictVectorizer handles this nicely.
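(For reference, this is the DictVectorizer behaviour I mean; the city values below are just placeholders. A vectorizer fitted on the training dicts simply ignores values it has never seen at transform time:)

    from sklearn.feature_extraction import DictVectorizer

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform([{"ship_city": "Mumbai"}, {"ship_city": "Delhi"}])
    # "Calcutta" was never seen during fit, so it is silently dropped and the
    # first test row simply has zeros in all of the known city columns.
    X_test = dv.transform([{"ship_city": "Calcutta"}, {"ship_city": "Delhi"}])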

Is there any way to do this with Patsy?

python scikit-learn patsy
2 answers

Right, the problem is that if you just give patsy a raw list of values, it has no way of knowing that there are other values that could also occur. You have to tell it somehow what the complete set of possible values is.

One way is to use the levels= argument to C(...), like:

    # If you have a data frame with all the data before splitting:
    all_cities = sorted(df_all["Cities"].unique())
    # Alternative approach:
    all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

    dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)

Another option, if you are using pandas' categorical support, is to record the full set of possible values when you set up your data frame; if patsy detects that the object you passed in is a pandas Categorical, then it automatically uses the pandas categories attribute instead of trying to guess the possible categories by looking at the data.
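A minimal sketch of that approach (the column and city names are illustrative, and all_cities is assumed to be the full list of possible cities):

    import pandas as pd
    from patsy import dmatrices, build_design_matrices

    all_cities = ["Bombay", "Calcutta", "Delhi"]  # full set of possible levels

    # Record the complete set of levels on the column itself; patsy reads the
    # categories from the pandas Categorical instead of guessing from the data.
    df_train["ship_city"] = pd.Categorical(df_train["ship_city"], categories=all_cities)
    df_test["ship_city"] = pd.Categorical(df_test["ship_city"], categories=all_cities)

    y_train, X_train = dmatrices("y ~ ship_city", data=df_train, return_type="dataframe")
    # (on older patsy versions, pass .design_info.builder instead of .design_info)
    y_test, X_test = build_design_matrices([y_train.design_info, X_train.design_info],
                                           df_test, return_type="dataframe")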


I had a similar problem, and my workaround was to build the design matrix before splitting the data.

    from patsy import dmatrices
    from sklearn.model_selection import train_test_split

    df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
    df_train_X, df_test_X, df_train_Y, df_test_Y = \
        train_test_split(df_X, df_Y, test_size=test_size)

Then, as an example of fitting a model:

    # (assumes statsmodels is imported, e.g. import statsmodels.api as smf)
    model = smf.OLS(df_train_Y, df_train_X)
    model2 = model.fit()
    predicted = model2.predict(df_test_X)

Technically I did not build separate design matrices for the test set, but either way I did not run into the "Error converting data to categorical" error after doing the above.

