How to create an interaction design matrix from categorical variables?

I work mainly in R for statistical modeling / machine learning and want to improve my skills in Python. I am wondering how best to create a design matrix for categorical interactions (to an arbitrary degree) in Python.

Toy example:

    import pandas as pd
    from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`

    page = urlopen("http://www.shatterline.com/MachineLearning/data/tennis_anyone.csv")
    df = pd.read_csv(page)
    df.head(n=5)


Suppose we want to create interactions between Outlook, Temp, and Humidity. Is there an effective way to do this? I can manually do something like this in pandas:

    # pd.lib.fast_zip is an internal helper available in older pandas releases;
    # it zips the two value arrays into tuples, which factorize turns into integer codes
    OutTempFact = pd.Series(
        pd.factorize(pd.lib.fast_zip([df.Outlook.values, df.Temperature.values]))[0],
        name='OutTemp')
    OutHumFact = pd.Series(
        pd.factorize(pd.lib.fast_zip([df.Outlook.values, df.Humidity.values]))[0],
        name='OutHum')
    TempHumFact = pd.Series(
        pd.factorize(pd.lib.fast_zip([df.Temperature.values, df.Humidity.values]))[0],
        name='TempHum')

    IntFacts = pd.concat([OutTempFact, OutHumFact, TempHumFact], axis=1)
    IntFacts.head(n=5)


which I could then pass to scikit-learn's one-hot encoder, but there is probably a much better and less manual way of creating interactions between categorical variables, without having to go through each combination by hand.

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    IntFacts_OH = enc.fit_transform(IntFacts)  # returns a sparse matrix by default
    IntFacts_OH.todense()
2 answers

If you apply OneHotEncoder to your data to obtain a one-hot design matrix, then interactions are nothing more than products of columns. If X_1hot is your one-hot design matrix, with samples as rows, then for second-order interactions you can write

 X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1) 

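There will be duplicate interactions (each pair of columns appears in both orders), and the result will also contain the original features, because the product of a 0/1 indicator with itself is the indicator again.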
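To make the duplication concrete, here is a minimal sketch on a made-up one-hot matrix (the values are illustrative and not taken from the tennis data). It shows that the diagonal of the pairwise products reproduces the original columns, and that one way to keep each unordered pair only once is to take the strictly upper triangle with np.triu_indices.

```python
import numpy as np

# Hypothetical one-hot matrix: 3 samples, two variables with 2 states each
X_1hot = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])
n_samples, n_features = X_1hot.shape

# All ordered pairs (i, j): shape (n_samples, n_features, n_features)
outer = X_1hot[:, :, np.newaxis] * X_1hot[:, np.newaxis, :]

# The diagonal (i == j) reproduces the original indicators, since x * x == x for 0/1 values
diagonal = outer[:, np.arange(n_features), np.arange(n_features)]

# Keep each unordered pair once: strictly upper triangle (i < j)
rows, cols = np.triu_indices(n_features, k=1)
X_pairs_unique = outer[:, rows, cols]
```

With 4 one-hot columns this leaves 6 interaction columns instead of 16. Pairs of states belonging to the same variable are still included here and are always zero, so they could be dropped as well.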

Going to arbitrary order will make your design matrix explode. If you really want this, you should look into kernelizing with a polynomial kernel, which lets you go to an arbitrary degree easily.
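As a rough illustration of why a polynomial kernel avoids the blow-up, the sketch below checks on a made-up one-hot matrix that the Gram matrix of the explicit second-order features equals the squared Gram matrix of the original features (a homogeneous polynomial kernel of degree 2). The toy values and the variable names gram_explicit and gram_kernel are assumptions for illustration.

```python
import numpy as np

# Hypothetical one-hot matrix: 3 samples, 4 indicator columns
X_1hot = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])

# Explicit second-order features: d**2 columns per sample
X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1)

# Gram matrix computed from the explicit features
gram_explicit = X_2nd_order @ X_2nd_order.T

# Homogeneous polynomial kernel of degree 2: the d**2 columns are never formed
gram_kernel = (X_1hot @ X_1hot.T) ** 2
```

Both matrices agree, and the same identity holds for higher degrees with the exponent changed accordingly. A kernel method can then work from the Gram matrix alone, for example scikit-learn's SVC with kernel="poly" or kernel="precomputed".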

Using the data frame you presented, we can proceed as follows. First, a manual way to build a one-hot design matrix from a data frame:

    import numpy as np

    indicators = []
    state_names = []
    for column_name in df.columns:
        column = df[column_name].values
        # One indicator column per distinct state of this variable
        one_hot = (column[:, np.newaxis] == np.unique(column)).astype(float)
        indicators.append(one_hot)
        state_names = state_names + ["%s__%s" % (column_name, state)
                                     for state in np.unique(column)]

    X_1hot = np.hstack(indicators)

The column names are then stored in state_names, and the indicator matrix is X_1hot. Then we compute the second-order features:

 X_2nd_order = (X_1hot[:, np.newaxis, :] * X_1hot[:, :, np.newaxis]).reshape(len(X_1hot), -1) 

To find out the column names of the second-order matrix, we build them like this:

    from itertools import product

    one_hot_interaction_names = [
        "%s___%s" % (column1, column2)
        for column1, column2 in product(state_names, state_names)
    ]
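For interactions between different variables only, up to a chosen degree, one possible sketch combines itertools.combinations with pandas.get_dummies. The toy frame and the helper name interaction_design are made up for illustration; they come neither from the tennis data nor from any library.

```python
from itertools import combinations

import pandas as pd

# Made-up stand-in for the tennis data (values are illustrative only)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool"],
    "Humidity": ["High", "High", "High", "High", "Normal"],
})


def interaction_design(frame, columns, degree):
    """One-hot encode every `degree`-way combination of `columns` (hypothetical helper)."""
    blocks = []
    for combo in combinations(columns, degree):
        # Join the levels of the chosen columns into one combined label per row
        labels = ["__".join(levels)
                  for levels in zip(*(frame[c].astype(str) for c in combo))]
        combined = pd.Series(labels, index=frame.index)
        # One indicator column per combination that actually occurs in the data
        blocks.append(pd.get_dummies(combined, prefix=":".join(combo), dtype=float))
    return pd.concat(blocks, axis=1)


X_pairs = interaction_design(toy, ["Outlook", "Temperature", "Humidity"], degree=2)
```

Only combinations observed in the data get a column, so the matrix stays smaller than the full product of all states. Changing degree to 3 gives the three-way interactions in the same way.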

Having recently faced a similar problem, namely wanting a simple way to include specific interactions from a baseline OLS model from the literature for comparison with ML approaches, I came across patsy ( http://patsy.readthedocs.io/en/latest/overview.html ) and its scikit-learn integration, patsylearn ( https://github.com/amueller/patsylearn ).

Below is how the interaction variables can be passed to the model:

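    from sklearn.linear_model import LinearRegression
    from patsylearn import PatsyModel

    # Q("...") quotes the column name so patsy does not parse the hyphen as subtraction
    model = PatsyModel(
        LinearRegression(),
        'Q("Play-Tennis") ~ C(Outlook):C(Temperature) + C(Outlook):C(Humidity) + C(Outlook):C(Wind)',
    )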

Note that in this formulation you do not need OneHotEncoder(), since the C in the formula tells the patsy interpreter that these are categorical variables, and they are one-hot encoded for you. You can read more about this in the patsy documentation ( http://patsy.readthedocs.io/en/latest/categorical-coding.html ).

Alternatively, you can use the PatsyTransformer, which I prefer, as it integrates easily into scikit-learn Pipelines:

    from patsylearn import PatsyTransformer

    transformer = PatsyTransformer(
        "C(Outlook):C(Temperature) + C(Outlook):C(Humidity) + C(Outlook):C(Wind)"
    )