Seaborn is a great package for high-level plotting with nice-looking output. However, I'm struggling a bit with using Seaborn to overlay both data and model predictions from an externally fit model. In this example I fit models in Statsmodels that are too complex for Seaborn to handle out of the box, but I think the problem is more general (i.e. I have model predictions and want to visualize both them and the data using Seaborn).
Let's start with the imports and a dataset:
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import statsmodels.formula.api as smf
    import patsy
    import itertools
    import matplotlib.pyplot as plt

    np.random.seed(12345)
We fit a model in Statsmodels containing all predictor variables and their interactions:
    # fit a model with all interactions
    fit = smf.ols('y ~ x1 * x2 * x3', df).fit()
    print(fit.summary())
Since in this case every combination of the variables is present in the data, and our model predictions are linear, it would be enough for plotting to add a new "predictions" column to the dataframe containing the model predictions. However, this is not very general (suppose our model is non-linear and we want the plots to show smooth curves), so instead I create a new dataframe with all combinations of the predictors and then generate predictions:
    # create a new dataframe of predictions, using pandas' expand grid:
    def expand_grid(data_dict):
        """ A port of R expand.grid function for use with Pandas dataframes.

        from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid
        """
        rows = itertools.product(*data_dict.values())
        return pd.DataFrame.from_records(rows, columns=data_dict.keys())

    # build a new matrix with expand grid:
    preds = expand_grid(
        {'x1': np.linspace(df['x1'].min(), df['x1'].max(), 2),
         'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)
The preds dataframe looks like this:
      x3        x1 x2      yhat
    0  c -2.370232  a -1.555902
    1  c -2.370232  b -2.307295
    2  c  3.248944  a -1.555902
    3  c  3.248944  b -2.307295
    4  d -2.370232  a -1.609652
    5  d -2.370232  b -2.837075
    6  d  3.248944  a -1.609652
    7  d  3.248944  b -2.837075
Since the Seaborn plotting commands (unlike the ggplot2 commands in R) seem to accept one and only one dataframe, we need to merge our predictions into the raw data:
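The merge step itself is not shown here, so the following is only a minimal sketch of one way it could look. It assumes the raw data has a `y` column and the prediction grid has a `yhat` column; the toy values are made up for illustration, and `pd.concat` is just one option for stacking the two frames.

```python
import pandas as pd

# Toy stand-ins for the real `df` and `preds` (values are made up).
df = pd.DataFrame({
    'x1': [0.1, -0.5, 1.2, 0.7],
    'x2': ['a', 'b', 'a', 'b'],
    'x3': ['c', 'c', 'd', 'd'],
    'y':  [1.0, 2.0, 3.0, 4.0],
})
preds = pd.DataFrame({
    'x1': [-0.5, 1.2],
    'x2': ['a', 'a'],
    'x3': ['c', 'c'],
    'yhat': [0.9, 1.1],
})

# Stack the two frames row-wise. Columns missing from one frame are filled
# with NaN, so data rows have NaN in `yhat` and prediction rows have NaN in `y`.
merged = pd.concat([df, preds], ignore_index=True, sort=False)
print(merged)
```

The result is a single dataframe in which each row is either an observation (with `y`) or a prediction (with `yhat`), which is the shape a single-dataframe plotting call needs.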
Now we can plot the model predictions along with the data, with our continuous variable x1 as the x-axis:
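    # plot using seaborn:
    sns.set_style('white')
    sns.set_context('talk')
    # `height` was called `size` before seaborn 0.9
    g = sns.FacetGrid(merged, hue='x2', col='x3', height=5)
    # lines for the model predictions, scatter points for the data
    g.map(plt.plot, 'x1', 'yhat')
    g.map(plt.scatter, 'x1', 'y')
    g.add_legend()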

So far so good. Now imagine that we did not measure the continuous variable x1 and only know about the other two (categorical) variables (i.e. we have a 2x2 factorial design). How can we plot the model predictions together with the data in this case?
    fit = smf.ols('y ~ x2 * x3', df).fit()
    print(fit.summary())

    preds = expand_grid(
        {'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)
    print(preds)
Well, we can plot the model predictions using sns.pointplot or similar, for example:
    # plot using seaborn:
    # `height` was called `size` before seaborn 0.9
    g = sns.FacetGrid(merged, hue='x3', height=4)
    g.map(sns.pointplot, 'x2', 'yhat')
    g.add_legend()
    sns.despine(offset=10)

Or the data using sns.catplot (called sns.factorplot before seaborn 0.9), like so:
    g = sns.catplot(x='x2', y='y', hue='x3', kind='point', data=merged)
    sns.despine(offset=10)
    g.savefig('tmp3.png')

But I don't see how to produce a plot similar to the first one (i.e. lines for the model predictions using plt.plot, a scatter of points for the data using plt.scatter). The reason is that the x2 variable I'm trying to use as the x-axis is a string/object, so the pyplot commands don't know what to do with it.
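To make the obstacle concrete, here is a rough sketch in plain matplotlib of what the desired plot needs: each level of `x2` is mapped to an integer position by hand, and lines and points are then drawn against those positions. The data values, the `x2_code` column and the `codes` mapping are all made up for illustration, and this sidesteps Seaborn's faceting rather than using it.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data and predictions for a 2x2 design (values are made up).
data = pd.DataFrame({
    'x2': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'x3': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
    'y':  [1.0, 2.1, 1.5, 3.2, 0.8, 1.9, 1.7, 2.9],
})
preds = pd.DataFrame({
    'x2': ['a', 'b', 'a', 'b'],
    'x3': ['c', 'c', 'd', 'd'],
    'yhat': [0.9, 2.0, 1.6, 3.05],
})

# Map each level of x2 to an integer position on the x-axis.
levels = sorted(data['x2'].unique())
codes = {level: i for i, level in enumerate(levels)}
data['x2_code'] = data['x2'].map(codes)
preds['x2_code'] = preds['x2'].map(codes)

fig, ax = plt.subplots()
for level, color in zip(sorted(preds['x3'].unique()), ['C0', 'C1']):
    p = preds[preds['x3'] == level].sort_values('x2_code')
    d = data[data['x3'] == level]
    ax.plot(p['x2_code'], p['yhat'], color=color, label=level)  # predictions as lines
    ax.scatter(d['x2_code'], d['y'], color=color)               # data as points

# Label the integer positions with the original category names.
ax.set_xticks(list(codes.values()))
ax.set_xticklabels(levels)
ax.set_xlabel('x2')
ax.legend(title='x3')
```

The manual coding step is exactly the bookkeeping that a categorical-aware plotting function would normally do internally.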