Plot data and model predictions in one figure using Seaborn and Statsmodels

Question


Seaborn is a great package for making high-level plots with good-looking output. However, I'm struggling a bit with using Seaborn to overlay both data and model predictions from an externally fitted model. In this example, I fit models in Statsmodels that are too complicated for Seaborn to do out of the box, but I think the problem is more general (i.e. if I have model predictions and want to visualize them together with the data using Seaborn).

Let's start with the imports and the dataset:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import statsmodels.formula.api as smf
    import patsy
    import itertools
    import matplotlib.pyplot as plt

    np.random.seed(12345)

    # make a data frame with one continuous and two categorical variables:
    df = pd.DataFrame({'x1': np.random.normal(size=100),
                       'x2': np.tile(np.array(['a', 'b']), 50),
                       'x3': np.repeat(np.array(['c', 'd']), 50)})

    # create a design matrix using patsy:
    X = patsy.dmatrix('x1 * x2 * x3', df)

    # some random beta weights:
    betas = np.random.normal(size=X.shape[1])

    # create the response variable as the noisy linear combination of predictors:
    df['y'] = np.inner(X, betas) + np.random.normal(size=100)

We fit a model in Statsmodels containing all predictor variables and their interactions:

    # fit a model with all interactions
    fit = smf.ols('y ~ x1 * x2 * x3', df).fit()
    print(fit.summary())

Since in this case we have all combinations of the variables specified, and our model predictions are linear, it would suffice for plotting to add a new "predictions" column to the dataframe containing the model predictions. However, this is not very general (imagine our model is nonlinear, so we want our plots to show smooth curves), so instead I make a new dataframe with all combinations of predictors, and then generate predictions:

    # create a new dataframe of predictions, using pandas' expand grid:
    def expand_grid(data_dict):
        """ A port of R expand.grid function for use with Pandas dataframes.

        from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid
        """
        rows = itertools.product(*data_dict.values())
        return pd.DataFrame.from_records(rows, columns=data_dict.keys())

    # build a new matrix with expand grid:
    preds = expand_grid(
        {'x1': np.linspace(df['x1'].min(), df['x1'].max(), 2),
         'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)
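As a rough illustration of the smooth-curve case mentioned above, the same grid idea can be used with more points along the continuous axis. This is only a sketch: the range of `x1` and the name `dense` are made up for the example (in the question the range comes from `df['x1'].min()` and `df['x1'].max()`), and the grid would still need to be passed to `fit.predict`.

```python
import itertools

import numpy as np
import pandas as pd


def expand_grid(data_dict):
    """Cartesian product of the dict's values, one column per key."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))


# Hypothetical x1 range, chosen only for this sketch.
dense = expand_grid({
    "x1": np.linspace(-2.0, 2.0, 50),  # 50 points instead of 2
    "x2": ["a", "b"],
    "x3": ["c", "d"],
})

# One row per combination: 50 * 2 * 2 rows, three predictor columns.
print(dense.shape)
print(dense.head())
```

With two points per line a linear model is drawn exactly; a denser grid like this only matters when the fitted curve is not a straight line.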

The preds dataframe looks like this:

      x3        x1 x2      yhat
    0  c -2.370232  a -1.555902
    1  c -2.370232  b -2.307295
    2  c  3.248944  a -1.555902
    3  c  3.248944  b -2.307295
    4  d -2.370232  a -1.609652
    5  d -2.370232  b -2.837075
    6  d  3.248944  a -1.609652
    7  d  3.248944  b -2.837075

Since the Seaborn plot commands (unlike the ggplot2 commands in R) seem to accept one and only one dataframe, we need to merge our predictions into the raw data:

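    # append to df (DataFrame.append was removed in pandas 2.0;
    # on newer versions use pd.concat([df, preds]) instead):
    merged = df.append(preds)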
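To make the layout of the combined frame concrete, here is a minimal sketch with made-up toy values (not the question's data). The names `obs`, `grid`, and `stacked` are hypothetical stand-ins for `df`, `preds`, and `merged`, and `pd.concat` is used to stack the rows.

```python
import pandas as pd

# Toy stand-ins for the question's `df` (observations) and `preds` (model grid).
obs = pd.DataFrame({"x1": [0.1, 0.5], "x2": ["a", "b"], "y": [1.0, 2.0]})
grid = pd.DataFrame({"x1": [0.0, 1.0], "x2": ["a", "a"], "yhat": [0.9, 1.4]})

# Stack the two frames; columns missing from one frame are filled with NaN.
stacked = pd.concat([obs, grid], ignore_index=True)
print(stacked)

# Observation rows have no yhat, prediction rows have no y.
n_missing_y = int(stacked["y"].isna().sum())
n_missing_yhat = int(stacked["yhat"].isna().sum())
print(n_missing_y, n_missing_yhat)
```

Each plotting layer therefore only has values in its own rows: the `y` column carries the data and the `yhat` column carries the predictions.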

Now we can plot the model predictions along with the data, with our continuous variable x1 as the x axis:

    # plot using seaborn:
    sns.set_style('white')
    sns.set_context('talk')
    g = sns.FacetGrid(merged, hue='x2', col='x3', size=5)
    # use the `map` method to add stuff to the facetgrid axes:
    g.map(plt.plot, "x1", "yhat")
    g.map(plt.scatter, "x1", "y")
    g.add_legend()
    g.fig.subplots_adjust(wspace=0.3)
    sns.despine(offset=10);

So far so good. Now imagine that we did not measure the continuous variable x1, and we only know about the two other (categorical) variables (i.e., we have a 2x2 factorial design). How can we plot the model predictions against the data in this case?

    fit = smf.ols('y ~ x2 * x3', df).fit()
    print(fit.summary())

    preds = expand_grid(
        {'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)
    print(preds)

    # append to df:
    merged = df.append(preds)

Well, we can plot the model predictions using sns.pointplot or similar, for example:

    # plot using seaborn:
    g = sns.FacetGrid(merged, hue='x3', size=4)
    g.map(sns.pointplot, 'x2', 'yhat')
    g.add_legend();
    sns.despine(offset=10);

Or the data using sns.factorplot, as follows:

    g = sns.factorplot('x2', 'y', hue='x3', kind='point', data=merged)
    sns.despine(offset=10);
    g.savefig('tmp3.png')

But I don't see how to produce a plot similar to the first one (i.e. lines for the model predictions using plt.plot, a scatter of points for the data using plt.scatter). The reason is that the x2 variable I'm trying to use as the x axis is a string / object, so the pyplot commands don't know what to do with it.


python matplotlib statsmodels seaborn

tsawallis Jan 30 '15 at 15:36


1 answer

mwaskom · Accepted Answer · 2015-01-30T19:56:00+0000

As I mentioned in my comments, there are two ways I would think about doing this.

The first is to define a function that does the fit and then plots, and pass it to FacetGrid.map:

    import pandas as pd
    import seaborn as sns
    tips = sns.load_dataset("tips")

    def plot_good_tip(day, total_bill, **kws):

        expected_tip = (total_bill.groupby(day)
                                  .mean()
                                  .apply(lambda x: x * .2)
                                  .reset_index(name="tip"))
        sns.pointplot(expected_tip.day, expected_tip.tip,
                      linestyles=["--"], markers=["D"])

    g = sns.FacetGrid(tips, col="sex", size=5)
    g.map(sns.pointplot, "day", "tip")
    g.map(plot_good_tip, "day", "total_bill")
    g.set_axis_labels("day", "tip")
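To see what the helper computes before anything is drawn, the same groupby step can be run on a tiny hand-made frame. The bills below are invented for illustration and are not taken from the real `tips` dataset; `toy` is a hypothetical name.

```python
import pandas as pd

# Made-up bills, not the real `tips` dataset.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

# Same chain as inside plot_good_tip: mean bill per day, then 20% of it.
expected_tip = (toy["total_bill"].groupby(toy["day"])
                .mean()
                .apply(lambda x: x * .2)
                .reset_index(name="tip"))

# One row per day, with columns "day" and "tip".
print(expected_tip)
```

Inside `FacetGrid.map` the function receives the facet's `day` and `total_bill` columns as Series, so this "model" is refit separately for each facet.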

The second is to calculate the predicted values and then merge them into your DataFrame with an additional variable that identifies what is data and what is model:

    tip_predict = (tips.groupby(["day", "sex"])
                       .total_bill
                       .mean()
                       .apply(lambda x: x * .2)
                       .reset_index(name="tip"))
    tip_all = pd.concat(dict(data=tips[["day", "sex", "tip"]], model=tip_predict),
                        names=["kind"]).reset_index()

    sns.factorplot("day", "tip", "kind", data=tip_all, col="sex",
                   kind="point", linestyles=["-", "--"], markers=["o", "D"])
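The key step in the second approach is the `pd.concat` call with a dict, which turns the dict keys into a labelled index level. A minimal sketch with two made-up frames (the values and the names `data`, `model`, and `both` are hypothetical) shows the resulting "kind" column:

```python
import pandas as pd

# Made-up observed and predicted tips.
data = pd.DataFrame({"day": ["Thur", "Fri"], "tip": [3.0, 4.0]})
model = pd.DataFrame({"day": ["Thur", "Fri"], "tip": [2.5, 3.5]})

# The dict keys become the outer index level, named "kind";
# reset_index() then turns that level into an ordinary column.
both = pd.concat(dict(data=data, model=model), names=["kind"]).reset_index()

print(both)
print(list(both["kind"]))
```

Because observed and predicted values share the same `tip` column, a single plotting call can split them by passing `kind` as the hue variable.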
