Plotting data and model predictions on the same plot using Seaborn and Statsmodels

Seaborn is a great package for high-level plotting with good-looking output. However, I'm struggling a bit with using Seaborn to overlay both data and model predictions from an externally fitted model. In this example, I fit models in Statsmodels that are too complex for Seaborn to fit out of the box, but I think the problem is more general (i.e. I have model predictions and want to visualize them together with the data using Seaborn).

Let's start with the imports and the dataset:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import statsmodels.formula.api as smf
    import patsy
    import itertools
    import matplotlib.pyplot as plt

    np.random.seed(12345)

    # make a data frame with one continuous and two categorical variables:
    df = pd.DataFrame({'x1': np.random.normal(size=100),
                       'x2': np.tile(np.array(['a', 'b']), 50),
                       'x3': np.repeat(np.array(['c', 'd']), 50)})

    # create a design matrix using patsy:
    X = patsy.dmatrix('x1 * x2 * x3', df)

    # some random beta weights:
    betas = np.random.normal(size=X.shape[1])

    # create the response variable as the noisy linear combination of predictors:
    df['y'] = np.inner(X, betas) + np.random.normal(size=100)

We fit a model in Statsmodels containing all predictor variables and their interactions:

    # fit a model with all interactions
    fit = smf.ols('y ~ x1 * x2 * x3', df).fit()
    print(fit.summary())

Since in this case all combinations of the variables are present in the data and our model predictions are linear, it would be enough for plotting to add a new "predictions" column to the dataframe containing the model predictions. However, this is not very general (imagine our model is non-linear, so we want our plots to show smooth curves), so instead I create a new dataframe with all combinations of predictors and then generate predictions:

    # create a new dataframe of predictions, using pandas' expand grid:
    def expand_grid(data_dict):
        """ A port of R expand.grid function for use with Pandas dataframes.

        from http://pandas.pydata.org/pandas-docs/stable/cookbook.html?highlight=expand%20grid
        """
        rows = itertools.product(*data_dict.values())
        return pd.DataFrame.from_records(rows, columns=data_dict.keys())

    # build a new matrix with expand grid:
    preds = expand_grid(
        {'x1': np.linspace(df['x1'].min(), df['x1'].max(), 2),
         'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)

The preds dataframe looks like this:

      x3        x1 x2      yhat
    0  c -2.370232  a -1.555902
    1  c -2.370232  b -2.307295
    2  c  3.248944  a -1.555902
    3  c  3.248944  b -2.307295
    4  d -2.370232  a -1.609652
    5  d -2.370232  b -2.837075
    6  d  3.248944  a -1.609652
    7  d  3.248944  b -2.837075
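
As a cross-check on the grid construction, the same set of predictor combinations can probably be built without a helper function. The following is only a sketch: the level values are made up, and `levels` and `grid` are illustrative names that do not appear in the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictor levels, mirroring the shape of the grid above.
levels = {
    'x1': np.linspace(-2.0, 3.0, 2),
    'x2': ['a', 'b'],
    'x3': ['c', 'd'],
}

# Cartesian product of all levels; to_frame turns the index levels into columns.
grid = pd.MultiIndex.from_product(
    list(levels.values()), names=list(levels.keys())
).to_frame(index=False)

# One row per combination of levels, one column per predictor.
print(grid)
```

Either way, the result has one row for each combination of levels, which is what `fit.predict` needs to produce a prediction per cell.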

Since Seaborn plotting commands (unlike ggplot2 commands in R) seem to accept one and only one dataframe, we need to merge our predictions into the raw data:

    # append to df (DataFrame.append was removed in pandas 2.0, so use pd.concat):
    merged = pd.concat([df, preds])
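
To see why a single stacked frame is enough to carry both data and predictions, here is a small sketch with made-up values. The names `raw`, `grid` and `stacked` are illustrative stand-ins for `df`, `preds` and `merged`.

```python
import pandas as pd

# Toy stand-ins for the raw data and the prediction grid (values are made up).
raw = pd.DataFrame({'x1': [0.1, 0.5, 0.9],
                    'x2': ['a', 'b', 'a'],
                    'y': [1.0, 2.0, 1.5]})
grid = pd.DataFrame({'x1': [0.0, 1.0],
                     'x2': ['a', 'a'],
                     'yhat': [0.8, 1.7]})

# Stack the two frames; a column missing from one frame is filled with NaN.
stacked = pd.concat([raw, grid], ignore_index=True)

# Data rows have no 'yhat' and prediction rows have no 'y'.
print(stacked)
```

Because the data rows are NaN in `yhat` and the prediction rows are NaN in `y`, plotting one column only draws the rows that actually belong to it.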

Now we can plot the model predictions along with the data, with our continuous variable x1 on the x axis:

    # plot using seaborn:
    sns.set_style('white')
    sns.set_context('talk')
    # `height` was called `size` in Seaborn versions before 0.9
    g = sns.FacetGrid(merged, hue='x2', col='x3', height=5)
    # use the `map` method to add stuff to the facetgrid axes:
    g.map(plt.plot, "x1", "yhat")
    g.map(plt.scatter, "x1", "y")
    g.add_legend()
    g.fig.subplots_adjust(wspace=0.3)
    sns.despine(offset=10);

So far so good. Now imagine that we didn't measure the continuous variable x1, and we only know about the other two (categorical) variables (i.e., we have a 2x2 factorial design). How can we plot the model predictions against the data in this case?

    fit = smf.ols('y ~ x2 * x3', df).fit()
    print(fit.summary())

    preds = expand_grid(
        {'x2': ['a', 'b'],
         'x3': ['c', 'd']})
    preds['yhat'] = fit.predict(preds)
    print(preds)

    # append to df (DataFrame.append was removed in pandas 2.0, so use pd.concat):
    merged = pd.concat([df, preds])

Well, we can plot the model predictions using sns.pointplot or similar, like so:

    # plot using seaborn:
    # `height` was called `size` in Seaborn versions before 0.9
    g = sns.FacetGrid(merged, hue='x3', height=4)
    g.map(sns.pointplot, 'x2', 'yhat')
    g.add_legend();
    sns.despine(offset=10);

enter image description here

Or the data using sns.catplot (named sns.factorplot in Seaborn versions before 0.9), like so:

    g = sns.catplot(x='x2', y='y', hue='x3', kind='point', data=merged)
    sns.despine(offset=10);
    g.savefig('tmp3.png')

But I don't see how to produce a plot similar to the first one (i.e. lines for model predictions using plt.plot, a scatter of points for the data using plt.scatter). The reason is that the x2 variable I'm trying to use as the x axis is a string/object, so the pyplot commands don't know what to do with it.

1 answer

As I mentioned in my comments, there are two ways I would think about doing this.

The first is to define a function that does the fit and then plots, and pass it to FacetGrid.map:

    import pandas as pd
    import seaborn as sns

    tips = sns.load_dataset("tips")

    def plot_good_tip(day, total_bill, **kws):
        # "model": the expected tip is 20% of the mean total bill for each day
        expected_tip = (total_bill.groupby(day)
                                  .mean()
                                  .apply(lambda x: x * .2)
                                  .reset_index(name="tip"))
        # recent Seaborn versions require x and y to be passed as keywords
        sns.pointplot(x=expected_tip.day, y=expected_tip.tip,
                      linestyles=["--"], markers=["D"])

    # `height` was called `size` in Seaborn versions before 0.9
    g = sns.FacetGrid(tips, col="sex", height=5)
    g.map(sns.pointplot, "day", "tip")
    g.map(plot_good_tip, "day", "total_bill")
    g.set_axis_labels("day", "tip")
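
The same pattern should also work with plain pyplot calls if the mapped function translates the category labels into integer positions itself. The sketch below is an assumption-laden illustration, not part of the approach above: `plot_group_means` and its `order` parameter are hypothetical names, and the data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_group_means(x, y, order=("a", "b"), **kws):
    """Draw the mean of y per category of x as a line on the current axes.

    Hypothetical helper: `order` fixes the integer position of each label.
    """
    means = (pd.DataFrame({"x": list(x), "y": list(y)})
               .groupby("x")["y"]
               .mean()
               .reindex(list(order)))
    positions = list(range(len(order)))
    ax = plt.gca()
    # Plot against integer positions, then label the ticks with the categories.
    ax.plot(positions, means.to_numpy(), **kws)
    ax.set_xticks(positions)
    ax.set_xticklabels(list(order))
    return means


# Made-up data: two categories with two observations each.
x = pd.Series(["a", "b", "a", "b"])
y = pd.Series([1.0, 3.0, 2.0, 5.0])
means = plot_group_means(x, y, linestyle="--", marker="D")
print(means)
```

A function with this signature takes its columns positionally and forwards the remaining keyword arguments to the plotting call, which is the shape FacetGrid.map expects.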

The second is to compute the predicted values and then merge them into your DataFrame, with an additional variable that identifies what is data and what is model:

    tip_predict = (tips.groupby(["day", "sex"])
                       .total_bill
                       .mean()
                       .apply(lambda x: x * .2)
                       .reset_index(name="tip"))

    # stack data and model rows; the dict keys become the "kind" column
    tip_all = pd.concat(dict(data=tips[["day", "sex", "tip"]], model=tip_predict),
                        names=["kind"]).reset_index()

    # sns.catplot was named sns.factorplot in Seaborn versions before 0.9
    sns.catplot(x="day", y="tip", hue="kind", data=tip_all, col="sex",
                kind="point", linestyles=["-", "--"], markers=["o", "D"])
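
Applied to the 2x2 design from the question, the same reshaping might look roughly like the sketch below. It uses made-up data and takes the cell means of `y` as a stand-in for the model predictions, which is an assumption made to keep the example self-contained (for an OLS fit with all interactions the fitted values coincide with the cell means). The name `both` is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up 2x2 factorial data shaped like the question's df (x2, x3, y).
df = pd.DataFrame({
    'x2': np.tile(['a', 'b'], 50),
    'x3': np.repeat(['c', 'd'], 50),
    'y': rng.normal(size=100),
})

# Stand-in for model predictions: the mean of y in each cell of the design.
# Predictions from any fitted model could be substituted here.
preds = df.groupby(['x2', 'x3'], as_index=False)['y'].mean()

# Label each frame, then stack them so one column says what is data and what is model.
both = pd.concat(
    [df.assign(kind='data'), preds.assign(kind='model')],
    ignore_index=True,
)

print(both.groupby('kind').size())
```

A frame like `both` can then be handed to a categorical plot with `x2` on the x axis and `kind` as the hue or as a facet, in the same way as `tip_all` above.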

Source: https://habr.com/ru/post/1212281/

