I would like to use pandas and statsmodels to fit a linear model on subsets of a DataFrame and return the predicted values. However, I am having trouble figuring out the right pandas idiom. Here is what I am trying to do:
    import pandas as pd
    import statsmodels.formula.api as sm
    import seaborn as sns

    tips = sns.load_dataset("tips")

    def fit_predict(df):
        m = sm.ols("tip ~ total_bill", df).fit()
        return pd.Series(m.predict(df), index=df.index)

    tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
This causes the following error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-139-b3d2575e2def> in <module>()
    ----> 1 tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)

    /Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
       3033                 return self._transform_general(func, *args, **kwargs)
       3034             except:
    -> 3035                 return self._transform_general(func, *args, **kwargs)
       3036
       3037         # a reduction transform

    /Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _transform_general(self, func, *args, **kwargs)
       2988                     group.T.values[:] = res
       2989                 else:
    -> 2990                     group.values[:] = res
       2991
       2992                 applied.append(group)

    ValueError: could not broadcast input array from shape (62) into shape (62,6)
The error makes sense, in that I think .transform wants to map a DataFrame to a DataFrame. But is there a way to do a groupby operation on a DataFrame, pass each chunk to a function that reduces it to a Series (with the same index), and then combine the resulting Series into something that can be inserted into the original DataFrame?