Pandas: how to perform multiple group operations

I have more experience with R's data.table, but I'm trying to learn pandas. In data.table I can do something like this:

```r
> head(dt_m)
   event_id            device_id longitude latitude               time_ category
1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                 app_id is_active
1: -5305696816021977482         0
2: -7164737313972860089         0
3: -8504475857937456387         0
4: -8807740666788515175         0
5:  5302560163370202064         0
6:  5521284031585796822         0

dt_m_summary <- dt_m[, .(
    mean_active = mean(is_active, na.rm = TRUE),
    median_lat  = median(latitude, na.rm = TRUE),
    median_lon  = median(longitude, na.rm = TRUE),
    mean_time   = mean(time_),
    new_col     = your_function(latitude, longitude, time_)
  ),
  by = list(device_id, category)]
```

The new columns (mean_active through new_col), as well as device_id and category, will appear in dt_m_summary. I could also do a similar by operation on the source table itself if I need a new column holding group-wise results:

```r
dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]
```

(in case I wanted, for example, to select rows where mean_active is greater than some threshold, or to do something else with them).

I know there is groupby in pandas, but I have not found a way to express such transformations as simply as above. The best I could come up with is to make a series of groupby calls and then combine the results into a single DataFrame, but that seems very awkward. Is there a better way to do this?

2 answers

IIUC, use groupby and agg. See the docs for more details.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2),
                  pd.MultiIndex.from_product([list('XY'), range(5)]),
                  list('AB'))
df
```


```python
df.groupby(level=0).agg(['sum', 'count', 'std'])
```



A more customized example would be:

```python
# level=0 means group by the first level of the index.
# If there is a specific column you want to group by,
# use groupby('specific column name') instead.
df.groupby(level=0).agg({'A': ['sum', 'std'],
                         'B': {'my_function': lambda x: x.sum() ** 2}})
```


Note that the dict passed to the agg method has the keys 'A' and 'B'. This means the functions ['sum', 'std'] are applied to column 'A', and lambda x: x.sum() ** 2 is applied to column 'B' (with its result named 'my_function').
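One caveat: the nested-dict renaming form shown above was deprecated in pandas 0.25 and removed in 1.0. In current pandas the same per-column aggregation with custom output names is written with named aggregation; a minimal sketch (the output column names A_sum, A_std, my_function are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2),
                  index=pd.MultiIndex.from_product([list('XY'), range(5)]),
                  columns=list('AB'))

# Named aggregation: each keyword becomes an output column name;
# the tuple is (source column, aggregation function).
result = df.groupby(level=0).agg(
    A_sum=('A', 'sum'),
    A_std=('A', 'std'),
    my_function=('B', lambda x: x.sum() ** 2),
)
```

This avoids both the MultiIndex columns produced by the list form and the removed dict-of-dicts renaming.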

A second note, concerning your new_col: agg requires that each function passed reduce a column to a scalar. You are better off adding such a column before the groupby/agg.
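For the data.table `:=` idiom from the question (broadcasting a group statistic back onto every source row), the closest pandas equivalent is transform. A sketch with invented data, reusing the question's column names:

```python
import pandas as pd

dt_m = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    'category': ['free', 'free', 'paid', 'paid'],
    'is_active': [0, 1, 1, 1],
})

# transform returns a result aligned with the original rows, mirroring
# dt_m[, mean_active := mean(is_active), by = list(device_id, category)]
dt_m['mean_active'] = (dt_m.groupby(['device_id', 'category'])['is_active']
                           .transform('mean'))

# The new column can then be used to filter rows, e.g. by a threshold.
filtered = dt_m[dt_m['mean_active'] > 0.5]
```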


@piRSquared has a great answer, but in your particular case, I think you might be interested in pandas apply, which is very flexible. Since the function is applied to each group one at a time, you can work with several columns at once inside each group's sub-DataFrame.

```python
import numpy as np
import pandas as pd

def your_function(sub_df):
    return np.mean(np.cos(sub_df['latitude'])
                   + np.sin(sub_df['longitude'])
                   - np.tan(sub_df['time_']))

def group_function(g):
    return pd.Series([g['is_active'].mean(),
                      g['latitude'].median(),
                      g['longitude'].median(),
                      g['time_'].mean(),
                      your_function(g)],
                     index=['mean_active', 'median_lat',
                            'median_lon', 'mean_time', 'new_col'])

dt_m.groupby(['device_id', 'category']).apply(group_function)
```
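Since dt_m is not shown in full, here is a self-contained sketch of the same pattern with invented data (and a numeric time_ column for simplicity, dropping new_col): returning a Series from the applied function yields one summary row per group, just like the data.table expression in the question.

```python
import pandas as pd

dt_m = pd.DataFrame({
    'device_id': [1, 1, 2, 2],
    'category': ['free', 'free', 'paid', 'paid'],
    'is_active': [0, 1, 1, 1],
    'latitude': [10.0, 12.0, 20.0, 22.0],
    'longitude': [30.0, 32.0, 40.0, 42.0],
    'time_': [100.0, 200.0, 300.0, 400.0],
})

def group_function(g):
    # g is the sub-DataFrame for one (device_id, category) group,
    # so arbitrary multi-column logic is possible here.
    return pd.Series({
        'mean_active': g['is_active'].mean(),
        'median_lat': g['latitude'].median(),
        'median_lon': g['longitude'].median(),
        'mean_time': g['time_'].mean(),
    })

summary = dt_m.groupby(['device_id', 'category']).apply(group_function)
```

The result is a DataFrame indexed by (device_id, category) with one row per group.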

However, I definitely agree with @piRSquared that it would be very helpful to see a complete example, including the expected output.

