The Pandas GroupBy.apply method duplicates the first group

Question


My first SO question: I am confused by this behavior of the groupby method in pandas (0.12.0-4). It seems to apply the function TWICE to the first row of the data frame. For example:

    >>> from pandas import Series, DataFrame
    >>> import pandas as pd
    >>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
    >>> print(df)
      class  count
    0     A      1
    1     B      0
    2     C      2

First I check that groupby itself is working, and it seems to be fine:

    >>> for group in df.groupby('class', group_keys=True):
    ...     print(group)
    ...
    ('A',   class  count
    0     A      1)
    ('B',   class  count
    1     B      0)
    ('C',   class  count
    2     C      2)

Then I try to do the same thing using apply on the groupby object, and the first group is printed twice:

    >>> def checkit(group):
    ...     print(group)
    ...
    >>> df.groupby('class', group_keys=True).apply(checkit)
      class  count
    0     A      1
      class  count
    0     A      1
      class  count
    1     B      0
      class  count
    2     C      2

Any help would be appreciated! Thanks.

Edit: @Jeff provides the answer below. I was dense and didn't understand right away, so here is a simple example to show that, despite the first group being printed twice in the example above, the apply method operates on the first group only once and does not mutate the original data frame:

    >>> def addone(group):
    ...     group['count'] += 1
    ...     return group
    ...
    >>> df.groupby('class', group_keys=True).apply(addone)
    >>> print(df)
      class  count
    0     A      1
    1     B      0
    2     C      2

But by assigning the return value of the method to a new object, we see that it works as expected:

    >>> df2 = df.groupby('class', group_keys=True).apply(addone)
    >>> print(df2)
      class  count
    0     A      2
    1     B      1
    2     C      3

+35

python python-2.7 pandas group-by pandas-groupby

NC maize breeding Jim Jan 27 '14 at 19:37


3 answers

This "problem" is now fixed: upgrade to 0.25+

Starting with v0.25, GroupBy.apply() will evaluate the first group only once. See GH24748.

Relevant example from the documentation:

    pd.__version__
    # '0.25.0.dev0+590.g44d5498d8'

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

    def func(group):
        print(group.name)
        return group

New behavior (>= v0.25):

    df.groupby('a').apply(func)
    x
    y
       a  b
    0  x  1
    1  y  2

Old behavior (<= v0.24.x):

    df.groupby('a').apply(func)
    x
    x
    y
       a  b
    0  x  1
    1  y  2

Pandas still uses the first group to determine whether apply can take a fast path or not. But at least it no longer needs to evaluate the first group twice. Good work, developers!
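If you are not sure which behavior your installed version has, one way to check is to count calls per group instead of reading printed output. The following is only a sketch: the `calls` counter is a hypothetical helper introduced here for illustration, and it relies on `group.name` being set inside apply, as in the documentation example above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

# Hypothetical helper (not part of pandas): number of calls seen per group key.
calls = Counter()

def func(group):
    # group.name holds the key of the group currently being processed.
    calls[group.name] += 1
    return group

df.groupby("a").apply(func)

# A count of 2 for the first key would indicate the old double evaluation;
# a count of 1 for every key matches the behavior described for v0.25+.
print(dict(calls))
```

Counting calls this way avoids confusing the function's printed side effects with the result that apply actually returns.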

+4

cs95 May 20 '19 at 6:32


You can use a for loop over the groups to avoid groupby.apply evaluating the first group twice:

log_sample.csv

    guestid,keyword
    1,null
    2,null
    2,null
    3,null
    3,null
    3,null
    4,null
    4,null
    4,null
    4,null

my code snippet

    import pandas as pd

    df = pd.read_csv("log_sample.csv")
    grouped = df.groupby("guestid")
    # Iterating over the groupby object visits each group exactly once
    for guestid, df_group in grouped:
        print(list(df_group['guestid']))
    df.head(100)

output

    [1]
    [2, 2]
    [3, 3, 3]
    [4, 4, 4, 4]
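The loop above only prints. If you also need a combined result, as apply would give you, a possible sketch is to collect one value per group and build a Series from them. Here the CSV rows are inlined through `io.StringIO` so the example runs without the file, and the `sizes` dictionary and the `rows` name are assumptions made for illustration.

```python
import io

import pandas as pd

# Same rows as log_sample.csv, inlined so the sketch is self-contained.
csv_text = """guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null
"""
df = pd.read_csv(io.StringIO(csv_text))

# Visit each group exactly once and keep one result per group key.
sizes = {}
for guestid, df_group in df.groupby("guestid"):
    sizes[guestid] = len(df_group)

# Combine the per-group results into a single Series indexed by guestid.
result = pd.Series(sizes, name="rows")
print(result)
```

This trades the convenience of apply for explicit control over how often the per-group code runs.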

+1

geosmart Apr 04 '18 at 3:17


Zero · Accepted Answer · 2014-09-08 01:39

It is by design as described here and here

The apply function needs to know the shape of the returned data in order to work out intelligently how the results will be combined. To do this, it calls the function (checkit in your case) twice on the first group.

Depending on your actual use case, you can replace the apply call with aggregate, transform or filter, as described in detail here. These functions require the return value to have a specific shape, so they do not need to call the function twice.
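As a rough illustration of those three alternatives, the sketch below uses a small data frame similar to the one in the question, with one extra row for class A added here (an assumption) so that a group has more than one member.

```python
import pandas as pd

# Similar to the question's data, plus a second 'A' row added for illustration.
df = pd.DataFrame({"class": ["A", "B", "C", "A"], "count": [1, 0, 2, 3]})
grouped = df.groupby("class")["count"]

# aggregate: reduces each group to a single value (one row per group).
totals = grouped.agg("sum")

# transform: returns one value per original row, aligned with df.index.
group_sums = grouped.transform("sum")

# filter: keeps or drops whole groups based on a boolean test per group.
kept = df.groupby("class").filter(lambda g: g["count"].sum() > 0)

print(totals)
print(group_sums)
print(kept)
```

Which one fits depends on the shape you need back: one value per group, one value per row, or a subset of the original rows.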

However, if the function you are calling has no side effects, it most likely does not matter that it is called twice on the first group.
