Is pandas.DataFrame.groupby guaranteed to be stable?

I have noticed several uses of pd.DataFrame.groupby followed by apply that implicitly assume groupby is stable: that is, if a and b are rows of the same group and a appears before b before grouping, then a will still appear before b after grouping.

I think several answers rely on this explicitly; to be concrete, one of them uses groupby + cumsum.
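A minimal sketch of the kind of usage in question, with made-up data (the column names key and val are hypothetical). The running total per group is only meaningful if rows within each group are visited in their original order:

```python
import pandas as pd

# Hypothetical data: two interleaved groups.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"],
                   "val": [1, 2, 3, 4, 5]})

# Running total within each group; the result depends on the
# within-group row order being the original row order.
running = df.groupby("key")["val"].cumsum()
print(running.tolist())  # [1, 2, 4, 6, 9]
```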

Is there anything that actually guarantees this behavior? The documentation only states:

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Also, since pandas has indexes, the same functionality could in theory be achieved without this guarantee (albeit in a more cumbersome way).

1 answer

Although the docs do not state this, internally it uses a stable sort when generating the groups.
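This can be observed, though not proven, on a small frame. The sketch below uses made-up data and simply records which original row labels land in each group; within every group they come out in their original order:

```python
import pandas as pd

# Hypothetical data with a default RangeIndex, so the index labels
# double as the original row positions.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"],
                   "val": [1, 2, 3, 4, 5]})

# Collect the original row labels that end up in each group.
rows_per_group = {key: group.index.tolist()
                  for key, group in df.groupby("key")}
print(rows_per_group)  # {'a': [1, 3], 'b': [0, 2, 4]}
```

An observation like this is consistent with a stable sort, but it is only an illustration, not evidence of a guarantee.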

See:

As I mentioned in the comments, this matters if you consider transform, which returns a Series whose index is aligned with the original df. If the sort did not preserve order, alignment would have to do extra work, since the Series would need to be re-sorted before assignment. In fact, this is mentioned in the code comments:

_algos.groupsort_indexer implements counting sort, and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (Cartesian product) of the unique values of the groupby keys. This can be huge when doing a multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)), where count is the length of the data frame. Both algorithms are stable sorts, and that is necessary for the correctness of groupby operations.

e.g. consider: df.groupby(key)[col].transform('first')
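To make the two algorithms concrete, here is a rough sketch. The function counting_sort_indexer is a hypothetical pure-Python stand-in written for illustration; it is not the pandas implementation of groupsort_indexer. It is compared against np.argsort(kind='mergesort') on made-up data, followed by the transform('first') call from the comment:

```python
import numpy as np
import pandas as pd

def counting_sort_indexer(codes, ngroups):
    """Hypothetical stable counting sort over integer group codes.

    Returns the row positions ordered by group code, keeping the
    original order of rows inside each group.
    """
    counts = np.zeros(ngroups + 1, dtype=np.intp)
    for c in codes:
        counts[c + 1] += 1
    # starts[g] is the next free output slot for group g.
    starts = np.cumsum(counts)[:-1].copy()
    out = np.empty(len(codes), dtype=np.intp)
    for pos, c in enumerate(codes):
        out[starts[c]] = pos
        starts[c] += 1
    return out

df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"],
                   "col": [3, 7, 1, 9, 5]})

# Integer codes for the keys: a -> 0, b -> 1.
codes, uniques = pd.factorize(df["key"], sort=True)

by_counting = counting_sort_indexer(codes, ngroups=len(uniques))
by_mergesort = np.argsort(codes, kind="mergesort")
print(by_counting.tolist())   # [1, 3, 0, 2, 4]
print(by_mergesort.tolist())  # [1, 3, 0, 2, 4]

# 'first' must pick the earliest row of each group in original order.
first = df.groupby("key")["col"].transform("first")
print(first.tolist())  # [3, 7, 3, 7, 3]
```

Both orderings list the rows of group a (positions 1, 3) and then group b (positions 0, 2, 4) in their original relative order, which is what lets 'first' return 7 for a and 3 for b, and the result keeps the index of df.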
