Pandas: getting the top n records within each group

Suppose I have a pandas DataFrame:

    >>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
    >>> df
       id  value
    0   1      1
    1   1      2
    2   1      3
    3   2      1
    4   2      2
    5   2      3
    6   2      4
    7   3      1
    8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

       id  value
    0   1      1
    1   1      2
    3   2      1
    4   2      2
    7   3      1
    8   4      1

I can do this by numbering the records within each group after grouping:

    >>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
    >>> dfN
       id  level_1  index  value
    0   1        0      0      1
    1   1        1      1      2
    2   1        2      2      3
    3   2        0      3      1
    4   2        1      4      2
    5   2        2      5      3
    6   2        3      6      4
    7   3        0      7      1
    8   4        0      8      1
    >>> dfN[dfN['level_1'] <= 1][['id', 'value']]
       id  value
    0   1      1
    1   1      2
    3   2      1
    4   2      2
    7   3      1
    8   4      1

But is there a more efficient or elegant way to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?

Thanks in advance.

+62
python pandas greatest-n-per-group window-functions
Nov 19 '13 at 10:28
2 answers

Have you tried df.groupby('id').head(2)?

Generated output:

    >>> df.groupby('id').head(2)
          id  value
    id
    1  0   1      1
       1   1      2
    2  3   2      1
       4   2      2
    3  7   3      1
    4  8   4      1

(Keep in mind that you might need to sort the rows first, depending on your data.)
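To illustrate the sorting caveat, here is a minimal sketch that keeps the two largest values per id rather than the first two rows as stored. It assumes a reasonably recent pandas where DataFrame.sort_values is available (older releases used a different sorting method); the name top2 is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort so that, within each id, the largest values come first.
# head(2) then keeps the first two rows of every group in that order.
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2))

print(top2)
```

Because head() only looks at row order, whatever ordering is applied beforehand decides which rows count as the "top" ones.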

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the multi-index and flatten the results.

    >>> df.groupby('id').head(2).reset_index(drop=True)
       id  value
    0   1      1
    1   1      2
    2   2      1
    3   2      2
    4   3      1
    5   4      1
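On the second part of the question (numbering the records within each group, similar to SQL's row_number()), one possible sketch uses GroupBy.cumcount(), assuming a pandas version that provides it. The column name row_number is made up for this example, and the numbering starts at 0 rather than 1.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order.
df['row_number'] = df.groupby('id').cumcount()

# Keep the rows numbered 0 and 1, i.e. the first two rows of each id.
top2 = df[df['row_number'] < 2][['id', 'value']]

print(df)
print(top2)
```

Filtering on the counter gives the same rows as head(2), but the counter itself stays available if it is needed for something else.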
+76
Nov 19 '13 at 10:46

Since version 0.14.1, you can call nlargest and nsmallest on a groupby object:

    In [23]: df.groupby('id')['value'].nlargest(2)
    Out[23]:
    id
    1   2    3
        1    2
    2   6    4
        5    3
    3   7    1
    4   8    1
    dtype: int64

There's a slight oddity in that you also get the original index in the result, but this can be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
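A small sketch of both options, assuming the same df as in the question: dropping the inner index level, or using it to pull the full original rows back out. The names flat and rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# The result is indexed by (id, original row label).
top2 = df.groupby('id')['value'].nlargest(2)

# Option 1: drop the original row labels and keep only id as the index.
flat = top2.reset_index(level=1, drop=True)

# Option 2: use the original row labels to select the matching rows of df.
rows = df.loc[top2.index.get_level_values(1)]

print(flat)
print(rows)
```

The second option is one case where keeping the original index helps, since it leads straight back to the complete rows.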

(Note: From 0.17.1 you should be able to do this on a DataFrameGroupBy too, but for now it only works with Series and SeriesGroupBy.)

+65
Sep 04 '15 at 12:14


