Selecting the top n items from each group in pandas groupby

I have a dataframe that looks something like this:

    >>> data
        price currency
    id
    2    1050       EU
    5    1400       EU
    4    1750       EU
    8    4000       EU
    7     630      GBP
    1    1000      GBP
    9    1400      GBP
    3    2000      USD
    6    7000      USD

I need a new dataframe containing the n most expensive products for each currency, where n depends on the currency and is given in a separate frame:

    >>> select_number
              number_to_select
    currency
    GBP                      2
    EU                       2
    USD                      1

If I needed the same number of the most expensive items for every currency, I could sort by price, group the data by currency with groupby, and then call the head method on the grouped object.

However, head only accepts a number, not an array or some expression.
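For reference, a minimal sketch of the fixed-n case described above, assuming the frame from the question is rebuilt by hand with id as the index. The name top2 is only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question, indexed by id.
data = pd.DataFrame(
    {
        "price": [1050, 1400, 1750, 4000, 630, 1000, 1400, 2000, 7000],
        "currency": ["EU"] * 4 + ["GBP"] * 3 + ["USD"] * 2,
    },
    index=pd.Index([2, 5, 4, 8, 7, 1, 9, 3, 6], name="id"),
)

# head(n) takes the first n rows of each group, so sort by price first.
# n is a single integer shared by all groups.
top2 = data.sort_values("price", ascending=False).groupby("currency").head(2)
print(top2)
```

This works only because every currency gets the same n, which is exactly the limitation the question is about.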

Of course, I can write a for loop, but that would be a very inconvenient and inefficient way to do this.

How can this be done in a good way?

3 answers

You can use:

    import pandas as pd

    data = pd.DataFrame({
        'id': {0: 2, 1: 5, 2: 4, 3: 8, 4: 7, 5: 1, 6: 9, 7: 3, 8: 6},
        'price': {0: 1050, 1: 1400, 2: 1750, 3: 4000, 4: 630,
                  5: 1000, 6: 1400, 7: 2000, 8: 7000},
        'currency': {0: 'EU', 1: 'EU', 2: 'EU', 3: 'EU', 4: 'GBP',
                     5: 'GBP', 6: 'GBP', 7: 'USD', 8: 'USD'}})
    select_number = pd.DataFrame({'number_to_select': {'USD': 1, 'GBP': 2, 'EU': 2}})

    print(data)
      currency  id  price
    0       EU   2   1050
    1       EU   5   1400
    2       EU   4   1750
    3       EU   8   4000
    4      GBP   7    630
    5      GBP   1   1000
    6      GBP   9   1400
    7      USD   3   2000
    8      USD   6   7000

    print(select_number)
         number_to_select
    EU                  2
    GBP                 2
    USD                 1

Solution 1, looking up n for each group in a dict:

    d = select_number.to_dict()
    d1 = d['number_to_select']
    print(d1)
    {'USD': 1, 'EU': 2, 'GBP': 2}

    print(data.groupby('currency')
              .apply(lambda dfg: dfg.nlargest(d1[dfg.name], 'price'))
              .reset_index(drop=True))
      currency  id  price
    0       EU   8   4000
    1       EU   4   1750
    2      GBP   9   1400
    3      GBP   1   1000
    4      USD   6   7000
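The extra d['number_to_select'] step is needed because DataFrame.to_dict() returns a nested dict keyed by column name first. A small sketch of the difference, using the select_number frame from this answer. The names nested and flat are illustrative.

```python
import pandas as pd

select_number = pd.DataFrame({"number_to_select": {"USD": 1, "GBP": 2, "EU": 2}})

# DataFrame.to_dict() nests by column: {column -> {index -> value}}.
nested = select_number.to_dict()
print(nested)

# Selecting the column first gives the flat {currency -> n} mapping directly.
flat = select_number["number_to_select"].to_dict()
print(flat)
```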

Solution 2, looking up n directly in select_number with loc:

    print(data.groupby('currency')
              .apply(lambda dfg: dfg.nlargest(
                  select_number.loc[dfg.name, 'number_to_select'], 'price'))
              .reset_index(drop=True))
       id  price currency
    0   8   4000       EU
    1   4   1750       EU
    2   9   1400      GBP
    3   1   1000      GBP
    4   6   7000      USD

Explanation:

For debugging, it helps to replace the lambda with a named function f that prints each intermediate value:

    def f(dfg):
        # dfg is the DataFrame for one group
        print(dfg)
        # name of the group
        print(dfg.name)
        # select the value from select_number
        print(select_number.loc[dfg.name, 'number_to_select'])
        # top rows for this group
        print(dfg.nlargest(select_number.loc[dfg.name, 'number_to_select'], 'price'))
        return dfg.nlargest(select_number.loc[dfg.name, 'number_to_select'], 'price')

    print(data.groupby('currency').apply(f))

      currency  id  price
    0       EU   2   1050
    1       EU   5   1400
    2       EU   4   1750
    3       EU   8   4000
      currency  id  price
    0       EU   2   1050
    1       EU   5   1400
    2       EU   4   1750
    3       EU   8   4000
    EU
    2
      currency  id  price
    3       EU   8   4000
    2       EU   4   1750
      currency  id  price
    4      GBP   7    630
    5      GBP   1   1000
    6      GBP   9   1400
    GBP
    2
      currency  id  price
    6      GBP   9   1400
    5      GBP   1   1000
      currency  id  price
    7      USD   3   2000
    8      USD   6   7000
    USD
    1
      currency  id  price
    8      USD   6   7000
               currency  id  price
    currency
    EU       3       EU   8   4000
             2       EU   4   1750
    GBP      6      GBP   9   1400
             5      GBP   1   1000
    USD      8      USD   6   7000
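The same per-group logic can also be spelled out by iterating over the groups, which makes it explicit that dfg.name is simply the group key. This is a sketch of an alternative, not the approach used in this answer. The helper name top_n_per_group is hypothetical, and it assumes frames shaped like data and select_number above.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [2, 5, 4, 8, 7, 1, 9, 3, 6],
    "price": [1050, 1400, 1750, 4000, 630, 1000, 1400, 2000, 7000],
    "currency": ["EU"] * 4 + ["GBP"] * 3 + ["USD"] * 2,
})
select_number = pd.DataFrame({"number_to_select": {"USD": 1, "GBP": 2, "EU": 2}})


def top_n_per_group(frame, counts):
    """Return the counts[key] most expensive rows of each currency group."""
    pieces = []
    # Iterating a groupby yields (key, sub-frame) pairs; key plays the role of dfg.name.
    for key, group in frame.groupby("currency"):
        n = int(counts.loc[key, "number_to_select"])
        pieces.append(group.nlargest(n, "price"))
    return pd.concat(pieces).reset_index(drop=True)


result = top_n_per_group(data, select_number)
print(result)
```

Because it does not go through apply, this variant does not depend on how a given pandas version treats the grouping column inside apply.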

Here is the solution:

    # df is the dataframe called data in the question
    select_number = select_number['number_to_select']  # easier to select from a Series
    df.groupby('currency').apply(
        lambda dfg: dfg.nlargest(select_number[dfg.name], columns='price')
    )

Edit: following jezrael's answer, I replaced dfg.currency.iloc[0] with dfg.name.

Second edit: as pointed out in the comments, select_number is a DataFrame, so convert it to a Series first.
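A short sketch of why the conversion matters, assuming the select_number frame from the question. The names counts and frame_lookup_failed are illustrative. Indexing a DataFrame with [] looks up a column, while indexing the Series looks up the currency label.

```python
import pandas as pd

select_number = pd.DataFrame({"number_to_select": {"USD": 1, "GBP": 2, "EU": 2}})

# On the Series, [] looks up the index label, i.e. the currency.
counts = select_number["number_to_select"]
print(counts["EU"])

# On the DataFrame, [] looks up a column, so a currency label is not found.
try:
    select_number["EU"]
    frame_lookup_failed = False
except KeyError:
    frame_lookup_failed = True
print(frame_lookup_failed)
```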

MaxU and jezrael, thanks for your comments!


You can do it like this:

    df['rn'] = (df.sort_values(['price'], ascending=False)
                  .groupby('currency')
                  .cumcount() + 1)

    qry = (select_number
           .reset_index()
           .astype(str)
           .apply(lambda x: '((currency=="{0[0]}") & (rn<={0[1]}))'.format(x), axis=1)
           .str.cat(sep=' | '))

    print(df.query(qry))

Output:

    In [147]: df.query(qry)
    Out[147]:
        price currency  rn
    id
    4    1750       EU   2
    8    4000       EU   1
    1    1000      GBP   2
    9    1400      GBP   1
    6    7000      USD   1

Explanation:

rn is a helper column: the row number within each currency group, with the rows of each group ranked by descending price.
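To see what rn contains, here is a small sketch on the question's data, assuming id is the index. The ranks are computed on the sorted frame, and the assignment aligns them back to the original row order by index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [1050, 1400, 1750, 4000, 630, 1000, 1400, 2000, 7000],
        "currency": ["EU"] * 4 + ["GBP"] * 3 + ["USD"] * 2,
    },
    index=pd.Index([2, 5, 4, 8, 7, 1, 9, 3, 6], name="id"),
)

# cumcount() numbers rows 0, 1, 2, ... within each group in their current order,
# so after sorting by descending price, rank 1 is the most expensive row.
df["rn"] = df.sort_values("price", ascending=False).groupby("currency").cumcount() + 1
print(df)
```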

qry is the dynamically generated query string:

    In [149]: qry
    Out[149]: '((currency=="EU") & (rn<=2)) | ((currency=="GBP") & (rn<=2)) | ((currency=="USD") & (rn<=1))'
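As a variation on the same idea, the per-currency limit can be compared against rn directly instead of going through a generated query string. This is a sketch of an alternative, not the code from this answer. It assumes the question's frames with id as the index, and the names ordered, limit and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [1050, 1400, 1750, 4000, 630, 1000, 1400, 2000, 7000],
        "currency": ["EU"] * 4 + ["GBP"] * 3 + ["USD"] * 2,
    },
    index=pd.Index([2, 5, 4, 8, 7, 1, 9, 3, 6], name="id"),
)
select_number = pd.DataFrame({"number_to_select": {"USD": 1, "GBP": 2, "EU": 2}})

# Rank rows within each currency by descending price (1 = most expensive).
ordered = df.sort_values("price", ascending=False)
rn = ordered.groupby("currency").cumcount() + 1

# Look up each row's limit from its currency, then keep rows ranked within it.
limit = ordered["currency"].map(select_number["number_to_select"])
result = ordered[rn <= limit]
print(result)
```

The boolean mask plays the role of qry here, so no string formatting is needed when currencies are added to select_number.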