Pandas dataframe get the first row of each group

Question


I have a pandas DataFrame as shown below.

 import pandas as pd

 df = pd.DataFrame({
     'id': [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
     'value': ["first","second","second","first","second","first","third","fourth",
               "fifth","second","fifth","first","first","second","third","fourth","fifth"]})

I want to group this by ["id", "value"] and get the first row of each group.

      id   value
  0    1   first
  1    1  second
  2    1  second
  3    2   first
  4    2  second
  5    3   first
  6    3   third
  7    3  fourth
  8    3   fifth
  9    4  second
  10   4   fifth
  11   5   first
  12   6   first
  13   6  second
  14   6   third
  15   7  fourth
  16   7   fifth

Expected Result

  id   value
   1   first
   2   first
   3   first
   4  second
   5   first
   6   first
   7  fourth

I tried the following, but it only returns the first row of the DataFrame. Any help is appreciated.

 In [25]: for index, row in df.iterrows():
    ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

+90

python pandas dataframe

Nilani Algiriyage Nov 19 '13 at 9:24


5 answers

This will give you the second row of each group (nth is zero-indexed, so nth(0) corresponds to first()):

 df.groupby('id').nth(1)

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
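As a minimal sketch of the zero-based indexing, using the data from the question: nth(0) picks the first row of each group and nth(1) the second, and a group with too few rows (id 5 here) simply does not appear in the result. The variable names are only illustrative, and since the index of the result can differ between pandas versions, only the value column is inspected.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"]})

# nth is zero-indexed: 0 is the first row of each group, 1 is the second.
first_rows = df.groupby('id').nth(0)
second_rows = df.groupby('id').nth(1)

print(first_rows['value'].tolist())
print(second_rows['value'].tolist())  # id 5 has only one row, so it is missing
```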

+35

wij Mar 18 '16 at 0:03


I would suggest using .nth(0) instead of .first() if you need to get the first row.

The difference between the two is how they handle NaN: .nth(0) returns the first row of each group regardless of the values in that row, while .first() returns the first non-NaN value in each column.

For example, if your dataset is:

 >>> import numpy as np
 >>> import pandas as pd
 >>> df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
 ...                    'value' : ["first","second","third", np.nan, "second","first",
 ...                               "second","third", "fourth","first","second"]})
 >>> df.groupby('id').nth(0)
     value
 id
 1   first
 2     NaN
 3   first
 4   first

whereas:

 >>> df.groupby('id').first()
      value
 id
 1    first
 2   second
 3    first
 4    first
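A small self-contained check of this difference, as a sketch only: it rebuilds the example frame (with np.nan) and compares the value column of both results. The names nth_values and first_values are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    'value': ["first", "second", "third", np.nan, "second", "first",
              "second", "third", "fourth", "first", "second"]})

# nth(0) keeps the first row as it is, so the NaN of group 2 survives.
nth_values = df.groupby('id').nth(0)['value']

# first() skips missing values and takes the first non-NaN entry per column.
first_values = df.groupby('id').first()['value']

print(nth_values.isna().sum())   # number of groups whose first row is NaN
print(first_values.tolist())
```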

+11

vital_dml Mar 07 '18 at 9:54


Perhaps this is what you want:

 import pandas as pd

 idx = pd.MultiIndex.from_product([['state1','state2'],
                                   ['county1','county2','county3','county4']])
 df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)

                  pop
  state1 county1   12
         county2   15
         county3   65
         county4   42
  state2 county1   78
         county2   67
         county3   55
         county4   31

 df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

 Out[29]:
                  pop
 state1 county3   65
        county4   42
        county2   15
 state2 county1   78
        county2   67
        county3   55
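One possible alternative, sketched under the assumption that the row order of the final result does not matter: sort the whole frame once and then take head(3) per group, which avoids the apply. This is a swapped-in variant and not the code from the answer; top3 and the state1/state2 helper names are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['state1', 'state2'],
                                  ['county1', 'county2', 'county3', 'county4']])
df = pd.DataFrame({'pop': [12, 15, 65, 42, 78, 67, 55, 31]}, index=idx)

# Sort all rows by population (descending), then keep the first three rows
# seen for each state. Rows come back in the globally sorted order.
top3 = df.sort_values('pop', ascending=False).groupby(level=0).head(3)

# Pull out each state's rows to inspect them separately.
state1 = top3[top3.index.get_level_values(0) == 'state1']
state2 = top3[top3.index.get_level_values(0) == 'state2']
print(state1['pop'].tolist())
print(state2['pop'].tolist())
```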

+4

Siraj S. Oct 28 '16 at 18:39


If you only need the first row of each group, you can also use drop_duplicates; note that its default is keep='first'.

 df.drop_duplicates('id')

 Out[1027]:
     id   value
 0    1   first
 3    2   first
 5    3   first
 9    4  second
 11   5   first
 12   6   first
 15   7  fourth
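To make the role of keep concrete, here is a short sketch on the question's data that compares the default keep='first' with keep='last'. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"]})

# Default keep='first': the first occurrence of every id is kept.
first_per_id = df.drop_duplicates('id')

# keep='last' keeps the last occurrence of every id instead.
last_per_id = df.drop_duplicates('id', keep='last')

print(first_per_id['value'].tolist())
print(last_per_id['value'].tolist())
```

Unlike groupby('id').first(), this keeps the original row labels and does not skip NaN values.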

+1

Wen-Ben Mar 20 '19 at 21:01


Roman Pekar · Accepted Answer · 2013-11-19 09:25

 >>> df.groupby('id').first()
      value
 id
 1    first
 2    first
 3    first
 4   second
 5    first
 6    first
 7   fourth

If you need id as a column:

 >>> df.groupby('id').first().reset_index()
    id   value
 0   1   first
 1   2   first
 2   3   first
 3   4  second
 4   5   first
 5   6   first
 6   7  fourth
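As a hedged alternative to calling reset_index() afterwards, groupby accepts as_index=False, which should leave id as a regular column in the aggregated result. A minimal sketch with the question's data (with_column is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"]})

# as_index=False keeps the grouping key as a column instead of the index.
with_column = df.groupby('id', as_index=False).first()

print(list(with_column.columns))
print(with_column['value'].tolist())
```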

To get the first n rows of each group, you can use head():

 >>> df.groupby('id').head(2).reset_index(drop=True)
     id   value
 0    1   first
 1    1  second
 2    2   first
 3    2  second
 4    3   first
 5    3   third
 6    4  second
 7    4   fifth
 8    5   first
 9    6   first
 10   6  second
 11   7  fourth
 12   7   fifth
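A short sketch of what head() does before the reset_index(drop=True) step: it returns the selected rows with their original row labels, so head(1) gives the first row of each group in place. The names head1 and head2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'value': ["first", "second", "second", "first", "second", "first",
              "third", "fourth", "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"]})

# head(n) keeps the first n rows of every group and the original index.
head1 = df.groupby('id').head(1)
head2 = df.groupby('id').head(2)

print(head1.index.tolist())   # original row labels of the first rows
print(len(head2))             # id 5 contributes only one row
```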
