Add a serial number for each item in a group using Python

Question

I have a dataframe of people, each of whom has several records. I want to number the records sequentially for each person in Python. Essentially, I would like to create the "sequence" column in the following table:

    patient  date       sequence
    145      20Jun2009  1
    145      24Jun2009  2
    145      15Jul2009  3
    582      09Feb2008  1
    582      21Feb2008  2
    987      14Mar2010  1
    987      02May2010  2
    987      12May2010  3

This is essentially the same question as here, but I am working in Python and cannot use the SQL solution. I suspect that I can use the groupby operator with an iterating counter, but so far I have not been successful. Thanks!

+5

python database pandas count grouping

Dka Mar 30 '15 at 18:03

3 answers

I came across an answer that was embarrassingly simple. The groupby object has a cumcount() method that numbers the items in each group.

 df['sequence']=df.groupby('patient').cumcount()

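The caveat is that the records must already be in the order in which you want to number them. Note also that cumcount() counts from 0, so add 1 to get a sequence that starts at 1, as in the question.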
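As a minimal sketch of this approach, the snippet below rebuilds the table from the question (the DataFrame construction is my own assumption about how the data is held) and adds 1 to the result of cumcount(), since cumcount() numbers each group from 0:

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "patient": [145, 145, 145, 582, 582, 987, 987, 987],
    "date": ["20Jun2009", "24Jun2009", "15Jul2009", "09Feb2008",
             "21Feb2008", "14Mar2010", "02May2010", "12May2010"],
})

# cumcount() numbers the rows of each group starting from 0,
# so add 1 to get a sequence that starts at 1.
df["sequence"] = df.groupby("patient").cumcount() + 1

print(df)
```

This relies on the rows already being in the desired order within each patient, as they are in the question.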

+19

Dka Mar 30 '15 at 18:38

First, you want to convert the date column to pandas datetimes (rather than strings):

    In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
    Out[11]:
    0   2009-06-20
    1   2009-06-24
    2   2009-07-15
    3   2008-02-09
    4   2008-02-21
    5   2010-03-14
    6   2010-05-02
    7   2010-05-12
    Name: date, dtype: datetime64[ns]

Note: see docs for possible format options.

    In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

    In [13]: df
    Out[13]:
       patient       date  sequence
    0      145 2009-06-20         1
    1      145 2009-06-24         2
    2      145 2009-07-15         3
    3      582 2008-02-09         1
    4      582 2008-02-21         2
    5      987 2010-03-14         1
    6      987 2010-05-02         2
    7      987 2010-05-12         3

If the rows are not already in date order (for each patient), I would sort them first:

    In [14]: df = df.sort_values('date')  # df.sort('date') in very old pandas versions

Now you can groupby and cumcount:

    In [15]: g = df.groupby('patient')

    In [16]: g.cumcount() + 1
    Out[16]:
    2    1
    3    2
    0    1
    1    2
    4    1
    5    2
    6    3
    dtype: int64

This is the column you want (note that the rows are no longer in their original order):

    In [17]: df['sequence'] = g.cumcount() + 1

    In [18]: df
    Out[18]:
       patient       date  sequence
    2      582 2008-02-09         1
    3      582 2008-02-21         2
    0      145 2009-06-24         1
    1      145 2009-07-15         2
    4      987 2010-03-14         1
    5      987 2010-05-02         2
    6      987 2010-05-12         3

To restore the original order (although you may not need to), use sort_index (or we could reindex if we had kept the original DataFrame's index):

    In [19]: df.sort_index()
    Out[19]:
       patient       date  sequence
    0      145 2009-06-24         1
    1      145 2009-07-15         2
    2      582 2008-02-09         1
    3      582 2008-02-21         2
    4      987 2010-03-14         1
    5      987 2010-05-02         2
    6      987 2010-05-12         3
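One possible shortcut, sketched here on deliberately shuffled sample rows (the shuffled order is my own assumption, not data from the question): because pandas aligns on the index when assigning a Series to a column, the counts computed on a date-sorted copy land on the right rows of the unsorted frame, so no explicit reordering is needed.

```python
import pandas as pd

# The question's records in a deliberately shuffled order (hypothetical ordering).
df = pd.DataFrame({
    "patient": [987, 145, 582, 145, 987, 582, 145, 987],
    "date": ["12May2010", "24Jun2009", "21Feb2008", "20Jun2009",
             "14Mar2010", "09Feb2008", "15Jul2009", "02May2010"],
})
df["date"] = pd.to_datetime(df["date"], format="%d%b%Y")

# Count within each patient on a date-sorted copy. The result keeps the
# original index labels, so the assignment aligns each count with its row.
df["sequence"] = df.sort_values("date").groupby("patient").cumcount() + 1

print(df)
```

The frame itself keeps its original row order; only the numbering follows the dates.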

+1

Andy hayden Apr 2 '15 at 4:55

Jonathan · Accepted Answer · 2015-03-30T18:25:50+0000

The question comes down to how to sort by multiple columns of data.

One simple trick is to use the key parameter of the sorted function.

You sort by a string built from the columns of each row.

    from datetime import datetime

    rows = ...  # your source data

    def date_to_sortable_string(date):
        # Parse e.g. '20Jun2009' and return '20090620', which sorts chronologically.
        return datetime.strptime(date, "%d%b%Y").strftime("%Y%m%d")

    # Assume x[0] is the patient_id and x[1] is the encounter date.
    # Sort by patient_id, then by date.
    rows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))

    for row in rows_sorted:
        print(row)
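Sorting alone does not yet produce the sequence numbers. A minimal pure-Python sketch of the remaining step follows, assuming the rows are (patient_id, date_string) tuples with dates like "20Jun2009" and English month abbreviations; the sample rows and the names sort_key and numbered are illustrative, not from the answer. It sorts on a (patient_id, date) tuple instead of a formatted string, then restarts a counter for each patient with itertools.groupby and enumerate.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical sample rows: (patient_id, encounter date), in no particular order.
rows = [
    (987, "02May2010"), (145, "24Jun2009"), (582, "21Feb2008"),
    (145, "20Jun2009"), (987, "14Mar2010"), (582, "09Feb2008"),
    (145, "15Jul2009"), (987, "12May2010"),
]

def sort_key(row):
    # Tuples compare element by element: patient first, then the parsed date.
    patient_id, date_string = row
    return patient_id, datetime.strptime(date_string, "%d%b%Y")

rows_sorted = sorted(rows, key=sort_key)

# groupby() yields runs of consecutive rows with the same patient_id, which is
# why the rows must be sorted first; enumerate() restarts at 1 for each run.
numbered = []
for patient_id, group in groupby(rows_sorted, key=lambda row: row[0]):
    for sequence, (_, date_string) in enumerate(group, start=1):
        numbered.append((patient_id, date_string, sequence))

for row in numbered:
    print(row)
```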
