I come from a SQL background and frequently use the following data processing step:
- Partition the table of data by one or more fields
- For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending order
Example:
    df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                       'data1' : [1,2,2,3,3],
                       'data2' : [1,10,2,3,30]})
    df

       data1  data2 key1
    0      1      1    a
    1      2     10    a
    2      2      2    a
    3      3      3    b
    4      3     30    a
I am looking for the pandas equivalent of this SQL window function:
    RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)

       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4
I tried the following, which I got to work when there are no partitions:
    def row_number(frame,orderby_columns, orderby_direction,name):
        frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
        frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas), but the following did not work:
    df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()

    def nf(x):
        x['rn'] = list(xrange(len(x.index)))

    df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I did this.
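One thing that may be contributing to the failure: the lambda passed to `apply` returns whatever the in-place sort returns, and `nf` has no `return` statement, so both hand `None` back to `groupby(...).apply` instead of a frame. The small check below illustrates the first point. It is only a sketch, and it assumes a recent pandas where `sort_values` is used in place of `sort_index(by=...)`; the toy `frame` is made up for the demonstration.

```python
import pandas as pd

# Hypothetical toy frame, only for demonstrating the return value.
frame = pd.DataFrame({'data1': [2, 1, 3], 'data2': [5, 6, 7]})

# With inplace=True the frame is sorted in place and the call returns None,
# so a lambda that wraps this call passes None back to groupby(...).apply.
returned = frame.sort_values(['data1', 'data2'], ascending=[True, False], inplace=True)
print(returned)

# Without inplace, the call returns the sorted frame itself.
sorted_copy = frame.sort_values(['data1', 'data2'], ascending=[True, False])
print(type(sorted_copy).__name__)
```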
Ideally, there would be a succinct way to replicate the window function capability of SQL (I have figured out the window-based aggregates ... those are a one-liner in pandas) ... can someone share with me the most idiomatic way to number rows like this in pandas?
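For reference, here is a minimal sketch of one way the desired `RN` column might be produced. It assumes a recent pandas, where `DataFrame.sort_values` and `GroupBy.cumcount` are available: sorting covers the `ORDER BY` part, grouping covers `PARTITION BY`, and `cumcount` numbers the rows inside each group from 0. The intermediate name `ordered` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# ORDER BY data1 ASC, data2 DESC
ordered = df.sort_values(['data1', 'data2'], ascending=[True, False])

# PARTITION BY key1: cumcount numbers rows within each group starting at 0,
# so add 1 to match ROW_NUMBER(). The assignment aligns on the index,
# which leaves df in its original row order.
df['RN'] = ordered.groupby('key1').cumcount() + 1

print(df)
```

Because the result is assigned back by index, the original frame does not need to be re-sorted or reset, which avoids the `inplace` and `reset_index` steps in the attempt above.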