SQL-like window functions in pandas: row numbering in a Python pandas DataFrame

I come from a SQL background and often use the following data processing step:

  1. Partition the table of data by one or more fields
  2. For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending order

Example:

    df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
                       'data1' : [1,2,2,3,3],
                       'data2' : [1,10,2,3,30]})
    df

       data1  data2 key1
    0      1      1    a
    1      2     10    a
    2      2      2    a
    3      3      3    b
    4      3     30    a

I am looking for the pandas equivalent of this SQL window function:

    RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)

       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4

I tried the following, which works when there are no "partitions":

    def row_number(frame, orderby_columns, orderby_direction, name):
        frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
        frame[name] = list(xrange(len(frame.index)))

I tried to extend this idea to work with partitions (groups in pandas), but the following did not work:

    df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()

    def nf(x):
        x['rn'] = list(xrange(len(x.index)))

    df1['rn1'] = df1.groupby('key1').apply(nf)

But I just get a lot of NaNs when I do this.

Ideally, there would be a succinct way to replicate SQL's window function capability (I have figured out window-based aggregates; those are a one-liner in pandas). Can someone share the most idiomatic way to number rows like this in pandas?

+29
python numpy pandas dataframe
Jul 21 '13 at 19:16
5 answers

You can do this using groupby twice with the rank method:

 In [11]: g = df.groupby('key1') 

Use the method='min' argument so that rows sharing the same data1 value get the same RN:

    In [12]: g['data1'].rank(method='min')
    Out[12]:
    0    1
    1    2
    2    2
    3    1
    4    4
    dtype: float64

    In [13]: df['RN'] = g['data1'].rank(method='min')

Then group by these results and add the rank with respect to data2:

    In [14]: g1 = df.groupby(['key1', 'RN'])

    In [15]: g1['data2'].rank(ascending=False) - 1
    Out[15]:
    0    0
    1    0
    2    1
    3    0
    4    0
    dtype: float64

    In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

    In [17]: df
    Out[17]:
       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4

It feels like there ought to be a native way to do this (there may well be!).

+14
Jul 21 '13 at 21:24

You can also use sort_values(), groupby() and finally cumcount() + 1:

    df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
                 .groupby(['key1']) \
                 .cumcount() + 1
    print(df)

gives:

       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4

PS tested with pandas 0.18
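For anyone who wants this as a reusable function, here is a minimal sketch of the same sort_values / groupby / cumcount idea. The helper name row_number and its parameters are my own invention for illustration, not part of pandas. It relies on the fact that cumcount() returns a Series carrying the original index labels, so assigning it back to df aligns each number with the right row even though the numbering was done on a sorted copy.

```python
import pandas as pd

def row_number(frame, partition_by, order_by, ascending):
    """Hypothetical helper: 1-based row number within each partition."""
    # Sort a copy; the original frame and its index are left untouched.
    ordered = frame.sort_values(order_by, ascending=ascending)
    # cumcount() numbers rows 0..n-1 within each group in the sorted order
    # and keeps the original index labels on the result.
    return ordered.groupby(partition_by).cumcount() + 1

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# Assignment aligns on the index, so the sorted order does not matter here.
df['RN'] = row_number(df, ['key1'], ['data1', 'data2'], [True, False])
print(df)
```

As with ROW_NUMBER() in SQL, rows that tie on every ordering column still get distinct numbers, and which of them comes first should not be relied upon.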

+34
Apr 18 '16 at 21:18

You can use transform and rank together. Here is an example:

    df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                       'C2' : [1,2,3,4,5]})
    df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
    df


Take a look at the Pandas Rank method for more information.

+7
Jan 26 '18 at 2:10

pandas.lib.fast_zip() can create an array of tuples from a list of arrays. You can use this function to create a Series of tuples and then rank it:

    values = {'key1' : ['a','a','a','b','a','b'],
              'data1' : [1,2,2,3,3,3],
              'data2' : [1,10,2,3,30,20]}

    df = pd.DataFrame(values, index=list("abcdef"))

    def rank_multi_columns(df, cols, **kw):
        data = []
        for col in cols:
            if col.startswith("-"):
                flag = -1
                col = col[1:]
            else:
                flag = 1
            data.append(flag*df[col])
        values = pd.lib.fast_zip(data)
        s = pd.Series(values, index=df.index)
        return s.rank(**kw)

    rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))

    print rank

result:

    a    1
    b    2
    c    3
    d    2
    e    4
    f    1
    dtype: float64
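If you want a multi-column rank where tied rows share a number (closer to SQL's RANK() than ROW_NUMBER()), here is a rough sketch that uses only sort_values, groupby, cumcount and transform. This is a different technique from the tuple ranking above, not a rewrite of it. The helper name rank_multi and the sample data are made up for illustration, and the sketch assumes the key columns contain no NaN values.

```python
import pandas as pd

def rank_multi(frame, partition_by, order_by, ascending):
    """Hypothetical helper: rank within each partition; ties share the lowest number."""
    ordered = frame.sort_values(order_by, ascending=ascending)
    # Row number within each partition, following the sorted order.
    rn = ordered.groupby(partition_by).cumcount() + 1
    # Rows with identical partition and ordering values form one tie group;
    # every row in a tie group takes the smallest row number of that group.
    keys = [ordered[col] for col in partition_by + order_by]
    return rn.groupby(keys).transform('min')

# Illustrative data (not from the thread): rows 1 and 2 tie on every column.
df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'b'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [5, 7, 7, 1, 9]})

# Order by data1 ascending, then data2 descending, within each key1.
df['RANK'] = rank_multi(df, ['key1'], ['data1', 'data2'], [True, False])
print(df)
```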
0
Jul 22 '13 at 3:14

Use the groupby.rank function. Here is a working example.

    df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
    df

    C1  C2
    a   1
    a   2
    a   3
    b   4
    b   5

    df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
    df

    C1  C2  RANK
    a   1   1
    a   2   2
    a   3   3
    b   4   1
    b   5   2
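The method argument matters once the ordering column has ties. As a small sketch on the question's data (ordering by data1 only, which is my simplification), method='first' behaves roughly like ROW_NUMBER(), method='min' like RANK(), and method='dense' like DENSE_RANK(). The output column names are just labels I picked.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3]})

g = df.groupby('key1')['data1']

# Ties are broken by order of appearance, so every row gets a distinct number.
df['row_number'] = g.rank(method='first')
# Tied rows share the lowest rank, and the following rank is skipped.
df['rank'] = g.rank(method='min')
# Tied rows share a rank, and no rank is skipped afterwards.
df['dense_rank'] = g.rank(method='dense')

print(df)
```

Note that rank returns floats, so add .astype(int) if you need integer row numbers.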
0
Sep 04 '19 at 12:16


