Groupby in Python pandas: a faster way

I want to speed up a groupby in Python pandas. I have this code:

 df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len) 

The goal is to calculate the number of contracts a client has per month and add this information as a new column (Nbcontrats).

  • Client : client code
  • Month : month of data extraction
  • Contrat : contract number
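
For illustration only (the values below are made up, not my real data), the desired result on a tiny frame would look like this:

    import pandas as pd

    # Hypothetical toy data: client 1 has two contracts in month 3, client 2 has one
    df = pd.DataFrame({'Client':  [1, 1, 2],
                       'Month':   [3, 3, 3],
                       'Contrat': [10, 11, 12]})
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
    print(df)
    #    Client  Month  Contrat  Nbcontrats
    # 0       1      3       10           2
    # 1       1      3       11           2
    # 2       2      3       12           1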

I want to reduce this time. Below I am only working with a subset of my real data:

 %timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len) 1 loops, best of 3: 391 ms per loop df.shape Out[309]: (7464, 61) 

How can I improve the runtime?

+6
2 answers

Using the DataFrameGroupBy.size method:

    df.set_index(['Client', 'Month'], inplace=True)
    df['Nbcontrats'] = df.groupby(level=(0,1)).size()
    df.reset_index(inplace=True)

Most of the work goes into getting the result back into a column of the original DataFrame: the group sizes come back as a Series indexed by (Client, Month), and assigning it to a column aligns it with the rows via that index.
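
A minimal alternative sketch (my own assumption, not part of the answer above): if your pandas version accepts the 'size' string in transform, the counts can be broadcast back to the rows directly, without touching the index. 'size' dispatches to a fast built-in aggregation, so it is usually much quicker than transform(len):

    # Alternative sketch: broadcast group sizes straight back to the rows.
    # Assumes a pandas version where transform accepts the 'size' alias.
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')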

+2

Here is one way:

  • Slice the relevant columns (['Client', 'Month']) out of the input dataframe into a NumPy array. This is mostly a performance-focused step, because later on we will use NumPy functions, which are optimized to work on NumPy arrays.

  • Convert the data from those two columns into a single 1D array of equivalent linear indices, treating the elements of the two columns as pairs. We can think of the elements from 'Client' as row indices and the elements from 'Month' as column indices, so this is like going from 2D to 1D. The open question is the shape of the 2D grid used for such a mapping; to cover all possible pairs, a safe choice is a grid that is one larger than the maximum along each column (because of Python's 0-based indexing). That gives us the linear indices.

  • Then we tag each linear index according to its uniqueness among the others; these tags should correspond to the keys obtained with groupby. We also need the counts of each group / unique key across the entire length of this 1D array. Finally, indexing into the counts with those tags maps the corresponding count onto each element.

That’s the whole idea! Here's the implementation -

    # Save relevant columns as a NumPy array for performing NumPy operations afterwards
    arr_slice = df[['Client', 'Month']].values

    # Get linear indices equivalent of those columns
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)

    # Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)

    # Index counts with the unique tags to map the counts across all elements
    df["Nbcontrats"] = counts[unqtags]
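
To make the linear-index step concrete, here is a small sketch with made-up toy values (purely illustrative, not data from the question):

    import numpy as np

    # Toy (Client, Month) pairs: client 1 appears twice in month 3
    arr_slice = np.array([[1, 3],
                          [1, 3],
                          [2, 4],
                          [0, 3]])

    # Grid shape is max+1 along each column, so every pair gets a unique key
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
    print(lidx)                    # [ 8  8 14  3]

    # unqtags labels each row with its group, counts holds the group sizes
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    print(counts[unqtags])         # [2 2 1 1]  <- per-row group sizes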

Runtime test

1) Define functions:

    def original_app(df):
        df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

    def vectorized_app(df):
        arr_slice = df[['Client', 'Month']].values
        lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
        unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
        df["Nbcontrats"] = counts[unqtags]

2) Confirm the results:

    In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
         ...: arr = np.random.randint(0, 100, (10000, 3))
         ...: df = pd.DataFrame(arr, columns=['Client', 'Month', 'Contrat'])
         ...: df1 = df.copy()
         ...:
         ...: # Run the functions on the inputs
         ...: original_app(df)
         ...: vectorized_app(df1)

    In [144]: np.allclose(df["Nbcontrats"], df1["Nbcontrats"])
    Out[144]: True

3) Finally, the timings:

    In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
         ...: arr = np.random.randint(0, 100, (10000, 3))
         ...: df = pd.DataFrame(arr, columns=['Client', 'Month', 'Contrat'])
         ...: df1 = df.copy()

    In [146]: %timeit original_app(df)
    1 loops, best of 3: 645 ms per loop

    In [147]: %timeit vectorized_app(df1)
    100 loops, best of 3: 2.62 ms per loop
+9
