NumPy: sum one array based on the values in another array, for each corresponding element in a third array

I have two NumPy arrays, one containing values and one containing the category of each value.

    values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

I have another array containing the unique categories that I would like to sum over.

 categories=np.array([101,102,201,202,301,302]) 

My problem is that I will run this same summation several billion times, so every microsecond matters.

My current implementation is as follows.

    catsums = []
    for x in categories:
        catsums.append(np.sum(values[np.where(valcats == x)]))

The resulting catsums should be:

    [1, 14, 7, 8, 12, 13]

My current runtime is about 5 μs. I'm still a little new to Python and was hoping to find a faster solution, potentially by combining the first two arrays, using a lambda, or something cool that I don't even know about.

Thanks for reading!

python arrays numpy pandas
2 answers

@Divakar just posted a very good answer. If you already have an array of the distinct categories, I would use @Divakar's answer. If you have not already determined the unique values, I would use mine.


I would use pd.factorize to convert the categories into integer codes. Then use np.bincount with the weights parameter set to the values array.

    f, u = pd.factorize(valcats)
    np.bincount(f, values).astype(values.dtype)

    array([ 1, 12,  7, 14, 13,  8])

pd.factorize also returns the unique values, here in the u variable. We can line the results up with u to check that we arrived at the right answer.

    np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

    array([[101,   1],
           [301,  12],
           [201,   7],
           [102,  14],
           [302,  13],
           [202,   8]])

You can make it more obvious by using pd.Series:

    f, u = pd.factorize(valcats)
    pd.Series(np.bincount(f, values).astype(values.dtype), u)

    101     1
    301    12
    201     7
    102    14
    302    13
    202     8
    dtype: int64

Why pd.factorize and not np.unique?

We could do it equivalently with

  u, f = np.unique(valcats, return_inverse=True) 

But np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For large datasets, pd.factorize will win on performance.
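For completeness, here is a minimal sketch of the np.unique route carried through to the per-category sums, using the arrays from the question. The names u_sorted, inv and sums_sorted are illustrative choices, not from the original answer. Because np.unique returns its unique values sorted, the sums come out in ascending category order rather than in order of first appearance.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# u_sorted: unique categories in ascending order
# inv: for each element of valcats, its index into u_sorted
u_sorted, inv = np.unique(valcats, return_inverse=True)

# bincount with weights sums the values that share an index;
# the result is float64, so cast back to the dtype of values
sums_sorted = np.bincount(inv, weights=values).astype(values.dtype)

for cat, total in zip(u_sorted.tolist(), sums_sorted.tolist()):
    print(cat, total)
```

With this input the sorted order happens to match the categories array from the question, so the sums line up with the expected catsums directly.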


You can use searchsorted and bincount:

 np.bincount(np.searchsorted(categories, valcats), values) 
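As a self-contained sketch of the one-liner above, with the question's arrays filled in: this relies on two assumptions worth stating. First, categories must be sorted in ascending order, since np.searchsorted does a binary search. Second, every entry of valcats must actually appear in categories, because searchsorted would otherwise silently map it to a neighbouring slot. The minlength argument and the final cast are additions for illustration, not part of the original answer.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # assumed sorted ascending

# Position of each element's category within the sorted categories array
idx = np.searchsorted(categories, valcats)

# Sum the values per position; minlength keeps a slot for any trailing
# category that has no values at all
catsums = np.bincount(idx, weights=values, minlength=len(categories))

# bincount with weights returns float64, so cast back to the input dtype
catsums = catsums.astype(values.dtype)

print(catsums.tolist())  # [1, 14, 7, 8, 12, 13]
```

If categories is fixed across the billions of calls, it only needs to be sorted once up front, which leaves just the searchsorted and bincount calls in the hot path.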
