NumPy: sum one array based on the values in another array for each matching element in a third array

Question


I have two NumPy arrays: one contains the values and the other contains the category of each value.

 import numpy as np

 values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
 valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

I have a third array containing the unique categories I would like to sum across.

 categories=np.array([101,102,201,202,301,302])

My problem is that I will perform the same summation process several billion times, and every microsecond matters.

My current implementation is as follows.

 catsums = []
 for x in categories:
     catsums.append(np.sum(values[np.where(valcats == x)]))

The resulting catsums should be:

 [1, 14, 7, 8, 12, 13]

My current runtime is about 5 μs. I'm still fairly new to Python and was hoping to find a faster solution, perhaps by combining the first two arrays, using a lambda, or something cool that I don't even know about.

Thanks for reading!

+7

python arrays numpy pandas

hrschbck Jul 23 '17 at 16:24


2 answers

piRSquared · Answer 1 · 2017-07-23T16:29:30+0000

@Divakar just posted a very good answer. If you already have the array of categories defined, I would use @Divakar's answer. If you do not already have the unique values defined, I would use mine.

I would use pd.factorize to factorize the categories. Then use np.bincount with the weights parameter, which will be the values array.

 import pandas as pd

 f, u = pd.factorize(valcats)
 np.bincount(f, values).astype(values.dtype)

 array([ 1, 12,  7, 14, 13,  8])

pd.factorize also returns the unique values in the variable u. We can line up the results with u to check that we have arrived at the correct solution.

 np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

 array([[101,   1],
        [301,  12],
        [201,   7],
        [102,  14],
        [302,  13],
        [202,   8]])

You can make it more obvious using pd.Series

 f, u = pd.factorize(valcats)
 pd.Series(np.bincount(f, values).astype(values.dtype), u)

 101     1
 301    12
 201     7
 102    14
 302    13
 202     8
 dtype: int64
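If the sums are needed in the order of the question's categories array rather than in order of first appearance, one option is to reindex the labelled result. This is only a minimal sketch using the arrays from the question. The names sums and catsums_by_cat are made up for illustration, and the reindex step is my addition, not part of the answer above.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])

# f holds one integer code per element; u holds the unique categories
# in order of first appearance
f, u = pd.factorize(valcats)

# A weighted bincount sums `values` per code. The result is float64,
# so cast it back to the dtype of `values`
sums = np.bincount(f, weights=values).astype(values.dtype)

# Label the sums with their categories, then reorder to match `categories`
# (hypothetical extra step, assuming every category occurs in valcats)
catsums_by_cat = pd.Series(sums, index=u).reindex(categories)
print(catsums_by_cat.tolist())  # [1, 14, 7, 8, 12, 13]
```

The reindex goes through pandas and adds overhead, so it is probably better suited to checking results than to a loop where every microsecond counts.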

Why pd.factorize and not np.unique?

We could do it equivalently with

  u, f = np.unique(valcats, return_inverse=True)

But np.unique sorts the values, and that runs in n log n time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger datasets, pd.factorize will dominate performance.
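To make the difference concrete, here is a small side-by-side sketch on the question's data. Variable names such as u_np and f_pd are my own. It only shows that the two approaches yield the same per-category sums in a different order. It does not measure speed, which is worth timing on your own data.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# np.unique returns the uniques first (sorted), then the inverse indices
u_np, f_np = np.unique(valcats, return_inverse=True)
sums_np = np.bincount(f_np, weights=values).astype(values.dtype)

# pd.factorize returns the codes first, then the uniques
# in order of first appearance (unsorted)
f_pd, u_pd = pd.factorize(valcats)
sums_pd = np.bincount(f_pd, weights=values).astype(values.dtype)

print(u_np.tolist(), sums_np.tolist())
# [101, 102, 201, 202, 301, 302] [1, 14, 7, 8, 12, 13]
print(u_pd.tolist(), sums_pd.tolist())
# [101, 301, 201, 102, 302, 202] [1, 12, 7, 14, 13, 8]

# Both give the same category -> sum mapping
by_np = dict(zip(u_np.tolist(), sums_np.tolist()))
by_pd = dict(zip(u_pd.tolist(), sums_pd.tolist()))
print(by_np == by_pd)  # True
```

Note the swapped order of the return values: np.unique gives (uniques, inverse), while pd.factorize gives (codes, uniques).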

Divakar · Answer 2 · 2017-07-23T16:43:39+0000

You can use searchsorted and bincount -

 np.bincount(np.searchsorted(categories, valcats), values)
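For reference, a self-contained sketch of the one-liner above on the question's data. It assumes that categories is sorted and that every element of valcats actually appears in categories, since np.searchsorted only returns insertion positions and does not check for a match. The minlength argument and the final cast are my additions, not part of the answer.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # assumed sorted

# Position of each element's category within the sorted `categories`
idx = np.searchsorted(categories, valcats)

# Weighted bincount sums `values` per position. minlength keeps a slot
# for trailing categories that happen not to occur in valcats
catsums = np.bincount(idx, weights=values, minlength=len(categories))

# With weights, bincount returns float64; cast back if integers are needed
print(catsums.astype(values.dtype).tolist())  # [1, 14, 7, 8, 12, 13]
```

The output is already in the order of categories, so no realignment step is needed.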
