@Divakar just posted a very good answer. If you already have an array of the unique categories, I would use @Divakar's answer. If you have not already identified the unique values, I would use mine.
I would use pd.factorize to factorize the categories, then np.bincount with its weights parameter set to the values array:
    f, u = pd.factorize(valcats)
    np.bincount(f, values).astype(values.dtype)

    array([ 1, 12,  7, 14, 13,  8])
pd.factorize also returns the unique values in the u variable. We can line the results up with u to verify that we have arrived at the correct solution.
    np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

    array([[101,   1],
           [301,  12],
           [201,   7],
           [102,  14],
           [302,  13],
           [202,   8]])
You can make this even more obvious with pd.Series:
    f, u = pd.factorize(valcats)
    pd.Series(np.bincount(f, values).astype(values.dtype), u)

    101     1
    301    12
    201     7
    102    14
    302    13
    202     8
    dtype: int64
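If you want to run the whole recipe end to end, here is a minimal self-contained sketch. The valcats and values arrays below are made-up illustrative data (the real ones come from the question), but the steps are exactly the ones above: factorize, weighted bincount, then wrap in a Series.

    import numpy as np
    import pandas as pd

    # Hypothetical example data; the real valcats/values come from the question
    valcats = np.array([101, 301, 201, 101, 301, 201])
    values  = np.array([  1,   5,   3,   0,   7,   4])

    # f: 0-based integer code for each category, u: unique categories in order of appearance
    f, u = pd.factorize(valcats)

    # Sum `values` within each category code; cast back to the original dtype
    sums = np.bincount(f, weights=values).astype(values.dtype)

    # Label the grouped sums with their categories
    print(pd.Series(sums, index=u))
    # 101     1
    # 301    12
    # 201     7
    # dtype: int64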
Why pd.factorize and not np.unique?

We could do this equivalently with

    u, f = np.unique(valcats, return_inverse=True)
But np.unique sorts the values and runs in O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger datasets, pd.factorize will dominate in performance.
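If you want to check this claim on your own data, a rough timing sketch could look like the following. The array here is hypothetical random data; absolute numbers will vary by machine and by the number of distinct categories.

    import numpy as np
    import pandas as pd
    from timeit import timeit

    # Hypothetical large array of categories drawn from a modest number of labels
    valcats = np.random.randint(0, 1000, size=1_000_000)

    # Time just the factorization step of each approach
    t_factorize = timeit(lambda: pd.factorize(valcats), number=10)
    t_unique    = timeit(lambda: np.unique(valcats, return_inverse=True), number=10)

    print(f"pd.factorize: {t_factorize:.3f}s  np.unique: {t_unique:.3f}s")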