Extending numpy.digitize to multidimensional data

Question

Extending numpy.digitize to multidimensional data

I have a set of large arrays (about 6 million elements each) that I want to basically do np.digitize, but with several axes. I am looking for some suggestions on how to do this effectively, and how to save the results.

I need all the indexes (or all values or mask) of array A, where the values of array B are in a range, and the values of array C are in a different range, and D is in another. I want either values, indexes, or a mask so that I can make some of the statistics that are not yet defined on the values of array A in each box. I will also need the number of items in each box, but len() can do this.

Here is one example that I developed that seems reasonable:

 import itertools import numpy as np A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb), (C_Cidx==cc), (D_Didx==dd))]]])

This, however, makes me nervous that I don’t have enough memory on large arrays.

I could also do it like this:

 b_inds = np.empty((len(A), 10), dtype=np.bool) c_inds = np.empty((len(A), 12), dtype=np.bool) d_inds = np.empty((len(A), 24), dtype=np.bool) for i in range(10): b_inds[:,i] = B_Bidx = i for i in range(12): c_inds[:,i] = C_Cidx = i for i in range(24): d_inds[:,i] = D_Didx = i # get the A data for the 1,2,3 B,C,D bin print A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]]

At least here the exit has a known and constant size.

Does anyone have any better thoughts on how to make this smarter? Or clarification that is needed?

Based on HYRY's answer, this is the path I decided to take.

 import numpy as np import pandas as pd np.random.seed(42) A = np.random.random_sample(1e7) B = (np.random.random_sample(1e7) + 10)*20 C = (np.random.random_sample(1e7) + 20)*40 D = (np.random.random_sample(1e7) + 80)*80 # make the edges of the bins we want Bbins = np.linspace(B.min(), B.max(), 9) Cbins = np.linspace(C.min(), C.max(), 10) # note different number Dbins = np.linspace(D.min(), D.max(), 11) # note different number sA = pd.Series(A) cB = pd.cut(B, Bbins, include_lowest=True) cC = pd.cut(C, Cbins, include_lowest=True) cD = pd.cut(D, Dbins, include_lowest=True) dat = pd.DataFrame({'A':A, 'cB':cB.labels, 'cC':cC.labels, 'cD':cD.labels}) g = sA.groupby([cB.labels, cC.labels, cD.labels]).indices # this then gives all the indices that match the group print g[0,1,2] # this is all the array A data for that B,C,D bin print sA[g[0,1,2]]

This method seems lightning fast even for huge arrays.

+7

python numpy

Brian larsen Jul 08 '14 at 23:40

source share

1 answer

Hyry · Accepted Answer · 2014-07-09T02:33:40+0000

How about using groupby in Pandas. First fix some problems in your code:

 import itertools import numpy as np np.random.seed(42) A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]]) a_bins[1000]

exit:

 [(4, 6, 17), array([ 0.70723863, 0.907611 , 0.46214047])]

Here is the code that returns the same Pandas result:

 import pandas as pd cB = pd.cut(B, 9) cC = pd.cut(C, 11) cD = pd.cut(D, 23) sA = pd.Series(A) g = sA.groupby([cB.labels, cC.labels, cD.labels]) g.get_group((3, 5, 16))

exit:

 800 0.707239 2320 0.907611 9388 0.462140 dtype: float64

If you want to calculate the statistics of each group, you can call the g method, for example:

 g.mean()

returns:

 0 0 0 0.343566 1 0.410979 2 0.700007 3 0.189936 4 0.452566 5 0.565330 6 0.539565 7 0.530867 8 0.568120 9 0.587762 11 0.352453 12 0.484903 13 0.477969 14 0.484328 15 0.467357 ... 8 10 8 0.559859 9 0.570652 10 0.656718 11 0.353938 12 0.628980 13 0.372350 14 0.404543 15 0.387920 16 0.742292 17 0.530866 18 0.389236 19 0.628461 20 0.387384 21 0.541831 22 0.573023 Length: 2250, dtype: float64

Extending numpy.digitize to multidimensional data

More articles: