I have a set of large arrays (about 6 million elements each) on which I basically want to do np.digitize, but along several axes. I am looking for suggestions on how to do this effectively, and on how to store the results.
I need all the indexes (or all the values, or a mask) of array A where the values of array B fall in one range, the values of array C fall in another range, and D in yet another. I want either values, indexes, or a mask so that I can compute some statistics (not yet decided on) over the values of A in each box. I will also need the number of items in each box, but len() can handle that.
Here is one example I worked out that seems reasonable:
    import itertools
    import numpy as np

    A = np.random.random_sample(10_000)
    B = (np.random.random_sample(10_000) + 10) * 20
    C = (np.random.random_sample(10_000) + 20) * 40
    D = (np.random.random_sample(10_000) + 80) * 80
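The snippet above only builds the test data; below is a minimal sketch of the rest of the idea, assuming equal-width bins, np.digitize along each axis, and the 10/12/24 bin counts used in the alternative further down. The bin edges and the collection loop are illustrative, not the exact original code:

    # Equal-width bin edges; digitizing against the interior edges yields
    # indices 0..9, 0..11, 0..23 with no out-of-range values.
    Bbins = np.linspace(B.min(), B.max(), 10 + 1)
    Cbins = np.linspace(C.min(), C.max(), 12 + 1)
    Dbins = np.linspace(D.min(), D.max(), 24 + 1)

    B_Bidx = np.digitize(B, Bbins[1:-1])  # per-element bin index along B
    C_Cidx = np.digitize(C, Cbins[1:-1])
    D_Didx = np.digitize(D, Dbins[1:-1])

    # Collect the A values of every (B, C, D) box; itertools.product walks
    # all 10 * 12 * 24 bin combinations.  This is the part that can hurt on
    # big arrays, since every iteration builds a fresh boolean mask over
    # the full length of A.
    boxes = {}
    for b, c, d in itertools.product(range(10), range(12), range(24)):
        boxes[(b, c, d)] = A[(B_Bidx == b) & (C_Cidx == c) & (D_Didx == d)]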
This, however, makes me nervous that I won't have enough memory with the large arrays.
I could also do it like this:
    b_inds = np.empty((len(A), 10), dtype=bool)
    c_inds = np.empty((len(A), 12), dtype=bool)
    d_inds = np.empty((len(A), 24), dtype=bool)

    # B_Bidx, C_Cidx, D_Didx hold each element's per-axis bin index
    for i in range(10):
        b_inds[:, i] = B_Bidx == i
    for i in range(12):
        c_inds[:, i] = C_Cidx == i
    for i in range(24):
        d_inds[:, i] = D_Didx == i

    # get the A data for the (1, 2, 3) B, C, D bin
    print(A[b_inds[:, 1] & c_inds[:, 2] & d_inds[:, 3]])
At least here the output has a known, constant size.
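(At NumPy's one byte per bool, those three tables occupy len(A) × (10 + 12 + 24) bytes, so about 6e6 × 46 ≈ 276 MB for the real 6-million-element arrays: large, but fixed and known in advance, unlike the per-box lists of the first approach.)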
Does anyone have better ideas on how to do this more cleverly? Or is there anything I should clarify?
Based on HYRY's answer, this is the path I decided to take.
    import numpy as np
    import pandas as pd

    np.random.seed(42)

    A = np.random.random_sample(10_000_000)
    B = (np.random.random_sample(10_000_000) + 10) * 20
    C = (np.random.random_sample(10_000_000) + 20) * 40
    D = (np.random.random_sample(10_000_000) + 80) * 80
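The digitize-plus-groupby step isn't shown above; here is a hypothetical sketch of what that path presumably looks like, reusing the binning pattern from the first example and grouping with pandas (the column names and the aggregations are illustrative):

    # Bin each axis as before, then let groupby collect every (B, C, D)
    # box in a single pass instead of 2880 separate mask evaluations.
    Bbins = np.linspace(B.min(), B.max(), 10 + 1)
    Cbins = np.linspace(C.min(), C.max(), 12 + 1)
    Dbins = np.linspace(D.min(), D.max(), 24 + 1)

    df = pd.DataFrame({
        'A': A,
        'Bidx': np.digitize(B, Bbins[1:-1]),
        'Cidx': np.digitize(C, Cbins[1:-1]),
        'Didx': np.digitize(D, Dbins[1:-1]),
    })

    groups = df.groupby(['Bidx', 'Cidx', 'Didx'])['A']

    counts = groups.size()   # number of items in each box
    means = groups.mean()    # any other reduction works the same way

    # the A values of the (1, 2, 3) box, as in the mask version above
    print(groups.get_group((1, 2, 3)).values)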
This method seems lightning fast even for huge arrays.