The most efficient way to implement numpy.in1d ​​for tiered arrays

What is the best way to implement a function that takes an arbitrary number of 1d arrays and returns a tuple containing indices of the corresponding values ​​(if any).

Here is some pseudo code of what I want to do:

a = np.array([1, 0, 4, 3, 2]) b = np.array([1, 2, 3, 4, 5]) c = np.array([4, 2]) (ind_a, ind_b, ind_c) = return_equals(a, b, c) # ind_a = [2, 4] # ind_b = [1, 3] # ind_c = [0, 1] (ind_a, ind_b, ind_c) = return_equals(a, b, c, sorted_by=a) # ind_a = [2, 4] # ind_b = [3, 1] # ind_c = [0, 1] def return_equals(*args, sorted_by=None): ... 
+5
source share
4 answers

You can use numpy.intersect1d with reduce for this:

 def return_equals(*arrays): matched = reduce(np.intersect1d, arrays) return np.array([np.where(np.in1d(array, matched))[0] for array in arrays]) 

reduce may be a little slower here, because here we create intermediate NumPy arrays (for a lot of input, it can be very slow), we can prevent this if we use Python set and its .intersection() method:

 matched = np.array(list(set(arrays[0]).intersection(*arrays[1:]))) 

Associated GitHub ticket: n-massive versions of jobs, especially intersect1d

+5
source

To start, I’ll try:

 def return_equals(*args): x=[] c=args[-1] for a in args: x.append(np.nonzero(np.in1d(a,c))[0]) return x 

If I add d=np.array([1,0,4,3,0]) (it has only 1 match, and what is no match?)

then

 return_equals(a,b,d,c) 

gives:

 [array([2, 4], dtype=int32), array([1, 3], dtype=int32), array([2], dtype=int32), array([0, 1], dtype=int32)] 

Since the length of both the input and return arrays may vary, you really cannot vectorize the problem. That is, to perform the operation on all inputs, some special gymnastics is required. And if the number of arrays is small compared to their typical length, I would not worry about speed. Iterating several times is not expensive. It iterates over 100 values ​​that are expensive.

You could, of course, pass the keyword arguments to in1d .

It is not clear what you are trying to do with the sorted_by parameter. Is that something you could just as easily apply to arrays before passing them to this function?


List the version of this iteration:

  [np.nonzero(np.in1d(x,c))[0] for x in [a,b,d,c]] 

I can imagine how to combine arrays into one long one using in1d and then breaking it into subarrays. There is np.split , but this requires you to specify how many elements should be placed in each sublist. This means that somehow the number of matches for each argument is determined. Doing this without a loop can be tricky.

Parts for this (which still need to be packaged as a function):

 args=[a,b,d,c] lens=[len(x) for x in args] abc=np.concatenate(args) C=np.cumsum(lens) I=np.nonzero(np.in1d(abc,c))[0] S=np.split(I,(2,4,5)) [S[0],S[1]-C[0],S[2]-C[1],S[3]-C[2]] I # array([ 2, 4, 6, 8, 12, 15, 16], dtype=int32) C # array([ 5, 10, 15, 17], dtype=int32) 

(2,4,5) - the number of elements I between successive values ​​of C , that is, the number of elements that correspond to each of a , b , ...

0
source

In Python:

 def return_equal(*args): rtr=[] for i, arr in enumerate(args): rtr.append([j for j, e in enumerate(arr) if all(e in a for a in args[0:i]) and all(e in a for a in args[i+1:])]) return rtr >>> return_equal(a,b,c) [[2, 4], [1, 3], [0, 1]] 
0
source

This solution basically combines all 1D input arrays into one large 1D array with the intention of performing the required operations in a vectorized manner . The only place where he uses the loop is at the beginning, where he gets the lengths of the input arrays, which should be minimal at the cost of runtime.

Here's the implementation of the function -

 import numpy as np def return_equals(*argv): # Concatenate input arrays into one big array for vectorized processing A = np.concatenate((argv[:])) # lengths of input arrays narr = len(argv) lens = np.zeros((1,narr),int).ravel() for i in range(narr): lens[i] = len(argv[i]) N = A.size # Start indices of each group of identical elements from different input arrays # in a sorted version of the huge concatenated input array start_idx = np.where(np.append([True],np.diff(np.sort(A))!=0))[0] # Runlengths of islands of identical elements runlens = np.diff(np.append(start_idx,N)) # Starting and all indices of the positions in concatenate array that has # islands of identical elements which are present across all input arrays good_start_idx = start_idx[runlens==narr] good_all_idx = good_start_idx[:,None] + np.arange(narr) # Get offsetted indices and sort them to get the desired output idx = np.argsort(A)[good_all_idx] - np.append([0],lens[:-1].cumsum()) return np.sort(idx.T,1) 
0
source

All Articles