Find indices of a large array if it contains values ​​in a smaller array

Is there a quick numpy function to return a list of indices in a larger array, where does it correspond to values ​​from a smaller array? The smaller array is ~ 30M and greater than 800M, so I want to avoid a loop for numpy.where calls.

The problem with searchsorted is that it will return results, even if they are not an exact match, it just gives the closest index, but I only need indexes where there are exact matches

instead of this:

 >>> a = array([1,2,3,4,5]) >>> b = array([2,4,7]) >>> searchsorted(a,b) array([1, 3, 5]) 

I would like to:

 >>> a = array([1,2,3,4,5]) >>> b = array([2,4,7]) >>> SOMEFUNCTION(a,b) array([1, 3]) 

EDIT: the set of values ​​in both small and large arrays is always unique and sorted.

+4
source share
2 answers

You can use np.in1d to find those elements of a that are in b . To find the index, use one call to np.where :

 In [34]: a = array([1,2,3,4,5]) In [35]: b = array([2,4,7]) In [36]: np.in1d(a, b) Out[38]: array([False, True, False, True, False], dtype=bool) In [39]: np.where(np.in1d(a, b)) Out[39]: (array([1, 3]),) 

Since a and b already sorted, you can use

 In [57]: np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left') Out[57]: array([False, True, False, True, False], dtype=bool) 

instead of np.in1d(a, b) . For large a and b using searchsorted can be faster:

 import numpy as np a = np.random.choice(10**7, size=10**6, replace=False) a.sort() b = np.random.choice(10**7, size=10**5, replace=False) b.sort() In [53]: %timeit np.in1d(a, b) 10 loops, best of 3: 176 ms per loop In [54]: %timeit np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left') 10 loops, best of 3: 106 ms per loop 

Jaime and Divakar have proposed some significant improvements regarding the method shown above. Here is some code that checks that all methods return the same result, and then some control values:

 import numpy as np a = np.random.choice(10**7, size=10**6, replace=False) a.sort() b = np.random.choice(10**7, size=10**5, replace=False) b.sort() def using_searchsorted(a, b): return (np.where(np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left')))[0] def using_in1d(a, b): return np.where(np.in1d(a, b))[0] def using_searchsorted_divakar(a, b): idx1 = np.searchsorted(a,b,'left') idx2 = np.searchsorted(a,b,'right') out = idx1[idx1 != idx2] return out def using_jaime_mask(haystack, needle): idx = np.searchsorted(haystack, needle) mask = idx < haystack.size mask[mask] = haystack[idx[mask]] == needle[mask] idx = idx[mask] return idx expected = using_searchsorted(a, b) for func in (using_in1d, using_searchsorted_divakar, using_jaime_mask): result = func(a, b) assert np.allclose(expected, result) 

 In [29]: %timeit using_jaime_mask(a, b) 100 loops, best of 3: 13 ms per loop In [28]: %timeit using_searchsorted_divakar(a, b) 10 loops, best of 3: 21.7 ms per loop In [26]: %timeit using_searchsorted(a, b) 10 loops, best of 3: 109 ms per loop In [27]: %timeit using_in1d(a, b) 10 loops, best of 3: 173 ms per loop 
+5
source

The default seacrh direction with np.searchsorted is left . We can also look for it from the right direction, and those that are the same for both of these indices will be the ones that should be avoided from the indices deduced from the left option to get the desired result. The motivation here is the same as discussed in the @unutbu solution .

Thus, the implementation will look like this:

 idx1 = np.searchsorted(a,b,'left') idx2 = np.searchsorted(a,b,'right') out = idx1[idx1 != idx2] 
+4
source

All Articles