Comparing two numpy arrays of different lengths

I need to find, for each element of one array, the index of the first element of another array that is smaller than or equal to it. One way that works:

 import numpy
 a = numpy.array([10,7,2,0])
 b = numpy.array([10,9,8,7,6,5,4,3,2,1])
 indices = [numpy.where(a<=x)[0][0] for x in b]

indices then has the value [0, 1, 1, 1, 2, 2, 2, 2, 2, 3], which is what I need. The problem, of course, is that Python for loops are slow, and my arrays can have millions of elements. Is there any trick for this? The following does not work, because the arrays do not have the same length:

 indices = numpy.where(a<=b) #XXX: raises an exception 

Thanks!

+7
2 answers

This may be a special case, but you should be able to use numpy.digitize here. Note that the bins must be monotonically increasing or decreasing.

 >>> import numpy
 >>> a = numpy.array([10,7,2,0])
 >>> b = numpy.array([10,9,8,7,6,5,4,3,2,1])
 >>> [numpy.where(a<=x)[0][0] for x in b]
 [0, 1, 1, 1, 2, 2, 2, 2, 2, 3]
 >>> numpy.digitize(b,a)
 array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3])

Setup for a timing test:

 import numpy as np
 a = np.arange(50)[::-1]
 b = np.random.randint(0, 50, 1000)
 np.allclose([np.where(a<=x)[0][0] for x in b], np.digitize(b, a))
 Out[55]: True

Some timings:

 %timeit [np.where(a<=x)[0][0] for x in b]
 100 loops, best of 3: 4.97 ms per loop
 %timeit np.digitize(b,a)
 10000 loops, best of 3: 48.1 µs per loop

That is a speedup of roughly two orders of magnitude, though it will depend heavily on the number of bins. Your timings will vary.
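To make the monotonicity requirement concrete, here is a minimal sketch (variable names are mine) showing that monotonically decreasing bins give exactly the index asked for in the question:

```python
import numpy as np

a_desc = np.array([10, 7, 2, 0])                 # monotonically decreasing bins
b = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# With decreasing bins, digitize returns the smallest i with a_desc[i] <= x,
# which is exactly the index the question asks for.
print(np.digitize(b, a_desc))   # [0 1 1 1 2 2 2 2 2 3]
```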


To compare with Jamie's answer, I timed the following two code snippets. Since I mainly wanted to focus on the speed of searchsorted vs digitize, I trimmed Jamie's code a bit. The relevant snippet:

 a = np.arange(size_a)[::-1]
 b = np.random.randint(0, size_a, size_b)
 ja = np.take(a, np.searchsorted(a, b, side='right', sorter=a)-1)
 # Compare to digitize
 if not np.allclose(ja, np.digitize(b, a)):
     print('Comparison failed')
 timing_digitize[num_a, num_b] = timeit.timeit(
     'np.digitize(b, a)',
     'import numpy as np; from __main__ import a, b', number=3)
 timing_searchsorted[num_a, num_b] = timeit.timeit(
     'np.take(a, np.searchsorted(a, b, side="right", sorter=a)-1)',
     'import numpy as np; from __main__ import a, b', number=3)
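For reference, a self-contained version of the comparison; the sizes here are illustrative, not the full grid that was swept for the plot:

```python
import timeit

import numpy as np

# Illustrative sizes; the actual benchmark swept a grid of (size_a, size_b).
size_a, size_b = 1000, 10000
a = np.arange(size_a)[::-1]            # descending bins; a is its own sorter here
b = np.random.randint(0, size_a, size_b)

# Sanity check: both approaches agree before timing them.
ja = np.take(a, np.searchsorted(a, b, side='right', sorter=a) - 1)
assert np.array_equal(ja, np.digitize(b, a))

t_digitize = timeit.timeit(lambda: np.digitize(b, a), number=3)
t_searchsorted = timeit.timeit(
    lambda: np.take(a, np.searchsorted(a, b, side='right', sorter=a) - 1),
    number=3)
print(f"digitize: {t_digitize:.6f}s  searchsorted: {t_searchsorted:.6f}s")
```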

This slightly exceeds my limited matplotlib ability, so the plot was made in DataGraph. I plotted the log of the ratio timing_digitize/timing_searchsorted, so values greater than zero mean searchsorted is faster and values less than zero mean digitize is faster; the colors also give the relative speeds. For example, in the upper right corner (a = 1E6, b = 1E6), digitize is ~300 times slower than searchsorted, while for smaller sizes digitize can be up to 10 times faster. The black line is the break-even point:

(figure: log of the timing_digitize/timing_searchsorted ratio over a range of array sizes) Raw searchsorted seems almost always faster for large inputs, but the simpler digitize syntax is nearly as good when the number of bins is small.

+11

This is dirty, but it works:

 >>> idx = np.argsort(a)
 >>> np.take(idx, np.searchsorted(a, b, side='right', sorter=idx)-1)
 array([0, 1, 1, 1, 2, 2, 2, 2, 2, 3], dtype=int64)

If your array is always sorted, you can get rid of the argsort call.
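Since a in the question is already sorted (descending), one sketch of how the argsort can be dropped: reverse a once, search the ascending view, and translate the positions back.

```python
import numpy as np

a = np.array([10, 7, 2, 0])                       # already sorted, descending
b = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# searchsorted needs ascending input, so search the reversed view; counting
# positions back from the right end gives the index in a itself.
pos = np.searchsorted(a[::-1], b, side='right')   # number of elements of a that are <= x
idx = len(a) - pos
print(idx)   # [0 1 1 1 2 2 2 2 2 3]
```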

+4
