Performance of various numpy fancy indexing methods, also with numba

Since my program needs fast indexing of NumPy arrays quite frequently, and fancy indexing doesn't have a good reputation performance-wise, I decided to do some tests. Especially since Numba is developing quite fast, I tried out which methods work well with numba.

As input, I used the following arrays for the test with small arrays:

    import numpy as np
    import numba as nb

    x = np.arange(0, 100, dtype=np.float64)  # array to be indexed
    idx = np.array((0, 4, 55, -1), dtype=np.int32)  # fancy indexing array
    bool_mask = np.zeros(x.shape, dtype=np.bool)  # boolean indexing mask
    bool_mask[idx] = True  # set same elements as in idx True
    y = np.zeros(idx.shape, dtype=np.float64)  # output array
    y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)  # bool output array (only for convenience)

And the following arrays for my test with large arrays (y_bool is needed here to cope with duplicate numbers from randint):

    x = np.arange(0, 1000000, dtype=np.float64)
    idx = np.random.randint(0, 1000000, size=int(1000000/50))
    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[idx] = True
    y = np.zeros(idx.shape, dtype=np.float64)
    y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)
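To illustrate why y_bool is needed at all (a small sketch, not part of the timed tests, using a hypothetical seeded generator for reproducibility): randint draws with replacement, so idx contains duplicates; the boolean mask collapses them, and the boolean-indexed result is therefore usually shorter than idx.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, just for a reproducible sketch
idx = rng.integers(0, 1_000_000, size=20_000)
mask = np.zeros(1_000_000, dtype=bool)
mask[idx] = True
# duplicate indices collapse to a single True in the mask,
# so the boolean result can never be longer than idx
assert np.count_nonzero(mask) <= idx.size
```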

This gives the following timings without using numba:

    %timeit x[idx]
    # 1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit x[bool_mask]
    # 482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit np.take(x, idx)
    # 2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    # large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit np.take(x, idx, out=y)
    # 2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    # large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit x.take(idx)
    # 919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit x.take(idx, out=y)
    # 1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit np.compress(bool_mask, x)
    # 1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit np.compress(bool_mask, x, out=y_bool)
    # 2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    # large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit x.compress(bool_mask)
    # 900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit x.compress(bool_mask, out=y_bool)
    # 1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit np.extract(bool_mask, x)
    # 5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    # large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And with Numba, using jitting in nopython mode with caching and nogil, I timed the indexing methods that Numba supports:

    @nb.jit(nopython=True, cache=True, nogil=True)
    def fancy(x, idx):
        x[idx]

    @nb.jit(nopython=True, cache=True, nogil=True)
    def fancy_bool(x, bool_mask):
        x[bool_mask]

    @nb.jit(nopython=True, cache=True, nogil=True)
    def taker(x, idx):
        np.take(x, idx)

    @nb.jit(nopython=True, cache=True, nogil=True)
    def ndtaker(x, idx):
        x.take(idx)

This gives the following results for small and large arrays:

    %timeit fancy(x, idx)
    # 686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit fancy_bool(x, bool_mask)
    # 845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit taker(x, idx)
    # 814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit ndtaker(x, idx)
    # 831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    # large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Summary

While for plain NumPy without Numba it is clear that small arrays are best indexed with boolean masks (about twice as fast as ndarray.take(idx)), for larger arrays ndarray.take(idx) works best, in this case about six times faster than boolean indexing. The break-even point is at an array size of around 1000 with an index size of around 20.
For arrays with 1e5 elements and an index size of 5e3, ndarray.take(idx) is about 10 times faster than boolean-mask indexing. So it seems that boolean indexing slows down considerably with array size, but catches up a little after some array-size threshold is reached.

For the Numba-jitted functions there is a small speedup for all indexing methods except boolean-mask indexing. Simple fancy indexing works best here, but is still slower than boolean masking without jitting.
For larger arrays, boolean-mask indexing is much slower than the other methods, and even slower than the non-jitted version. The three other methods all perform well and are about 15% faster than the non-jitted version.

For my case with many arrays of different sizes, fancy indexing with Numba is the best way to go. Perhaps some other people may also find useful information in this rather long post.

Edit:
I'm sorry that I forgot to ask my question, which I actually have. I just typed this quickly at the end of my workday and completely forgot about it... Well, do you know of any better and faster method than those that I tested? Using Cython, my timings were between those of Numba and plain Python.
As the index array is predefined once and used without alteration over long iterations, any way of predefining the indexing process would be great. For this I thought about using strides. But I wasn't able to predefine a custom set of strides. Is it possible to get a predefined view into memory using strides?
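For context on the strides idea (a minimal sketch, assuming NumPy's np.lib.stride_tricks.as_strided): a strided view can only express regular, constant-step patterns, which is presumably why a custom set of strides for an arbitrary index set couldn't be predefined.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(100, dtype=np.float64)
# a strided view can express a REGULAR pattern, e.g. every 10th element,
# without copying any data:
step = 10
view = as_strided(x, shape=(len(x) // step,), strides=(x.strides[0] * step,))
assert np.array_equal(view, x[::step])
# an arbitrary index set like (0, 4, 55, 99) has no constant stride,
# so it cannot be expressed as a strided view -- fancy indexing must copy.
```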

Edit 2:
I think I will move my question about predefined constant index arrays, which will be applied to a single value array (where only the values change, but not the shape) several million times over the iterations, to a new and more specific question. This question was too general, and perhaps I also formulated the question a little misleadingly. I will post a link here as soon as I've opened the new question! Here is the link to the follow-up question.

1 answer

Your summary isn't entirely correct: you did tests with differently sized arrays, but one thing you didn't do was change the number of indexed elements.

I restricted it to pure indexing and left out take (which is effectively integer-array indexing) and compress and extract (which are effectively boolean-array indexing). The only difference for these is the constant factors: the constant factor for the methods ndarray.take and ndarray.compress will be smaller than the overhead of the free functions np.take and np.compress, but otherwise the effects are negligible for reasonably sized arrays.
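The equivalence claimed above can be checked directly (a small sketch with arbitrary example values): take and compress return exactly the same results as integer-array and boolean-array indexing, respectively.

```python
import numpy as np

x = np.arange(100, dtype=np.float64)
idx = np.array([0, 4, 55, 99])
mask = np.zeros(x.shape, dtype=bool)
mask[idx] = True

# take is integer-array indexing, compress/extract are boolean-array
# indexing -- same results, only the constant overhead differs
assert np.array_equal(x.take(idx), x[idx])
assert np.array_equal(x.compress(mask), x[mask])
assert np.array_equal(np.extract(mask, x), x[mask])
```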

So let me present it with different numbers of indexed elements:

    # ~ every 500th element
    x = np.arange(0, 1000000, dtype=np.float64)
    idx = np.random.randint(0, 1000000, size=int(1000000/500))  # changed the ratio!
    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[idx] = True

    %timeit x[idx]
    # 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit x[bool_mask]
    # 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # ~ every 50th element
    idx = np.random.randint(0, 1000000, size=int(1000000/50))  # changed the ratio!
    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[idx] = True

    %timeit x[idx]
    # 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit x[bool_mask]
    # 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    # ~ every 5th element
    idx = np.random.randint(0, 1000000, size=int(1000000/5))  # changed the ratio!
    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[idx] = True

    %timeit x[idx]
    # 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %timeit x[bool_mask]
    # 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So what happened here? It's quite simple: integer-array indexing only needs to access as many elements as there are values in the index array. That means it's quite fast if there are only a few matches, but slow if there are many indices. Boolean-array indexing, however, always needs to walk through the entire boolean array and check for True values. That means it should be roughly "constant" for a given array size.
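The difference in work done can be sketched as two naive loops (illustrative only, not how NumPy's C internals are written): the integer-index loop touches len(idx) elements, while the boolean-index loop must inspect all len(x) mask entries regardless of how many are True.

```python
import numpy as np

def integer_index(x, idx):
    # touches exactly len(idx) elements of x
    out = np.empty(len(idx), dtype=x.dtype)
    for i, j in enumerate(idx):
        out[i] = x[j]
    return out

def boolean_index(x, mask):
    # must inspect ALL len(x) mask entries, however many are True
    out = []
    for j in range(len(x)):
        if mask[j]:
            out.append(x[j])
    return np.array(out, dtype=x.dtype)

x = np.arange(100, dtype=np.float64)
idx = np.array([3, 7, 42])
mask = np.zeros(x.shape, dtype=bool)
mask[idx] = True
assert np.array_equal(integer_index(x, idx), x[idx])
assert np.array_equal(boolean_index(x, mask), x[mask])
```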

But wait: it isn't really constant for boolean arrays, and why does integer-array indexing take longer (in the last case) than boolean-array indexing, even though it has to process ~5 times fewer elements?

That's where it gets more complicated. In this case the boolean array had True at random places, which means it is subject to branch-prediction failures. These become more likely when True and False occur equally often but at random places. That's why the boolean-array indexing got slower: the ratio of True to False became more even and thus more "random". Also, the result array will be larger if there are more Trues, which also consumes more time.

As an example of this branch-prediction effect, take this (results may differ between systems/compilers):

    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[:1000000//2] = True  # first half True, second half False
    %timeit x[bool_mask]
    # 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[::2] = True  # True and False alternating
    %timeit x[bool_mask]
    # 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    bool_mask = np.zeros(x.shape, dtype=np.bool)
    bool_mask[::2] = True
    np.random.shuffle(bool_mask)  # shuffled
    %timeit x[bool_mask]
    # 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the distribution of True and False critically affects the runtime of boolean masks, even if they contain the same number of Trues! The same effect will be visible for the compress functions.

For integer-array indexing (and likewise np.take) another effect shows up: cache locality. The indices in your case are randomly distributed, so your computer has to do a lot of "RAM" to "processor cache" loads, because it is very unlikely that two successive indices lie near each other.

Compare this:

    idx = np.random.randint(0, 1000000, size=int(1000000/5))
    %timeit x[idx]
    # 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    idx = np.random.randint(0, 1000000, size=int(1000000/5))
    idx = np.sort(idx)  # sort them
    %timeit x[idx]
    # 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

By sorting the indices, the chance that the next value is already in the cache increases dramatically, and this can lead to huge speedups. This is a very important factor if you know that the indices will be sorted (for example, indices created by np.where are sorted, which makes the result of np.where particularly efficient for indexing).
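The sortedness of np.where results can be verified with a quick check (a small sketch with arbitrary example values): np.where scans the mask front to back, so the returned indices are strictly increasing and index x in the same order as the boolean mask would.

```python
import numpy as np

x = np.arange(20, dtype=np.float64)
mask = x % 3 == 0
idx = np.where(mask)[0]           # np.where scans front to back,
assert np.all(np.diff(idx) > 0)   # so the indices come out sorted
assert np.array_equal(x[idx], x[mask])
```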

So it's not simply that integer-array indexing is slower for small arrays and faster for large ones; it depends on many more factors. Both have their use cases and, depending on the circumstances, either can be (significantly) faster than the other.


Let me also talk a little about numba features. First, some general statements:

  • cache won't affect performance; it just avoids recompiling the function. In interactive environments it is essentially useless. It is faster if you package the functions in a module, though.
  • nogil by itself won't provide any speedup. It's faster only if it's called from different threads, because each function execution can release the GIL and then multiple calls can run in parallel.

Otherwise, I don't know how numba implements these functions. However, when you use NumPy functions in numba it can be slower or faster, but even if it's faster it won't be much faster (except maybe for small arrays), because if it could be done faster the NumPy developers would have implemented it as well. My rule of thumb is: if you can do it (vectorized) with NumPy, don't bother with numba. Only if you can't do it with vectorized NumPy functions, or NumPy would use too many temporary arrays, will numba shine!
