Why are Python loops over slices of numpy arrays faster than fully vectorized operations

I need to create a logical mask by thresholding a 3D data array: the mask should be True wherever the data is below the lower acceptable limit or above the upper acceptable limit, and False otherwise. In pseudo-code:

 mask = (data < low) or (data > high) 
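For actual numpy arrays, Python's or raises a "truth value of an array is ambiguous" error, so the elementwise version uses | or np.logical_or. A minimal runnable sketch, with illustrative shapes and limits:

import numpy as np

data = np.random.random((10, 1000, 1000))  # illustrative data
low, high = 0.2, 0.8                       # illustrative limits

# Elementwise OR of two boolean arrays: use `|` (or np.logical_or),
# not Python's `or`, which cannot handle whole arrays.
mask = (data < low) | (data > high)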

I have two versions of the code that perform this operation: one works on the whole 3D array at once with numpy, while the other loops over slices of the array. Contrary to my expectations, the second method turns out to be faster than the first. Why?

In [1]: import numpy as np

In [2]: import sys

In [3]: print(sys.version)
3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.14.0

In [5]: arr = np.random.random((10, 1000, 1000))

In [6]: def method1(arr, low, high):
   ...:     """ Fully vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     np.greater_equal(arr, high, out)
   ...:     np.logical_or(out, arr < low, out)
   ...:     return out
   ...:

In [7]: def method2(arr, low, high):
   ...:     """ Partially vectorized computations """
   ...:     out = np.empty(arr.shape, dtype=np.bool)
   ...:     for k in range(arr.shape[0]):
   ...:         a = arr[k]
   ...:         o = out[k]
   ...:         np.greater_equal(a, high, o)
   ...:         np.logical_or(o, a < low, o)
   ...:     return out
   ...:

First of all, make sure that both methods give the same results:

In [8]: np.all(method1(arr, 0.2, 0.8) == method2(arr, 0.2, 0.8))
Out[8]: True

And now some timing tests:

In [9]: %timeit method1(arr, 0.2, 0.8)
14.4 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit method2(arr, 0.2, 0.8)
11.5 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What's going on here?


EDIT 1: Similar behavior is observed in an older environment:

In [3]: print(sys.version)
2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

In [4]: print(np.__version__)
1.11.3

In [9]: %timeit method1(arr, 0.2, 0.8)
100 loops, best of 3: 14.3 ms per loop

In [10]: %timeit method2(arr, 0.2, 0.8)
100 loops, best of 3: 13 ms per loop
2 answers

Outperforming both methods

In method one, you access the whole array twice. If it does not fit in the cache, the data is read from RAM twice, which reduces performance. It is also possible that temporary arrays are created, as noted in the comments.

The second method is more cache friendly, since you access a smaller part of the array twice, which is likely to fit in the cache. The downsides are the slow Python loop and the extra function calls, which are also rather slow.
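For intuition about the cache argument, here is a hedged sketch (not from the original answer) of method2 generalized to an adjustable chunk size along the first axis; the default chunk of 2 slices is an arbitrary assumption and would need tuning against the actual cache sizes. Either way, the Python-level loop and call overhead remains, which is what motivates compiling the loop below.

import numpy as np

def method2_chunked(arr, low, high, chunk=2):
    """Hypothetical variant of method2: process `chunk` slices at a time.

    Smaller chunks keep the working set closer to the cache size, at the
    cost of more Python-level loop iterations and function calls.
    """
    out = np.empty(arr.shape, dtype=bool)
    for k in range(0, arr.shape[0], chunk):
        a = arr[k:k + chunk]
        o = out[k:k + chunk]
        np.greater_equal(a, high, o)
        np.logical_or(o, a < low, o)
    return out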

To get good performance here, it is recommended to compile the computation, which can be done with Cython or Numba. Since the Cython version is more work (annotations, the need for a separate compiler), I will show how to do it with Numba.

import numba as nb

@nb.njit(fastmath=True, cache=True)
def method3(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out

Using arr = np.random.random((10, 1000, 1000)), this is about twice as fast as your method1 and about 50 percent faster than your method2 on my PC (Core i7-4771, Python 3.5, Windows).
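To reproduce the comparison on your own machine, a hedged timing sketch, assuming the method3 defined above is in scope; the first call includes Numba's JIT compilation time, so warm the function up before timing:

import timeit

import numpy as np

arr = np.random.random((10, 1000, 1000))

method3(arr, 0.2, 0.8)  # warm-up call: triggers (and caches) the JIT compilation

# Average runtime per call over 100 repetitions
t = timeit.timeit(lambda: method3(arr, 0.2, 0.8), number=100) / 100
print("method3: %.1f ms per call" % (t * 1e3))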

This is only a simple example; on more complex code, where you can make use of SIMD and of parallel processing (which is also very easy to use), the gain can be much larger. For non-compiled code, vectorization is usually, but not always (as shown here), the best you can do, but it always leads to this kind of cache behaviour, which can cause suboptimal performance if the chunks of data you operate on do not fit at least in the L3 cache. On some other problems there is also a performance hit if the data cannot fit in the much smaller L1 or L2 cache. Another advantage is the automatic inlining of small njit-ed functions inside an njit-ed function that calls them.
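As an illustration of the parallel-processing point, a hedged sketch of how the same loop might be parallelized with Numba's prange (my assumption of the approach, not code from the answer; the gain depends on core count and memory bandwidth):

import numba as nb
import numpy as np

@nb.njit(parallel=True, fastmath=True)
def method3_parallel(arr, low, high):
    out = np.empty(arr.shape, dtype=nb.boolean)
    # nb.prange distributes the outer loop iterations across threads
    for i in nb.prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            for k in range(arr.shape[2]):
                out[i, j, k] = arr[i, j, k] < low or arr[i, j, k] > high
    return out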


In my own tests, the performance difference was even more noticeable than in your question. The difference remained clearly observable when I increased the second and third dimensions of the arr data. It also remained observable after commenting out one of the two comparison functions (greater_equal or logical_or), which means we can rule out some strange interaction between the two.

By changing the implementation of the two methods to the following, I could significantly reduce the observed performance difference (but not completely eliminate it):

def method1(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    np.greater_equal(arr, high, out)
    np.logical_or(out, arr < low, out)
    return out

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    high = np.ones_like(arr) * high
    low = np.ones_like(arr) * low
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = high[k]
        l = low[k]
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out

I suspect that, when high or low is supplied as a scalar to these numpy functions, they may internally first create an array of the full regular shape filled with that scalar. When we do this manually outside the functions, only once for the full shape in both cases, the performance difference becomes much less noticeable. This implies that, for some reason (maybe cache?), creating one such large array filled with a single constant may be less efficient than creating k smaller arrays with the same constant (which is what the implementation of method2 in the original question does automatically).
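One way to probe this hypothesis, hedged and with illustrative sizes, is to time the allocation directly: one constant array of the full 3D shape versus ten slice-shaped ones (this only measures the array creation itself, not whatever numpy actually does internally):

import timeit

import numpy as np

arr = np.random.random((10, 1000, 1000))
high = 0.8

# One constant array covering the full 3D shape
t_full = timeit.timeit(lambda: np.full_like(arr, high), number=50) / 50

# Ten constant arrays, one per 2D slice (roughly what method2 would trigger)
def per_slice():
    return [np.full_like(arr[k], high) for k in range(arr.shape[0])]

t_slices = timeit.timeit(per_slice, number=50) / 50

print("full shape: %.2f ms" % (t_full * 1e3))
print("per slice : %.2f ms" % (t_slices * 1e3))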


Note: in addition to narrowing the performance gap, this modification also significantly degrades the performance of both methods (affecting the second method more strongly than the first). So, while this may give some idea of where the problem lies, it does not seem to explain everything.


EDIT

Here is a new version of method2, which now manually creates the smaller arrays inside the loop on every iteration, which is what I suspect numpy does internally in the original implementation from the question:

def method2(arr, low, high):
    out = np.empty(arr.shape, dtype=np.bool)
    for k in range(arr.shape[0]):
        a = arr[k]
        o = out[k]
        h = np.full_like(a, high)
        l = np.full_like(a, low)
        np.greater_equal(a, h, o)
        np.logical_or(o, a < l, o)
    return out

This version is indeed much faster than the previous one (confirming that creating multiple smaller arrays inside the loop is more efficient than one large outside the loop), but still slower than the original implementation in the question.

Assuming that these numpy functions really do convert scalar bounds into such arrays internally, the remaining performance difference between this last function and the one in the question may come down to creating the arrays in Python (my implementation) versus doing it natively in C (the original implementation).
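If the cost really is in materializing those constant arrays, one hedged way to check (an assumption on my part, not part of the original reasoning) is to pass read-only broadcast views instead, which allocate nothing:

import numpy as np

def method1_views(arr, low, high):
    out = np.empty(arr.shape, dtype=bool)
    # broadcast_to returns read-only views of the scalars, so no large
    # constant arrays are actually allocated or filled
    h = np.broadcast_to(high, arr.shape)
    l = np.broadcast_to(low, arr.shape)
    np.greater_equal(arr, h, out)
    np.logical_or(out, arr < l, out)
    return out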

