Numba: Manual loop faster than a += c * b with numpy arrays?

I would like to implement "daxpy" (add a scalar multiple of one vector to another vector and assign the result to the first) on numpy arrays using numba. Running the test below, I noticed that writing the loop myself was much faster than doing a += c * b.

I did not expect this. What is the reason for this behavior?

    import numpy as np
    from numba import jit

    x = np.random.random(int(1e6))
    o = np.random.random(int(1e6))
    c = 3.4

    @jit(nopython=True)
    def test1(a, b, c):
        a += c * b
        return a

    @jit(nopython=True)
    def test2(a, b, c):
        for i in range(len(a)):
            a[i] += c * b[i]
        return a

    %timeit -n100 -r10 test1(x, o, c)
    >>> 100 loops, best of 10: 2.48 ms per loop

    %timeit -n100 -r10 test2(x, o, c)
    >>> 100 loops, best of 10: 1.2 ms per loop
1 answer

One thing to keep in mind: the "manual loop" in numba is very fast, essentially the same as the C loop that numpy operations use internally.

In the first example there are two operations: a temporary array (c * b) is allocated and computed, and then that temporary array is added to a. In the second example, both calculations happen in a single loop with no intermediate result.
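As a rough sketch (plain Python for illustration only, not literally what numba generates), test1 behaves like two separate passes over the data with a temporary array in between, while test2 is a single fused pass:

    import numpy as np

    def test1_equivalent(a, b, c):
        # Pass 1: allocate a temporary array and fill it with c * b.
        tmp = np.empty_like(b)
        for i in range(len(b)):
            tmp[i] = c * b[i]
        # Pass 2: add the temporary into a.
        for i in range(len(a)):
            a[i] += tmp[i]
        return a

    def test2_equivalent(a, b, c):
        # Single pass: no temporary allocation, one trip through memory.
        for i in range(len(a)):
            a[i] += c * b[i]
        return a

The extra allocation and the second pass over a million elements account for the difference in the timings above.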

Theoretically, numba could fuse the loops and optimize #1 to do the same thing as #2, but it doesn't seem to. If you just want to optimize elementwise numpy operations, numexpr may also be worth a look, since it was designed for exactly this, although it probably won't beat an explicitly fused loop.

    In [17]: import numexpr as ne

    In [18]: %timeit -r10 test2(x, o, c)
    1000 loops, best of 10: 1.36 ms per loop

    In [19]: %timeit ne.evaluate('x + o * c', out=x)
    1000 loops, best of 3: 1.43 ms per loop
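Note that out=x in the numexpr call writes the result directly into x, so this version also avoids materializing a separate full-size array for the whole expression.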
