Why is numpy.absolute() so slow?

I need to optimize a script that makes heavy use of computing the L1 norm of vectors. As we know, the L1 norm in this case is simply the sum of the absolute values. While measuring how fast NumPy is at this task, I found something strange: adding up all the vector elements is about 3 times faster than taking the absolute value of every vector element. This is a surprising result, because addition is a fairly complex operation compared to the absolute value, which only requires zeroing out every 32nd bit of the data block (i.e. the sign bit of each element, assuming float32).

Why is addition 3 times faster than such a simple bitwise operation?

    import numpy as np
    a = np.random.rand(10000000)

    %timeit np.sum(a)
    13.9 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit np.abs(a)
    41.2 ms ± 92.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
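To make the "zeroing the sign bit" claim concrete, here is a small check (separate from the timings above) showing that clearing the most significant bit of each float32 element reproduces np.abs:

    import numpy as np

    x = np.random.rand(5).astype(np.float32) - 0.5
    # The sign of a float32 lives in its most significant bit; masking it off
    # is the bit-level equivalent of taking the absolute value.
    masked = (x.view(np.uint32) & np.uint32(0x7FFFFFFF)).view(np.float32)
    print(masked)
    print(np.abs(x))   # same values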
optimization numpy absolute-value addition
2 answers

There are a few things going on here. sum returns a scalar, while abs returns an array. So even if adding two numbers and taking the absolute value of a number were equally fast, abs would still be slower because it needs to create an output array. And it has to touch twice as many elements (reading the input plus writing the output).

So you cannot deduce anything from these timings about the speed of addition versus a bitwise operation.

However, you can check whether it is faster to add something to each value of the array versus taking the absolute value of each value:

    %timeit a + 0.1
    9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit abs(a)
    9.98 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or compare the sum plus a memory allocation against the absolute value:

    %timeit np.full_like(a, 1); np.sum(a)
    13.4 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %timeit abs(a)
    9.64 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In case you want to compute the norm faster, you can try numba (or Cython, or write a C or Fortran routine yourself), so that you avoid the memory allocation entirely:

    import numba as nb

    @nb.njit
    def sum_of_abs(arr):
        sum_ = 0.
        for item in arr:
            sum_ += abs(item)
        return sum_

    sum_of_abs(a)  # call it once so the JIT compilation kicks in

    %timeit sum_of_abs(a)
    2.44 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
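As a quick sanity check (assuming the same array a as above), the jitted loop should agree with NumPy's own L1 norm up to floating-point rounding:

    import numpy as np

    # The summation order differs, so compare with a tolerance rather than ==.
    assert np.isclose(sum_of_abs(a), np.linalg.norm(a, 1))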

np.sum returns a scalar. np.abs returns a new array of the same size, and allocating memory for that new array is what takes most of the time. Compare:

    >>> from timeit import timeit
    >>> timeit("np.abs(a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
    3.565487278989167
    >>> timeit("np.abs(a, out=a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
    0.9392949139873963

The out=a argument tells NumPy to put the result into the same array a, overwriting the old data there. Hence the speedup.
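For illustration (a toy array, not the benchmark above), the in-place behaviour looks like this:

    import numpy as np

    x = np.array([-1.0, 2.0, -3.0])
    np.abs(x, out=x)   # no new array is allocated; x is overwritten in place
    print(x)           # [1. 2. 3.]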

The sum is even a bit faster:

    >>> timeit("np.sum(a)", "import numpy as np; a = np.random.rand(10000000)", number=100)
    0.6874654769926565

but then, it does not need as much memory access for writing the output.

If you do not want to overwrite a, providing another array for the output of abs is possible, as long as you repeatedly take abs of arrays of the same type and size:

    b = np.empty_like(a)   # done once, outside the loop
    np.abs(a, out=b)
    np.sum(b)

runs in about half the time of np.linalg.norm(a, 1).
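Wrapped up for reuse, that pattern could look like the following sketch (the function name and the scratch buffer are just for illustration):

    import numpy as np

    def l1_norm_into(a, buf):
        """Compute sum(|a|) using a preallocated scratch buffer of a's shape and dtype."""
        np.abs(a, out=buf)     # reuse buf instead of allocating a new array
        return np.sum(buf)

    a = np.random.rand(10000000)
    buf = np.empty_like(a)     # allocated once, outside any hot loop
    print(l1_norm_into(a, buf))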

For reference, np.linalg.norm computes the L1 norm as

 add.reduce(abs(x), axis=axis, keepdims=keepdims) 

which includes memory allocation for the new abs(x) array.
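You can check that this is indeed equivalent (up to rounding) with a quick comparison:

    import numpy as np

    x = np.random.rand(1000) - 0.5
    print(np.add.reduce(np.abs(x)))   # what np.linalg.norm does internally
    print(np.linalg.norm(x, 1))       # same value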


Ideally, one would be able to compute the sum (or max, or min) of all absolute values (or the results of some other ufunc) without writing the whole intermediate result out to RAM and then reading it back for the sum/max/min. There has been some discussion in the NumPy repo, most recently in "add max_abs ufunc", but it has not reached implementation.

The ufunc.reduce method is available for functions with two inputs such as add or logaddexp, but there is no addabs function (x, y : x + abs(y)) to reduce with.
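For illustration only, such an addabs could be emulated with np.frompyfunc, but the resulting ufunc runs element-by-element through Python with object dtype, so it does not deliver the hoped-for speed:

    import numpy as np

    a = np.random.rand(1000) - 0.5        # includes negative values

    # Hypothetical addabs built from a Python lambda; proof of concept only.
    addabs = np.frompyfunc(lambda acc, x: acc + abs(x), 2, 1)

    # reduce seeds the accumulator with the first element, so prepend 0.0
    # to make sure every real element goes through abs().
    l1 = addabs.reduce(np.concatenate(([0.0], a)))

    print(l1, np.sum(np.abs(a)))          # the two values agree (up to rounding)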

