Huge numpy speed difference between similar code

Why is there such a big speed difference between the following L2 calculations:

    a = np.arange(1200.0).reshape((-1,3))

    %timeit [np.sqrt((a*a).sum(axis=1))]
    100000 loops, best of 3: 12 µs per loop

    %timeit [np.sqrt(np.dot(x,x)) for x in a]
    1000 loops, best of 3: 814 µs per loop

    %timeit [np.linalg.norm(x) for x in a]
    100 loops, best of 3: 2 ms per loop

All three give the same results, as far as I can see.
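A quick sanity-check sketch (not part of the original question) confirming that the three variants agree numerically:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# Row-wise L2 norms computed three ways.
by_sum = np.sqrt((a * a).sum(axis=1))
by_dot = np.array([np.sqrt(np.dot(x, x)) for x in a])
by_norm = np.array([np.linalg.norm(x) for x in a])

print(np.allclose(by_sum, by_dot) and np.allclose(by_sum, by_norm))  # True
```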

Here is the source code for the numpy.linalg.norm function:

    x = asarray(x)

    # Check the default case first and handle it immediately.
    if ord is None and axis is None:
        x = x.ravel(order='K')
        if isComplexType(x.dtype.type):
            sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag)
        else:
            sqnorm = dot(x, x)
        return sqrt(sqnorm)
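Stripped to the default real 1-D case, each per-row call boils down to the following. The function name here is hypothetical; the point is that the per-call Python work (`asarray`, `ravel`, dtype checks) is repeated 400 times in the list comprehension, while the actual arithmetic on a length-3 vector is trivial:

```python
import numpy as np

def norm_default(x):
    # Rough sketch of np.linalg.norm's default path for a real 1-D input:
    # convert, flatten, then a single dot product plus a square root.
    x = np.asarray(x).ravel(order='K')
    return np.sqrt(np.dot(x, x))

print(norm_default(np.array([3.0, 4.0])))  # 5.0
```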

EDIT: Someone suggested that one version could be parallelized, but I checked, and it is not. All three versions use 12.5% of the CPU (as is usually the case with Python code on my Xeon with 4 physical cores / 8 hardware threads).

1 answer

np.dot usually calls out to a BLAS library function, so its speed depends on which BLAS library your build of numpy is linked against. In general, I would expect it to have a large fixed per-call overhead but to scale well as the array size increases. However, the fact that you call it from a list comprehension (in effect, an ordinary Python for loop) most likely negates any benefit of using BLAS.
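To make the per-call overhead concrete, here is a rough timing sketch (absolute numbers will vary with machine and BLAS build): 400 tiny `np.dot` calls from a Python loop versus one whole-array expression doing the same arithmetic.

```python
import timeit
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# 400 separate dot calls, each on a length-3 vector: dominated by
# per-call Python and dispatch overhead, not arithmetic.
t_loop = timeit.timeit(lambda: [np.dot(x, x) for x in a], number=1000)

# One vectorized call over the whole array: the same work, amortized.
t_vec = timeit.timeit(lambda: (a * a).sum(axis=1), number=1000)

print(f"looped: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```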

If you get rid of the list comprehension and use the axis= kwarg, np.linalg.norm is comparable to your first example, but np.einsum is much faster than both:

    In [1]: %timeit np.sqrt((a*a).sum(axis=1))
    The slowest run took 10.12 times longer than the fastest. This could mean that an intermediate result is being cached
    100000 loops, best of 3: 11.1 µs per loop

    In [2]: %timeit np.linalg.norm(a, axis=1)
    The slowest run took 14.63 times longer than the fastest. This could mean that an intermediate result is being cached
    100000 loops, best of 3: 13.5 µs per loop

    # this is what np.linalg.norm does internally
    In [3]: %timeit np.sqrt(np.add.reduce(a * a, axis=1))
    The slowest run took 34.05 times longer than the fastest. This could mean that an intermediate result is being cached
    100000 loops, best of 3: 10.7 µs per loop

    In [4]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
    The slowest run took 5.55 times longer than the fastest. This could mean that an intermediate result is being cached
    100000 loops, best of 3: 5.42 µs per loop
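The einsum version wins because `'ij,ij->i'` fuses the elementwise multiply and the sum over `j` into a single pass, avoiding the temporary array that `(a*a).sum(axis=1)` allocates. A small sketch verifying it matches the axis-based norm:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# 'ij,ij->i': multiply elementwise and sum over j in one fused pass.
fast = np.sqrt(np.einsum('ij,ij->i', a, a))

print(np.allclose(fast, np.linalg.norm(a, axis=1)))  # True
```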
