Optimizing a double loop in Python

I am trying to optimize the following loop:

    def numpy(nx, nz, c, rho):
        # note: a and b are globals here, allocated before the call
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum(c*rho[ix-1:ix+3, iz])
                b[ix, iz] = sum(c*rho[ix-2:ix+2, iz])
        return a, b

I tried different solutions and found, using numba to calculate the amount of the product, leads to better results:

    import numpy as np
    import numba as nb
    import time

    @nb.autojit
    def sum_opt(arr1, arr2):
        s = arr1[0]*arr2[0]
        for i in range(1, len(arr1)):
            s += arr1[i]*arr2[i]
        return s

    def numba1(nx, nz, c, rho):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return a, b

    @nb.autojit
    def numba2(nx, nz, c, rho):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return a, b

    nx = 1024
    nz = 256
    rho = np.random.rand(nx, nz)
    c = np.random.rand(4)
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))

    ti = time.clock()
    a, b = numpy(nx, nz, c, rho)
    print 'Time numpy : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    a, b = numba1(nx, nz, c, rho)
    print 'Time numba1 : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    a, b = numba2(nx, nz, c, rho)
    print 'Time numba2 : ' + `round(time.clock() - ti, 4)`

This leads to:

    Time numpy : 4.1595
    Time numba1 : 0.6993
    Time numba2 : 1.0135

Using the numba-compiled sum (sum_opt) works very well. But I am wondering why the numba version of the double-loop function (numba2) leads to a slower runtime. I tried using jit instead of autojit, specifying the argument types, but that was worse.

I also noticed that looping over the smallest dimension first is slower than looping over the largest dimension first. Is there an explanation for this?

I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.

In other parts of my code, I used numba and numpy slicing to replace all explicit loops, but in this particular case I cannot manage it.

Any ideas?

EDIT

Thanks for your comments. I worked a bit on this issue:

    import numba as nb
    import numpy as np
    from scipy import signal
    import time

    @nb.jit(['float64(float64[:], float64[:])'], nopython=True)
    def sum_opt(arr1, arr2):
        s = arr1[0]*arr2[0]
        for i in xrange(1, len(arr1)):
            s += arr1[i]*arr2[i]
        return s

    @nb.autojit
    def numba1(nx, nz, c, rho, a, b):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return a, b

    @nb.jit(nopython=True)
    def numba2(nx, nz, c, rho, a, b):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return a, b

    @nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
    def numba3a(nx, nz, c, rho, a):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
        return a

    @nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
    def numba3b(nx, nz, c, rho, b):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return b

    def convol(nx, nz, c, aa, bb):
        s1 = rho[1:nx-1, 2:nz-3]
        s2 = rho[0:nx-2, 2:nz-3]
        kernel = c[:, None][::-1]
        aa[2:nx-3, 2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
        bb[2:nx-3, 2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
        return aa, bb

    nx = 1024
    nz = 256
    rho = np.random.rand(nx, nz)
    c = np.random.rand(4)
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))

    ti = time.clock()
    for i in range(1000):
        a, b = numba1(nx, nz, c, rho, a, b)
    print 'Time numba1 : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    for i in range(1000):
        a, b = numba2(nx, nz, c, rho, a, b)
    print 'Time numba2 : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    for i in range(1000):
        a = numba3a(nx, nz, c, rho, a)
        b = numba3b(nx, nz, c, rho, b)
    print 'Time numba3 : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    for i in range(1000):
        a, b = convol(nx, nz, c, a, b)
    print 'Time convol : ' + `round(time.clock() - ti, 4)`

Your solution is very elegant, Divakar, but I have to call this function many times in my code. So, for 1000 iterations, this leads to:

    Time numba1 : 3.2487
    Time numba2 : 3.7012
    Time numba3 : 3.2088
    Time convol : 22.7696

autojit and jit are very close. However, when using jit, it seems important to specify all argument types.

I don't know if there is a way to specify the argument types in the jit decorator when the function has multiple outputs. Anyone?
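For what it's worth, here is a minimal sketch of one way to declare a two-output signature, assuming a numba version that exposes numba.types.UniTuple (numba2_typed is just an illustrative name, with the weighted sums inlined to keep it self-contained):

    import numba as nb

    # UniTuple(T, 2) declares a homogeneous 2-tuple return type, so a
    # function returning (a, b) can be given an explicit signature.
    sig = nb.types.UniTuple(nb.float64[:, :], 2)(
        nb.int64, nb.int64, nb.float64[:],
        nb.float64[:, :], nb.float64[:, :], nb.float64[:, :])

    @nb.jit(sig, nopython=True)
    def numba2_typed(nx, nz, c, rho, a, b):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                s1 = 0.0
                s2 = 0.0
                for k in range(len(c)):
                    s1 += c[k]*rho[ix-1+k, iz]
                    s2 += c[k]*rho[ix-2+k, iz]
                a[ix, iz] = s1
                b[ix, iz] = s2
        return a, b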

For the moment, I have not found a better solution than using numba. New ideas are welcome!

+8
performance python loops numpy numba
4 answers

Numba is very fast in nopython mode, but with your code it has to fall back to object mode, which is a lot slower. You can see this happen by passing nopython=True to the jit decorator.

It does compile in nopython mode (at least with Numba version 0.18.2) if you pass a and b as arguments:

    import numba as nb

    @nb.jit(nopython=True)
    def sum_opt(arr1, arr2):
        s = arr1[0]*arr2[0]
        for i in range(1, len(arr1)):
            s += arr1[i]*arr2[i]
        return s

    @nb.jit(nopython=True)
    def numba2(nx, nz, c, rho, a, b):
        for ix in range(2, nx-3):
            for iz in range(2, nz-3):
                a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
                b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
        return a, b
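Called like this, with a and b allocated up front and passed in (array setup taken from the question):

    import numpy as np

    nx, nz = 1024, 256
    rho = np.random.rand(nx, nz)
    c = np.random.rand(4)
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))

    a, b = numba2(nx, nz, c, rho, a, b)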

Note that the release notes mention that autojit is deprecated in favor of jit.
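For reference, the lazy form of jit (no explicit signature) gives the same specialize-on-first-call behavior that autojit used to provide; a minimal sketch (dot4 is just an illustrative name):

    import numba as nb

    # A bare @nb.jit with no signature compiles lazily on the first call
    # and specializes on the argument types: what autojit used to do.
    @nb.jit(nopython=True)
    def dot4(c, r):
        s = 0.0
        for i in range(len(c)):
            s += c[i]*r[i]
        return s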


Apparently you are still not satisfied. So how about a solution based on stride_tricks?

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    def stridetrick_einsum(c, rho, out):
        ws = len(c)  # window size
        nx, nz = rho.shape
        shape = (nx-ws+1, ws, nz)
        strides = (rho.strides[0],) + rho.strides
        rho_windowed = as_strided(rho, shape, strides)
        np.einsum('j,ijk->ik', c, rho_windowed, out=out)

    a = np.zeros((nx, nz))
    stridetrick_einsum(c, rho[1:-1, 2:-3], a[2:-3, 2:-3])
    b = np.zeros((nx, nz))
    stridetrick_einsum(c, rho[0:-2, 2:-3], b[2:-3, 2:-3])

What's more, since a and b are obviously almost the same, you can calculate them in one go and then copy the values over:

    a = np.zeros((nx, nz))
    stridetrick_einsum(c, rho[:-1, 2:-3], a[1:-3, 2:-3])
    b = np.zeros((nx, nz))
    b[2:-3, 2:-3] = a[1:-4, 2:-3]  # b's window is a's, shifted down by one row
    a[1, 2:-3] = 0.0               # row 1 of a was computed only for reuse in b
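Either way, a quick sanity check against the plain loops from the question confirms that the stride/copy tricks produce the same arrays:

    # Reference implementation: the original (slow) double loop.
    a_ref = np.zeros((nx, nz))
    b_ref = np.zeros((nx, nz))
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a_ref[ix, iz] = np.sum(c*rho[ix-1:ix+3, iz])
            b_ref[ix, iz] = np.sum(c*rho[ix-2:ix+2, iz])
    print np.allclose(a, a_ref), np.allclose(b, b_ref)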
+1

Basically, you are performing a 2D convolution, with one slight modification: your kernel is not reversed, as convolution usually requires. So, we need to do two things to use signal.convolve2d for our case -

  • Slice the input array rho to select the part that is used in the original loopy version of your code. This will be the input data to the convolution.
  • Reverse the kernel c and feed it, together with the sliced data, to signal.convolve2d.

Please note that this needs to be done for a and b separately.
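To see why the kernel has to be reversed, here is a 1-D toy check (convolution flips the kernel internally, so pre-flipping it recovers the plain weighted sum the loop computes; w and v are just illustrative names):

    import numpy as np

    w = np.array([1., 2., 3., 4.])
    v = np.array([10., 20., 30., 40.])
    print np.sum(w*v)                       # weighted sum, as in the loop
    print np.convolve(v, w[::-1], 'valid')  # convolving with the reversed kernel matches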

Here's the implementation -

    import numpy as np
    from scipy import signal

    # Slices for convolutions to get a and b respectively
    s1 = rho[1:nx-1, 2:nz-3]
    s2 = rho[0:nx-2, 2:nz-3]
    kernel = c[:, None][::-1]  # convolution kernel: column vector, reversed

    # Setup output arrays and fill them with convolution results
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))
    a[2:nx-3, 2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    b[2:nx-3, 2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')

If you don't need the extra zeros around the boundaries of the output arrays, you could simply use the outputs of signal.convolve2d as they are, which should further boost the performance.
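A sketch of that variant (a_core and b_core are my names; entry a_core[i, j] corresponds to a[i+2, j+2], and likewise for b_core):

    # 'valid'-mode outputs used directly, without the zero-padded borders.
    a_core = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    b_core = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')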

Runtime tests

    In [532]: %timeit loop_based(nx, nz, c, rho)
    1 loops, best of 3: 1.52 s per loop

    In [533]: %timeit numba1(nx, nz, c, rho)
    1 loops, best of 3: 282 ms per loop

    In [534]: %timeit numba2(nx, nz, c, rho)
    1 loops, best of 3: 509 ms per loop

    In [535]: %timeit conv_based(nx, nz, c, rho)
    10 loops, best of 3: 15.5 ms per loop

Thus, for the actual input data, the proposed convolution-based approach is about 100x faster than the loopy code and about 20x faster than the fastest numba approach, numba1.

+7

You are not making full use of numpy's capabilities. An unexpected way to attack your problem is with cumulative sums, like the following. (Note that the difference of two cumulative sums gives a plain, unweighted window sum, so this works as written when all four entries of c are equal, i.e. when c*rho broadcasts; for distinct weights, see the shifted-slice variant after the example output.)

    cs = np.zeros((nx+1, nz))
    np.cumsum(c*rho, axis=0, out=cs[1:])  # prefix sums along the rows
    aa = cs[5:, 2:-3] - cs[1:-4, 2:-3]    # 4-row window sums -> central part of a
    bb = cs[4:-1, 2:-3] - cs[:-5, 2:-3]   # 4-row window sums -> central part of b

aa will now hold the central non-zero part of your array a:

    >>> a[:5, :5]
    array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  2.31296595,  2.15743042,  2.5853117 ],
           [ 0.        ,  0.        ,  2.02697233,  2.83191016,  2.58819583],
           [ 0.        ,  0.        ,  2.4086584 ,  2.45175615,  2.19628507]])
    >>> aa[:3, :3]
    array([[ 2.31296595,  2.15743042,  2.5853117 ],
           [ 2.02697233,  2.83191016,  2.58819583],
           [ 2.4086584 ,  2.45175615,  2.19628507]])

and similarly for bb and b.
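If the four weights in c do differ, a variant in the same vectorized spirit (my addition, not part of the original answer) builds the windows from four shifted row slices instead of a cumulative sum, filling the same central blocks a[2:nx-3, 2:nz-3] and b[2:nx-3, 2:nz-3]:

    # Weighted windows for a general length-4 c: sum four shifted slices.
    aa_w = (c[0]*rho[1:nx-4] + c[1]*rho[2:nx-3] +
            c[2]*rho[3:nx-2] + c[3]*rho[4:nx-1])[:, 2:-3]
    bb_w = (c[0]*rho[0:nx-5] + c[1]*rho[1:nx-4] +
            c[2]*rho[2:nx-3] + c[3]*rho[3:nx-2])[:, 2:-3]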

On my system, with your example input, the cumsum-based code above is more than 300x faster than your numpy function. Judging from your timings, it should also be one to two orders of magnitude faster than numba.

+6

As explained on the Continuum blog, autojit compiles just-in-time, while jit compiles ahead of time:

Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator

This means that in many cases autojit allows the compiler to make a more educated guess about the code it is compiling, and to optimize accordingly. I know that just-in-time compiling ahead of time sounds contradictory, but hey.

But I am wondering why the numba version of the double-loop function (numba2) leads to a slower runtime

Numba does not magically speed up arbitrary function calls. Although I cannot say for sure, my guess is that the overhead of JIT compilation outweighs its benefit (if there is any benefit at all).

I also noticed that looping over the smallest dimension first is slower than looping over the largest dimension first. Is there any explanation?

This is most likely due to cache misses. A two-dimensional array is allocated as one contiguous chunk of memory of size rows * columns. What gets fetched into the cache is based on a combination of temporal locality (what was used recently) and spatial locality (what lies close in memory to what was used), i.e. whatever is presumed likely to be used next.

When you iterate over the rows first, you iterate in the order in which the data lies in memory. When you iterate over the columns first, every step "skips" a full row width in memory, which makes it much less likely that the memory location you access is already in the cache.

    2D array:  [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]]
    In memory:  1 2 3 4 5 6 7 8 9

Suppose, as a simplistic cache-fetch policy, that the cache fetches (and can hold) three consecutive memory cells at a time; an x below marks a cache miss.

Iterating rows first:

    In memory:   1 2 3 | 4 5 6 | 7 8 9
    Accessed:    1 2 3 | 4 5 6 | 7 8 9
    Cache miss:  x - - | x - - | x - -

Iterating columns first:

    In memory:   1 2 3 | 4 5 6 | 7 8 9
    Accessed:    1 4 7 | 2 5 8 | 3 6 9
    Cache miss:  x x x | x x x | x x x
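The effect is easy to reproduce with two numba-compiled summation loops; a quick sketch (function names are mine, and exact timings are machine-dependent):

    import numba as nb
    import numpy as np
    import time

    @nb.jit(nopython=True)
    def sum_rows_first(x):
        s = 0.0
        for i in range(x.shape[0]):      # walks memory in order
            for j in range(x.shape[1]):
                s += x[i, j]
        return s

    @nb.jit(nopython=True)
    def sum_cols_first(x):
        s = 0.0
        for j in range(x.shape[1]):      # jumps a full row every step
            for i in range(x.shape[0]):
                s += x[i, j]
        return s

    x = np.random.rand(5000, 5000)
    sum_rows_first(x); sum_cols_first(x)  # warm up the JIT first

    ti = time.clock()
    sum_rows_first(x)
    print 'rows first    : ' + `round(time.clock() - ti, 4)`

    ti = time.clock()
    sum_cols_first(x)
    print 'columns first : ' + `round(time.clock() - ti, 4)`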
+2
