Speed up Python / Cython loops.

I am trying to make a loop in Python run as fast as possible, so I dived into NumPy and Cython. Here is the Python source code:

    import numpy as np

    def calculate_bsf_u_loop(uvel, dy, dz):
        """
        Calculate barotropic stream function from zonal velocity

        uvel (t,z,y,x)
        dy   (y,x)
        dz   (t,z,y,x)

        bsf  (t,y,x)
        """
        nt = uvel.shape[0]
        nz = uvel.shape[1]
        ny = uvel.shape[2]
        nx = uvel.shape[3]

        bsf = np.zeros((nt, ny, nx))

        for jn in range(0, nt):
            for jk in range(0, nz):
                for jj in range(0, ny):
                    for ji in range(0, nx):
                        bsf[jn, jj, ji] = bsf[jn, jj, ji] + uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]

        return bsf
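For reference, I time the function roughly as in the sketch below; the random test arrays and the scaled-down shapes are placeholders so the example fits in memory, not my actual model output:

    import time
    import numpy as np

    # Placeholder test data: shapes scaled down from (12, 75, 559, 1442) purely
    # for illustration; the real run uses the full-size model fields.
    nt, nz, ny, nx = 2, 15, 112, 288
    uvel = np.random.rand(nt, nz, ny, nx)
    dy = np.random.rand(ny, nx)
    dz = np.random.rand(nt, nz, ny, nx)

    t0 = time.perf_counter()
    bsf = calculate_bsf_u_loop(uvel, dy, dz)
    print("elapsed: %.2f s" % (time.perf_counter() - t0))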

This is just a sum over the k index. The array sizes are nt = 12, nz = 75, ny = 559, nx = 1442, so roughly 725 million elements. This took 68 seconds. Next I did it in Cython as

    import numpy as np
    cimport numpy as np
    cimport cython

    @cython.boundscheck(False)  # turn off bounds-checking for entire function
    @cython.wraparound(False)   # turn off negative index wrapping for entire function
    ## Use cpdef instead of def
    ## Define types for arrays
    cpdef calculate_bsf_u_loop(np.ndarray[np.float64_t, ndim=4] uvel,
                               np.ndarray[np.float64_t, ndim=2] dy,
                               np.ndarray[np.float64_t, ndim=4] dz):
        """
        Calculate barotropic stream function from zonal velocity

        uvel (t,z,y,x)
        dy   (y,x)
        dz   (t,z,y,x)

        bsf  (t,y,x)
        """
        ## cdef the constants
        cdef int nt = uvel.shape[0]
        cdef int nz = uvel.shape[1]
        cdef int ny = uvel.shape[2]
        cdef int nx = uvel.shape[3]

        ## cdef loop indices
        cdef ji, jj, jk, jn

        ## cdef the output array. Note that the cdef uses the Cython type,
        ## but the np.zeros call uses the Python (NumPy) dtype
        cdef np.ndarray[np.float64_t, ndim=3] bsf = np.zeros([nt, ny, nx], dtype=np.float64)

        for jn in xrange(0, nt):
            for jk in xrange(0, nz):
                for jj in xrange(0, ny):
                    for ji in xrange(0, nx):
                        bsf[jn, jj, ji] += uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]

        return bsf

and it took 49 seconds. However, replacing the loop with

    for jn in range(0, nt):
        for jk in range(0, nz):
            bsf[jn, :, :] = bsf[jn, :, :] + uvel[jn, jk, :, :] * dz[jn, jk, :, :] * dy[:, :]

takes only 0.29 seconds! Unfortunately, I cannot do this in my full code.

Why is the NumPy slicing so much faster than the Cython loop? I thought NumPy was fast because it is Cython under the hood. Shouldn't they be the same speed?

As you can see, I turned off bounds checking in the Cython code, and I also compiled using "fast math". However, this gives only a slight speedup. Is there any way to make the loop as fast as the NumPy slicing, or will a loop always be slower than slicing?
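For completeness, my build script looks roughly like the sketch below; the module name and the exact compiler flags are from memory (assuming gcc/clang), so treat them as assumptions rather than my verbatim setup:

    # setup.py -- approximate build script; "bsf_cython" and the compiler flags are assumptions
    from setuptools import Extension, setup
    from Cython.Build import cythonize
    import numpy as np

    ext = Extension(
        "bsf_cython",
        sources=["bsf_cython.pyx"],
        include_dirs=[np.get_include()],            # needed for "cimport numpy"
        extra_compile_args=["-O3", "-ffast-math"],  # "fast math" on gcc/clang
    )

    setup(ext_modules=cythonize(ext))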

Any help is much appreciated! / Joachim

Tags: performance, python, arrays, numpy, cython
1 answer

This is a job for numpy.einsum, given that you are doing element-wise multiplication followed by a sum-reduction along the second axis of a 4D array, which is essentially what numpy.einsum does very efficiently. To solve your case, you can use numpy.einsum in two ways -

    bsf = np.einsum('ijkl,ijkl,kl->ikl', uvel, dz, dy)
    bsf = np.einsum('ijkl,ijkl->ikl', uvel, dz) * dy

Runtime tests and output verification -

    In [100]: # Take a (1/5)th of the original input shapes
         ...: original_shape = [12, 75, 559, 1442]
         ...: m,n,p,q = (np.array(original_shape)/5).astype(int)
         ...:
         ...: # Generate random arrays with given shapes
         ...: uvel = np.random.rand(m,n,p,q)
         ...: dy = np.random.rand(p,q)
         ...: dz = np.random.rand(m,n,p,q)
         ...:

    In [101]: bsf = calculate_bsf_u_loop(uvel,dy,dz)

    In [102]: print(np.allclose(bsf,np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)))
    True

    In [103]: print(np.allclose(bsf,np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy))
    True

    In [104]: %timeit calculate_bsf_u_loop(uvel,dy,dz)
    1 loops, best of 3: 2.16 s per loop

    In [105]: %timeit np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)
    100 loops, best of 3: 3.94 ms per loop

    In [106]: %timeit np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy
    100 loops, best of 3: 3.96 ms per loop
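If the einsum notation is unfamiliar, the 'ijkl,ijkl->ikl' call above is just an element-wise product summed over the second (z) axis. A plain broadcasting sketch of the same reduction is shown below; it gives the same result but materialises the full 4D product, so it is slower and uses more memory than einsum:

    # Equivalent pure-NumPy reduction: multiply element-wise, sum over axis 1 (z),
    # then scale by dy via broadcasting over the trailing (y, x) axes.
    bsf_broadcast = (uvel * dz).sum(axis=1) * dy
    print(np.allclose(bsf_broadcast, np.einsum('ijkl,ijkl->ikl', uvel, dz) * dy))  # True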
