Speed up Python / Cython loops.

I am trying to make a loop in Python run as fast as possible, so I dived into NumPy and Cython. Here is the Python source code:

    import numpy as np

    def calculate_bsf_u_loop(uvel, dy, dz):
        """
        Calculate barotropic stream function from zonal velocity

        uvel (t,z,y,x)
        dy   (y,x)
        dz   (t,z,y,x)

        bsf  (t,y,x)
        """
        nt = uvel.shape[0]
        nz = uvel.shape[1]
        ny = uvel.shape[2]
        nx = uvel.shape[3]

        bsf = np.zeros((nt, ny, nx))

        for jn in range(0, nt):
            for jk in range(0, nz):
                for jj in range(0, ny):
                    for ji in range(0, nx):
                        bsf[jn, jj, ji] = bsf[jn, jj, ji] + uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]

        return bsf
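For reference, I time the function roughly as in the sketch below; the random test arrays and the scaled-down shapes are placeholders so the example fits in memory, not my actual model output:

    import time
    import numpy as np

    # Placeholder test data: shapes scaled down from (12, 75, 559, 1442) purely
    # for illustration; the real run uses the full-size model fields.
    nt, nz, ny, nx = 2, 15, 112, 288
    uvel = np.random.rand(nt, nz, ny, nx)
    dy = np.random.rand(ny, nx)
    dz = np.random.rand(nt, nz, ny, nx)

    t0 = time.perf_counter()
    bsf = calculate_bsf_u_loop(uvel, dy, dz)
    print("elapsed: %.2f s" % (time.perf_counter() - t0))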

This is just a sum over the k index. The array sizes are nt = 12, nz = 75, ny = 559, nx = 1442, so roughly 725 million elements. This took 68 seconds. Next I did it in Cython as

    import numpy as np
    cimport numpy as np
    cimport cython

    @cython.boundscheck(False)  # turn off bounds-checking for entire function
    @cython.wraparound(False)   # turn off negative index wrapping for entire function
    ## Use cpdef instead of def
    ## Define types for arrays
    cpdef calculate_bsf_u_loop(np.ndarray[np.float64_t, ndim=4] uvel,
                               np.ndarray[np.float64_t, ndim=2] dy,
                               np.ndarray[np.float64_t, ndim=4] dz):
        """
        Calculate barotropic stream function from zonal velocity

        uvel (t,z,y,x)
        dy   (y,x)
        dz   (t,z,y,x)

        bsf  (t,y,x)
        """
        ## cdef the constants
        cdef int nt = uvel.shape[0]
        cdef int nz = uvel.shape[1]
        cdef int ny = uvel.shape[2]
        cdef int nx = uvel.shape[3]

        ## cdef loop indices
        cdef ji, jj, jk, jn

        ## cdef the output array. Note that the cdef uses the Cython type,
        ## but the np.zeros call uses the Python (NumPy) dtype
        cdef np.ndarray[np.float64_t, ndim=3] bsf = np.zeros([nt, ny, nx], dtype=np.float64)

        for jn in xrange(0, nt):
            for jk in xrange(0, nz):
                for jj in xrange(0, ny):
                    for ji in xrange(0, nx):
                        bsf[jn, jj, ji] += uvel[jn, jk, jj, ji] * dz[jn, jk, jj, ji] * dy[jj, ji]

        return bsf

and it took 49 seconds. However, replacing the loop with

    for jn in range(0, nt):
        for jk in range(0, nz):
            bsf[jn, :, :] = bsf[jn, :, :] + uvel[jn, jk, :, :] * dz[jn, jk, :, :] * dy[:, :]

takes only 0.29 seconds! Unfortunately, I cannot do this in my full code.

Why is the NumPy slicing so much faster than the Cython loop? I thought NumPy was fast because it is Cython under the hood. Shouldn't they be the same speed?

As you can see, I turned off bounds checking in the Cython code, and I also compiled using "fast math". However, this gives only a slight speedup. Is there any way to make the loop as fast as the NumPy slicing, or will a loop always be slower than slicing?
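For completeness, my build script looks roughly like the sketch below; the module name and the exact compiler flags are from memory (assuming gcc/clang), so treat them as assumptions rather than my verbatim setup:

    # setup.py -- approximate build script; "bsf_cython" and the compiler flags are assumptions
    from setuptools import Extension, setup
    from Cython.Build import cythonize
    import numpy as np

    ext = Extension(
        "bsf_cython",
        sources=["bsf_cython.pyx"],
        include_dirs=[np.get_include()],            # needed for "cimport numpy"
        extra_compile_args=["-O3", "-ffast-math"],  # "fast math" on gcc/clang
    )

    setup(ext_modules=cythonize(ext))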

Any help is much appreciated! / Joachim

Tags: performance, python, arrays, numpy, cython
1 answer

This is a job for numpy.einsum, given that you are doing element-wise multiplication followed by a sum-reduction along the second axis of a 4D array, which is essentially what numpy.einsum does very efficiently. To solve your case, you can use numpy.einsum in two ways -

    bsf = np.einsum('ijkl,ijkl,kl->ikl', uvel, dz, dy)
    bsf = np.einsum('ijkl,ijkl->ikl', uvel, dz) * dy

Runtime tests and output verification -

    In [100]: # Take a (1/5)th of the original input shapes
         ...: original_shape = [12, 75, 559, 1442]
         ...: m,n,p,q = (np.array(original_shape)/5).astype(int)
         ...:
         ...: # Generate random arrays with given shapes
         ...: uvel = np.random.rand(m,n,p,q)
         ...: dy = np.random.rand(p,q)
         ...: dz = np.random.rand(m,n,p,q)
         ...:

    In [101]: bsf = calculate_bsf_u_loop(uvel,dy,dz)

    In [102]: print(np.allclose(bsf,np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)))
    True

    In [103]: print(np.allclose(bsf,np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy))
    True

    In [104]: %timeit calculate_bsf_u_loop(uvel,dy,dz)
    1 loops, best of 3: 2.16 s per loop

    In [105]: %timeit np.einsum('ijkl,ijkl,kl->ikl',uvel,dz,dy)
    100 loops, best of 3: 3.94 ms per loop

    In [106]: %timeit np.einsum('ijkl,ijkl->ikl',uvel,dz)*dy
    100 loops, best of 3: 3.96 ms per loop
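If the einsum notation is unfamiliar, the 'ijkl,ijkl->ikl' call above is just an element-wise product summed over the second (z) axis. A plain broadcasting sketch of the same reduction is shown below; it gives the same result but materialises the full 4D product, so it is slower and uses more memory than einsum:

    # Equivalent pure-NumPy reduction: multiply element-wise, sum over axis 1 (z),
    # then scale by dy via broadcasting over the trailing (y, x) axes.
    bsf_broadcast = (uvel * dz).sum(axis=1) * dy
    print(np.allclose(bsf_broadcast, np.einsum('ijkl,ijkl->ikl', uvel, dz) * dy))  # True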
