Slow C++ function performance in Cython

Question

I have this C++ function, which I can call from Python with the code below. The performance is only half that of pure C++. Is there a way to bring it to the same level? I compile both codes with the -Ofast -march=native flags. I do not understand where I can lose 50%, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I can avoid?

    namespace diff
    {
        void diff_cpp(double* __restrict__ at, const double* __restrict__ a,
                      const double visc,
                      const double dxidxi, const double dyidyi, const double dzidzi,
                      const int itot, const int jtot, const int ktot)
        {
            const int ii = 1;
            const int jj = itot;
            const int kk = itot*jtot;

            for (int k=1; k<ktot-1; k++)
                for (int j=1; j<jtot-1; j++)
                    for (int i=1; i<itot-1; i++)
                    {
                        const int ijk = i + j*jj + k*kk;
                        at[ijk] += visc * (
                                + ( (a[ijk+ii] - a[ijk   ])
                                  - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                                + ( (a[ijk+jj] - a[ijk   ])
                                  - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                                + ( (a[ijk+kk] - a[ijk   ])
                                  - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                                );
                    }
        }
    }

I have this .pyx file

    # import both numpy and the Cython declarations for numpy
    import cython
    import numpy as np
    cimport numpy as np

    # declare the interface to the C code
    cdef extern from "diff_cpp.cpp" namespace "diff":
        void diff_cpp(double* at, double* a, double visc,
                      double dxidxi, double dyidyi, double dzidzi,
                      int itot, int jtot, int ktot)

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
             np.ndarray[double, ndim=3, mode="c"] a not None,
             double visc, double dxidxi, double dyidyi, double dzidzi):
        cdef int ktot, jtot, itot
        ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
        diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
        return None

I call this function in Python

    import numpy as np
    import diff
    import time

    nloop = 20
    itot = 256
    jtot = 256
    ktot = 256
    ncells = itot*jtot*ktot

    at = np.zeros((ktot, jtot, itot))

    index = np.arange(ncells)
    a = (index/(index+1))**2
    a.shape = (ktot, jtot, itot)

    # Check results
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
    print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

    # Time the loop
    start = time.perf_counter()
    for i in range(nloop):
        diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
    end = time.perf_counter()

    print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))

This is setup.py:

    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    import numpy

    setup(
        cmdclass = {'build_ext': build_ext},
        ext_modules = [Extension("diff",
                                 sources=["diff.pyx"],
                                 language="c++",
                                 # each flag is its own list item
                                 extra_compile_args=["-Ofast", "-march=native"],
                                 include_dirs=[numpy.get_include()])],
    )

And here is the C++ reference program that achieves twice the performance:

    #include <iostream>
    #include <iomanip>
    #include <cstdlib>
    #include <stdlib.h>
    #include <cstdio>
    #include <ctime>
    #include "math.h"

    void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
    {
        for (int i=0; i<ncells; ++i)
        {
            a[i]  = pow(i,2)/pow(i+1,2);
            at[i] = 0.;
        }
    }

    void diff(double* const __restrict__ at, const double* const __restrict__ a,
              const double visc,
              const double dxidxi, const double dyidyi, const double dzidzi,
              const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                            + ( (a[ijk+jj] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }

    int main()
    {
        const int nloop = 20;
        const int itot = 256;
        const int jtot = 256;
        const int ktot = 256;
        const int ncells = itot*jtot*ktot;

        double *a  = new double[ncells];
        double *at = new double[ncells];

        init(a, at, ncells);

        // Check results
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);
        printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

        // Time performance
        std::clock_t start = std::clock();

        for (int i=0; i<nloop; ++i)
            diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);

        double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

        printf("time/iter = %fs (%i iters)\n",duration/(double)nloop, nloop);

        return 0;
    }

+7

c++ python numpy cython

Chiel Sep 29 '17 at 20:11

1 answer

ead · Accepted Answer · 2017-09-29T21:42:55+0000

The problem here is not what happens at run time, but which optimizations happen at compile time.

Which optimizations are performed depends on the compiler (and even on its version), and there is no guarantee that every optimization that could be performed actually will be.

There are actually two different reasons why Cython is slower, depending on whether you use g++ or clang++:

- g++ cannot optimize because of the -fwrapv flag in the Cython build
- clang++ cannot optimize in the first place (read on to see what happens).

First problem (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations cannot be performed.

If you look at the log of the setup.py build, you will see:

  x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native

As you said, -Ofast wins against -O2 because it comes last. But the problem is -fwrapv, which seems to prevent some optimizations, since signed integer overflow can no longer be treated as undefined behavior (UB) and exploited by the optimizer.
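The -fwrapv in that log is not chosen per project: distutils-style builds typically reuse the flags the Python interpreter itself was built with. As a quick check, assuming a Unix-like CPython where `sysconfig` exposes a `CFLAGS` build variable (it can be missing on other platforms), the following sketch lists those inherited flags. The helper name `inherited_cflags` is only illustrative.

```python
import shlex
import sysconfig


def inherited_cflags():
    """Return the C compiler flags recorded when this Python was built.

    Returns an empty list where the CFLAGS build variable is unavailable.
    """
    raw = sysconfig.get_config_var("CFLAGS") or ""
    return shlex.split(raw)


flags = inherited_cflags()
print("default flags:", flags)
print("-fwrapv inherited:", "-fwrapv" in flags)
```

If -fwrapv shows up here, it will also be on the command line of every extension module built with the default settings, ahead of anything passed through the extra arguments.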

So you have the following options:

- Add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which may be undesirable.
- Build a library from the cpp file with only the flags you want and link it to your Cython module. This solution has some overhead, but it has the advantage of being robust: as you pointed out, with different compilers different Cython flags could be the problem, so the first solution may be too brittle.
- I am not sure whether you can turn off the default flags, but maybe there is some information in the docs.
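The first two options could look roughly like this in setup.py. This is a sketch, not a tested build: the library name `diffcore`, its location, and the header-based declaration it implies are assumptions made for illustration. Each dictionary is meant to be passed on as `Extension("diff", **kwargs)`.

```python
def kwargs_no_wrapv(numpy_include):
    """Option 1: keep compiling diff_cpp.cpp as part of the extension.

    Extra arguments are appended after the default flags, so the trailing
    -fno-wrapv overrides the inherited -fwrapv.
    """
    return {
        "sources": ["diff.pyx"],
        "language": "c++",
        "extra_compile_args": ["-Ofast", "-march=native", "-fno-wrapv"],
        "include_dirs": [numpy_include],
    }


def kwargs_prebuilt_library(numpy_include, library_dir="."):
    """Option 2: link against a kernel library built separately.

    Assumes a hypothetical libdiffcore in library_dir, compiled by hand with
    exactly the flags you want, and a .pyx that declares diff_cpp from a
    header instead of including diff_cpp.cpp directly.
    """
    return {
        "sources": ["diff.pyx"],
        "language": "c++",
        "libraries": ["diffcore"],        # hypothetical library name
        "library_dirs": [library_dir],
        "include_dirs": [numpy_include, library_dir],
    }


# In setup.py, numpy_include would be numpy.get_include().
print(kwargs_no_wrapv("<numpy include dir>")["extra_compile_args"])
print(kwargs_prebuilt_library("<numpy include dir>")["libraries"])
```

With option 2 the kernel never sees the flags of the Cython build, which is why it does not depend on which defaults a given compiler or Python installation brings along.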

Second problem (clang++): inlining in the C++ test program.

When I compile your cpp program with my rather old g++ version 5.4:

  g++ test.cpp -o test -Ofast -march=native -fwrapv

it becomes almost 3 times slower than when compiled without -fwrapv. This, however, is a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the -fwrapv flag should not have any effect.
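The statement that no overflow is possible can be checked by hand. Below is a small worked calculation, assuming a 32-bit `int` as on common Linux x86-64 toolchains: the largest index the kernel touches is `ijk + kk`, evaluated at `i = itot-2`, `j = jtot-2`, `k = ktot-2`.

```python
INT_MAX = 2**31 - 1  # assumes a 32-bit int


def largest_index(itot, jtot, ktot):
    """Largest array index read by the stencil (the a[ijk+kk] access)."""
    jj = itot
    kk = itot * jtot
    ijk = (itot - 2) + (jtot - 2) * jj + (ktot - 2) * kk
    return ijk + kk


top = largest_index(256, 256, 256)
print(top)             # 16776958
print(INT_MAX // top)  # 128
```

For the 256 x 256 x 256 grid the index stays a factor of about 128 below the limit, which is information the optimizer has once the call with constant sizes is inlined, and lacks when it compiles the function for arbitrary sizes.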

My old clang++ version (3.8) seems to do a better job here: with the flags above, I don't see any performance degradation. I have to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.:

  clang++ test.cpp -o test -Ofast -march=native -fno-inline

So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, something Cython cannot do.

So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but was able to optimize it for size = 256. Cython, however, can only use the non-optimized version of diff. This is the reason why -fno-wrapv has no positive effect.

My takeaway: prevent inlining of the function of interest in the cpp tester (for example, by compiling it into its own object file) to ensure a level playing field with Cython; otherwise you measure the performance of a program that was specially optimized for this single input.
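One way to set up such a level comparison is to move the kernel into its own translation unit and link it into the tester, so that the optimizer compiles `diff` without seeing the constant sizes. The sketch below only assembles and prints the compiler commands. The file names `diff_kernel.cpp` and `test_main.cpp` are hypothetical, and it assumes link-time optimization is not enabled (with `-flto` the compiler could inline across object files again).

```python
import subprocess

FLAGS = ["-Ofast", "-march=native"]  # flags from the question


def build_commands(compiler="g++", extra_flags=()):
    """Commands that compile the kernel and the tester separately, then link."""
    flags = FLAGS + list(extra_flags)
    return [
        [compiler, "-c", "diff_kernel.cpp", "-o", "diff_kernel.o"] + flags,
        [compiler, "-c", "test_main.cpp", "-o", "test_main.o"] + flags,
        [compiler, "diff_kernel.o", "test_main.o", "-o", "test"],
    ]


def build(compiler="g++", extra_flags=(), dry_run=True):
    """Print the commands; run them only when dry_run is False."""
    commands = build_commands(compiler, extra_flags)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


build()                        # same flags as the pure C++ build
build("g++", ["-fwrapv"])      # additionally mimic the flag the Cython build inherits
```

Timing the two resulting binaries against the Cython module then compares like with like: in all three cases `diff` is compiled for arbitrary sizes.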

NB: A funny thing is that if all ints are replaced with unsigned ints, then naturally -fwrapv no longer plays any role, but the unsigned int version is as slow as the int version with -fwrapv. This is only logical, since there is no undefined behavior to be exploited.
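For readers unfamiliar with the term, "wrapping" means modular arithmetic on the fixed-width value. C++ always requires this for `unsigned int`, while for `int` it is only guaranteed under `-fwrapv`. The following illustration uses Python's `ctypes` fixed-width integers, which perform no overflow checking, purely to show the resulting values. It does not run any compiled C++ code.

```python
import ctypes

INT_MAX = 2**31 - 1      # largest 32-bit signed value
UINT_MAX = 2**32 - 1     # largest 32-bit unsigned value

# Unsigned: one past the maximum wraps around to zero.
print(ctypes.c_uint32(UINT_MAX + 1).value)   # 0

# Signed with wrapping semantics (what -fwrapv guarantees):
# INT_MAX + 1 becomes the most negative value.
print(ctypes.c_int32(INT_MAX + 1).value)     # -2147483648
```

Because the compiler must preserve exactly these results whenever wrapping is defined, it cannot simply assume that an index such as `ijk + kk` keeps growing, which is the assumption the faster code relies on.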
