The problem here is not what happens at run time, but which optimizations happen during compilation.
Which optimizations are performed depends on the compiler (or even its version), and there is no guarantee that every optimization that could be performed will be performed.
There are actually two different reasons why Cython is slower, depending on whether you use g++ or clang++:
- g++ cannot optimize because of the -fwrapv flag in the Cython build
- clang++ cannot optimize in the first place (read on to see what happens).
First problem (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations cannot be performed.
If you look at the setup log, you will see:
x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native
As you said, -Ofast wins against -O2 because it comes last. But the problem is -fwrapv, which seems to prevent some optimizations, since signed integer overflow can no longer be treated as UB and exploited for optimization.
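To make the effect of -fwrapv concrete, here is a minimal sketch. The functions are hypothetical illustrations, not taken from diff.cpp. Without -fwrapv the compiler is allowed to assume that signed arithmetic never overflows. With -fwrapv it has to preserve wrap-around behavior.

```cpp
#include <cassert>

// Hypothetical illustration, not taken from the question's diff.cpp.
// Without -fwrapv, signed overflow is undefined behavior, so a compiler is
// allowed to assume x + 1 > x for every int x and fold this to `return true`.
// With -fwrapv, x == INT_MAX wraps to INT_MIN, so the comparison must stay.
bool next_is_greater(int x) {
    return x + 1 > x;
}

// The same reasoning applies to index arithmetic in loops: if i * stride is
// allowed to wrap, the compiler has a harder time reasoning about the memory
// accesses (for example when widening the index to a 64-bit register).
long long sum_strided(const int* data, int n, int stride) {
    assert(n >= 0 && stride > 0);
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        total += data[i * stride];
    }
    return total;
}
```

Whether a given compiler version actually exploits this in your loop is something only the generated assembly can tell you.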
So you have the following options:
- Add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which may be undesirable.
- Build a library from the cpp file with only the flags you want and link it to your Cython module. This solution has some overhead, but it has the advantage of being robust: as you pointed out, different Cython flags may be the problem for different compilers, so the first solution may be too brittle.
- I'm not sure whether you can turn off the default flags, but maybe there is some information in the docs.
The second problem (clang++) is inlining in the test cpp program.
When I compile your cpp program with my rather old g++ (version 5.4):
g++ test.cpp -o test -Ofast -march=native -fwrapv
it becomes almost 3 times slower compared to compiling without -fwrapv. This, however, is a weakness of the optimizer: after inlining it should see that no signed integer overflow is possible (all sizes are about 256), so the -fwrapv flag should not have any effect.
My old clang++ version (3.8) seems to do a better job here: with the flags above, I don't see any performance degradation. I need to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.:
clang++ test.cpp -o test -Ofast -march=native -fno-inline
So there is a systematic bias in favor of your C++ program: the optimizer can optimize the code for the known values after inlining, something that Cython cannot do.
So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but was able to optimize it for size = 256. Cython, however, can only use the non-optimized version of diff. This is the reason why -fno-wrapv has no positive effect.
My take-away: prohibit inlining of the function of interest in the cpp tester (for example, compile it into its own object file) to ensure a level playing field with Cython. Otherwise you measure the performance of a program that was specially optimized for this single input.
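If a separate object file is inconvenient, one way to sketch the same idea inside a single file is the GCC/Clang `noinline` attribute. This is an alternative I am assuming here, not something from your code, and `abs_diff_sum` is a hypothetical stand-in, since the real diff may have a different signature.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the function under test.
// __attribute__((noinline)) is a GCC/Clang extension that keeps the function
// from being inlined into the benchmark loop, so its body is not specialized
// for the caller's constant sizes through inlining.
__attribute__((noinline))
long long abs_diff_sum(const int* a, const int* b, int n) {
    assert(n >= 0);
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        total += std::abs(a[i] - b[i]);
    }
    return total;
}
```

The attribute only blocks inlining. Other interprocedural optimizations may still see the constant arguments, so a separate object file (built without link-time optimization) remains the more robust choice.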
NB: A funny thing is that if all ints are replaced with unsigned ints, then naturally -fwrapv does not play any role, but the version with unsigned int is as slow as the int version with -fwrapv. This is only logical, since there is no undefined behavior to be exploited.
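A minimal sketch of why unsigned code leaves the optimizer no such freedom (the function names are hypothetical): unsigned arithmetic is defined to wrap modulo 2^N, so the comparison below really can be false and must be kept.

```cpp
#include <cassert>
#include <limits>

// Unsigned overflow is well defined: UINT_MAX + 1u == 0u. The compiler
// therefore cannot assume that x + 1u > x always holds.
bool next_is_greater_unsigned(unsigned int x) {
    return x + 1u > x;
}

// Small self-check of the wrap-around behavior.
void demo_unsigned_wrap() {
    const unsigned int max = std::numeric_limits<unsigned int>::max();
    assert(next_is_greater_unsigned(0u));
    assert(!next_is_greater_unsigned(max));
}
```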