Consider the following code:
    #include <limits>
    #include <cstdint>

    using T = uint32_t; // or uint64_t

    T shift(T x, T y, T n)
    {
        return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
    }
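To make the intent explicit: for T = uint32_t, the function is meant to return the low 32 bits of the 64-bit value y:x shifted right by n, which is what SHRD computes. Here is a minimal sketch of that reading. The names shift_orig, shift_wide and check_examples are mine and purely illustrative, and I have not checked what either compiler emits for the widened form. Note that the original expression is only defined for 0 < n < 32, because at n == 0 the left shift count becomes 32.

```cpp
#include <cassert>
#include <cstdint>

// Same expression as shift() above with T = uint32_t.
// Only defined for 0 < n < 32: at n == 0 the left shift count would be 32.
constexpr std::uint32_t shift_orig(std::uint32_t x, std::uint32_t y, std::uint32_t n) {
    return (x >> n) | (y << (32 - n));
}

// Hypothetical reference form: concatenate y:x into 64 bits, shift right by n,
// and keep the low 32 bits. Defined for 0 <= n < 64.
constexpr std::uint32_t shift_wide(std::uint32_t x, std::uint32_t y, std::uint32_t n) {
    return static_cast<std::uint32_t>(
        ((static_cast<std::uint64_t>(y) << 32) | x) >> n);
}

// Returns true if both forms agree for every n where shift_orig is defined.
inline bool forms_agree(std::uint32_t x, std::uint32_t y) {
    for (std::uint32_t n = 1; n < 32; ++n) {
        if (shift_orig(x, y, n) != shift_wide(x, y, n)) return false;
    }
    return true;
}

// Small runtime sanity check on one pair of inputs.
inline void check_examples() {
    assert(forms_agree(0x12345678u, 0x9ABCDEF0u));
    assert(shift_wide(0x12345678u, 0x9ABCDEF0u, 0) == 0x12345678u);
}
```

Both forms give the same result wherever the original is defined, so the question below is only about which instructions the compiler picks.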
According to godbolt, clang 3.8.1 generates the following assembly at -O1, -O2 and -O3:
    shift(unsigned int, unsigned int, unsigned int):
            movb    %dl, %cl
            shrdl   %cl, %esi, %edi
            movl    %edi, %eax
            retq
In contrast, gcc 6.2 (even with -mtune=haswell) generates:
    shift(unsigned int, unsigned int, unsigned int):
            movl    $32, %ecx
            subl    %edx, %ecx
            sall    %cl, %esi
            movl    %edx, %ecx
            shrl    %cl, %edi
            movl    %esi, %eax
            orl     %edi, %eax
            ret
This seems far less optimized, since SHRD is very fast on Intel Sandybridge and later. Is there any way to rewrite the function so that compilers (gcc in particular) optimize it better and favor the SHLD/SHRD instructions?
Or are there any gcc -mtune or other options that would encourage gcc to tune better for modern Intel CPUs?
With -march=haswell, gcc emits the BMI2 instructions shlx/shrx, but still not shrd.