Getting g++ to use the SHLD / SHRD instructions

Consider the following code:

    #include <limits>
    #include <cstdint>

    using T = uint32_t; // or uint64_t

    // Funnel shift: the low word shifted right by n, with bits from the
    // high word shifted in from above. Note that n == 0 would shift y by
    // the full type width, which is undefined behaviour, so 0 < n < digits
    // is assumed.
    T shift(T x, T y, T n)
    {
        return (x >> n) | (y << (std::numeric_limits<T>::digits - n));
    }

According to Godbolt, clang 3.8.1 generates the following assembly at -O1, -O2, and -O3:

    shift(unsigned int, unsigned int, unsigned int):
        movb    %dl, %cl
        shrdl   %cl, %esi, %edi
        movl    %edi, %eax
        retq

Meanwhile, gcc 6.2 (even with -mtune=haswell) generates:

    shift(unsigned int, unsigned int, unsigned int):
        movl    $32, %ecx
        subl    %edx, %ecx
        sall    %cl, %esi
        movl    %edx, %ecx
        shrl    %cl, %edi
        movl    %esi, %eax
        orl     %edi, %eax
        ret

This seems much less optimal, since SHRD runs very quickly on Intel Sandybridge and later. Does the function need to be rewritten to make it easier for compilers (gcc in particular) to optimize, and to enable the use of the SHLD / SHRD instructions?

Or are there -mtune or other gcc options that would encourage gcc to tune better for modern Intel processors?

With -march=haswell it emits BMI2 shlx / shrx, but still not shrd.
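
For illustration, one candidate rewrite of the uint32_t case — a sketch only; the helper name is made up, and whether any given compiler pattern-matches it into SHRD or a single 64-bit shift is not verified here — expresses the operation as one widening shift instead of two narrow ones:

    #include <cstdint>

    // Sketch for the uint32_t case only: concatenate y (high half) and
    // x (low half) into one 64-bit value and shift once. Unlike the
    // original expression, this is also well defined for n == 0.
    uint32_t shift_alt(uint32_t x, uint32_t y, uint32_t n)
    {
        uint64_t wide = (uint64_t(y) << 32) | x;
        return uint32_t(wide >> n); // low 32 bits of the funnel shift
    }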

1 answer

No, I see no way to force gcc to use SHRD.
You can influence the output that gcc generates by changing -mtune and -march.

Or are there -mtune or other gcc options that would encourage gcc to tune better for modern Intel processors?

Yes, you can get gcc to generate BMI2 code:

For example, x86-64 gcc 6.2 with -O3 -march=znver1 (AMD Zen) generates the following (timings shown are for Haswell):

    code                     critical path latency   reciprocal throughput
    ----------------------------------------------------------------------
    mov  eax, 32                       *                     0.25
    sub  eax, edx                      1                     0.25
    shlx eax, esi, eax                 1                     0.5
    shrx esi, edi, edx                 *                     0.5
    or   eax, esi                      1                     0.25
    ret
    TOTAL:                             3                     1.75

Compared to clang 3.8.1:

    code                     critical path latency   reciprocal throughput
    ----------------------------------------------------------------------
    mov  cl, dl                        1                     0.25
    shrd edi, esi, cl                  4                     2
    mov  eax, edi                      *                     0.25
    ret
    TOTAL:                             5                     2.25

Given the dependency chains here: SHRD is slower on Haswell, roughly ties on Sandybridge, and is slower on Skylake.
The shlx / shrx sequence comes out faster.

So it depends on the processor: on post-BMI2 processors gcc produces the better code; on pre-BMI2 processors clang wins.
SHRD has wildly varying timings across different processors, so I can see why gcc is not too fond of it.
Even with -Os (optimize for size) gcc still does not select SHRD.

*) Not counted in the total, because it is either not on the critical path or turns into a zero-latency register rename.
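
If SHRD is wanted regardless of what the optimizer chooses, GNU extended inline assembly can force it. A minimal sketch (the function name and constraint choices are mine; this is not something gcc emits on its own):

    #include <cstdint>

    // Force SHRD by hand with GNU extended asm (illustrative sketch).
    uint32_t shift_shrd(uint32_t x, uint32_t y, uint32_t n)
    {
        asm("shrdl %%cl, %1, %0"
            : "+r"(x)          // x: the shifted destination, read and written
            : "r"(y), "c"(n)   // y: bits shifted in; n pinned to ECX so CL holds the count
            : "cc");           // the flags are clobbered
        return x;
    }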

