Can a shift using the CL register result in a partial register stall?

Can a variable shift generate a partial register stall (or register-recombining μops) on ecx? If so, on which microarchitecture(s)?

I tested this on Core2 (65nm), which seems to read only cl.

    _shiftbench:
        push rbx
        mov edx, -10000000
        mov ecx, 5
    _shiftloop:
        mov bl, 5        ; replace by cl to see possible recombining
        shl eax, cl
        add edx, 1
        jnz _shiftloop
        pop rbx
        ret

Replacing mov bl, 5 with mov cl, 5 made no difference, which it would have if register recombining were taking place. That can be demonstrated by replacing shl eax, cl with add eax, ecx: in my tests, the version with add experienced a 2.8x slowdown when writing to cl instead of bl.
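To make the term "recombining" concrete, here is a minimal C sketch of the architectural effect of writing cl and then reading either cl or the full ecx. It models register contents only, not timing. The function names (write_cl, read_cl, demo_partial_write) are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural effect of `mov cl, imm8`: only bits 0..7 of ECX change,
   bits 8..31 keep their previous contents. */
uint32_t write_cl(uint32_t ecx, uint8_t imm8) {
    return (ecx & 0xFFFFFF00u) | imm8;
}

/* What `shl eax, cl` needs from ECX: the low byte only. */
uint8_t read_cl(uint32_t ecx) {
    return (uint8_t)(ecx & 0xFFu);
}

/* Small usage example (hypothetical values). */
void demo_partial_write(void) {
    uint32_t ecx = 0x12345678u;
    ecx = write_cl(ecx, 5);
    assert(ecx == 0x12345605u);   /* a full-width reader such as `add eax, ecx` needs this merged value */
    assert(read_cl(ecx) == 5);    /* a reader of CL alone needs only the new byte */
}
```

An instruction that reads all of ecx after a write to cl needs the merged value, which is where a recombining cost can appear. An instruction that reads only cl does not need the upper 24 bits at all.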


Test results:

  • Merom: no stall observed
  • Penryn: no stall observed
  • Nehalem: no stall observed

Update: The new shrx group of shifts in Haswell does show that stall. The shift-count argument is not written as an 8-bit register, so that might have been expected, but the textual representation really says nothing about such microarchitectural details.

1 answer

As currently phrased ("Can a shift using the CL register ..."), the title of the question contains its own answer: on a modern processor there can never be a partial register stall on CL, because CL can never be recombined from something smaller.

Yes, the processor knows that the amount you shift by is effectively contained in CL, or more exactly, in the 5 or 6 least significant bits of CL. One way it could have stalled on ECX is if the granularity at which it tracked dependencies between instructions went no finer than full registers. This worry is obsolete, though: the last Intel processor that would have treated the entire ECX register as a dependency was the Pentium 4. See Agner Fog's unofficial optimization guide, page 121. Even then, on the P4 this would not be called a partial register stall; the program could only be the victim of a false dependency (say, if CH was modified just before the shift).
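The "5 or 6 least significant bits" point can be sketched in C as follows. This is only a model of the count masking that x86 applies to 32-bit and 64-bit shifts by CL, not real hardware behaviour in any other respect. The function names are hypothetical. The explicit mask is needed because shifting by the operand width or more is undefined in C itself.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `shl r32, cl`: only the low 5 bits of CL are used as the count. */
uint32_t shl32_by_cl(uint32_t value, uint8_t cl) {
    return value << (cl & 31u);
}

/* Model of `shl r64, cl`: only the low 6 bits of CL are used as the count. */
uint64_t shl64_by_cl(uint64_t value, uint8_t cl) {
    return value << (cl & 63u);
}

/* Small usage example (hypothetical values). */
void demo_shift_count_masking(void) {
    assert(shl32_by_cl(1u, 33) == 2u);                    /* 33 & 31 == 1 */
    assert(shl64_by_cl(1u, 33) == (UINT64_C(1) << 33));   /* 33 & 63 == 33 */
    assert(shl64_by_cl(1u, 65) == 2u);                    /* 65 & 63 == 1 */
}
```

Since the count never depends on anything above bit 5 of CL, the shift has no architectural need for the rest of ECX.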
