For most architectures with fixed-width instructions, the answer will probably be a boring one-instruction mov of a sign-extended or inverted immediate, or a mov lo/high pair. For example, on ARM, mvn r0, #0 (move-not). See the gcc asm output for x86, ARM, ARM64, and MIPS on the Godbolt compiler explorer. IDK anything about zseries asm or machine code.
On ARM, eor r0,r0,r0 is significantly worse than a mov-immediate. It depends on the old value, with no special-case handling. Memory dependency-ordering rules prevent an ARM uarch from special-casing it even if they wanted to. The same goes for most other RISC ISAs with weakly-ordered memory that nevertheless don't require barriers for memory_order_consume (in C++11 terminology).
x86 xor-zeroing is special because of its variable-length instruction set. Historically, 8086 xor ax,ax was fast directly because it was small. Since the idiom became widely used (and zeroing is much more common than all-ones), CPU designers gave it special support, and now xor eax,eax is faster than mov eax,0 on the Intel Sandybridge family and some other processors, even without counting the direct and indirect code-size effects. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for as many microarchitectural benefits as I've managed to dig up.
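For illustration, a minimal sketch of the two zeroing forms (NASM syntax); the byte counts come from the standard encodings, and the dependency-breaking note applies to Sandybridge-family CPUs as described above.

```asm
; Two ways to zero a 32-bit register:
xor  eax, eax        ; 2 bytes; recognized zeroing idiom: no input dependency,
                     ; and on SnB-family it is handled in the front-end
                     ; without needing an execution unit.
mov  eax, 0          ; 5 bytes (mov r32, imm32); always needs an ALU uop
                     ; and gets no special-case treatment.
```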
If x86 had a fixed-width instruction set, I wonder whether mov reg, 0 would have gotten the same special treatment that xor-zeroing has? Perhaps, because dependency-breaking before writing the low8 or low16 is important.
The standard options for best performance:
- mov eax, -1: 5 bytes, using the mov r32, imm32 encoding. (There is unfortunately no sign-extended imm8 version of mov.) Excellent performance on all CPUs. 6 bytes for r8-r15 (REX prefix).
- mov rax, -1: 7 bytes, using the mov r/m64, sign-extended-imm32 encoding. (Not the REX.W=1 version of the eax one; that would be the 10-byte mov r64, imm64.) Excellent performance on all CPUs.
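A sketch of those two encodings (NASM syntax); the byte sequences are the standard ones for these opcodes.

```asm
; All-ones in a 32-bit register:
mov  eax, -1         ; B8 FF FF FF FF        = 5 bytes (mov r32, imm32)
                     ; eax = 0xFFFFFFFF; the upper half of rax is zeroed.

; All-ones in a 64-bit register:
mov  rax, -1         ; 48 C7 C0 FF FF FF FF  = 7 bytes
                     ; mov r/m64, sign-extended imm32; rax = all-ones.
```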
The weird options that save some code size, usually at the cost of performance:
- xor eax,eax / dec rax (or not rax): 5 bytes (4 for 32-bit eax). Downside: two uops for the front-end. Still only one unfused-domain uop for the scheduler/execution units on recent Intel, where xor-zeroing is handled in the front-end. mov-immediate always needs an execution unit. (But integer ALU throughput is rarely a bottleneck for instructions that can run on any port; the extra front-end pressure is the problem.)
- xor ecx,ecx / lea eax, [rcx-1]: only 5 bytes total for 2 constants (6 bytes for rax): leaves a separate zeroed register. If you already want a zeroed register, there is almost no downside. lea can run on fewer ports than mov r,i on most CPUs, but since this starts a new dependency chain, the CPU can run it in any spare execution-port cycle after it issues.
The same trick works for any two nearby constants, if you do the first with mov reg, imm32 and the second with lea r32, [base + disp8]. disp8 has a range of -128 to +127; otherwise you need a disp32. (A short sketch of these code-size tricks follows after this list.)
- or eax, -1: 3 bytes (4 for rax), using the or r/m32, sign-extended-imm8 encoding. Downside: false dependency on the old value of the register.
- push -1 / pop rax: 3 bytes. Slow but small. Recommended for exploits / code golf only. Works for any sign-extended-imm8, unlike most of the others.
Downsides:
- Uses a store and a load execution unit, not the ALU. (Possibly a throughput advantage in rare cases on the AMD Bulldozer family, where there are only two integer execution pipes but decode/issue/retire throughput is higher than that. But don't try it without testing.)
- Store/reload latency means rax won't be ready for ~5 cycles after this executes, e.g. on Skylake.
- (Intel): puts the stack engine into rsp-modified mode, so the next time you read rsp directly it will take a stack-sync uop (e.g. for add rsp, 28, or for mov eax, [rsp+8]).
- The store could miss in cache, triggering extra memory traffic. (Possible if you haven't touched the stack inside a long loop.)
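As referenced above, here is a sketch of the smaller-but-slower sequences (NASM syntax); the byte counts come from the standard encodings, and the per-uarch caveats are the ones listed in the items above.

```asm
; 5 bytes, two uops, leaves a spare zeroed register in ecx:
xor  ecx, ecx            ; 2 bytes; dependency-breaking zeroing idiom
lea  eax, [rcx-1]        ; 3 bytes; eax = 0 - 1 = 0xFFFFFFFF

; The same pattern for any two nearby constants:
mov  edx, 1000           ; 5 bytes, mov r32, imm32
lea  esi, [rdx+42]       ; 3 bytes; esi = 1042 (disp8 range is -128..+127)

; 3 bytes, but a false dependency on the old value of eax:
or   eax, -1             ; 83 C8 FF

; 3 bytes, slow (store + reload), code-golf only:
push -1                  ; 6A FF
pop  rax                 ; 58
```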
Vector registers are different
Setting vector registers to all-ones with pcmpeqd xmm0,xmm0 is special-cased on most CPUs as dependency-breaking (not Silvermont/KNL), but it still needs an execution unit to actually write the ones. pcmpeqb/w/d/q all work, but q is slower on some CPUs.
For AVX2, the ymm equivalent vpcmpeqd ymm0, ymm0, ymm0 is also the best choice.
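A minimal sketch of the compare-against-itself idiom for SSE2 and AVX2 (NASM syntax); whether it is treated as dependency-breaking follows the per-uarch notes above.

```asm
; 128-bit all-ones (SSE2): compare a register with itself for equality;
; every element compares equal, so every bit of the result is set.
pcmpeqd  xmm0, xmm0              ; xmm0 = all-ones

; 256-bit all-ones (AVX2):
vpcmpeqd ymm0, ymm0, ymm0        ; ymm0 = all-ones
```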
For AVX without AVX2, the choice is less clear: there is no single obvious best approach. Compilers use various strategies: gcc prefers to load a 32-byte constant with vmovdqa, while older clang uses a 128-bit vpcmpeqd followed by a cross-lane vinsertf128 to fill the high half. Newer clang uses vxorps to zero a register and then vcmptrueps to fill it with ones. That is the moral equivalent of the vpcmpeqd approach, but the vxorps is needed to break the dependency on the previous value of the register, and vcmptrueps has a latency of 3. It makes a reasonable default choice.
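A sketch of that newer-clang sequence for 256-bit all-ones without AVX2 (NASM syntax); the latency and dependency notes are the ones given above.

```asm
; AVX1-only all-ones in a ymm register:
vxorps     xmm0, xmm0, xmm0        ; zero the register (dependency-breaking;
                                   ; the VEX encoding also zeroes the high lane)
vcmptrueps ymm0, ymm0, ymm0        ; "always true" compare predicate: every
                                   ; element compares true, so ymm0 = all-ones
```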
Doing a vbroadcastss from a 32-bit value in memory is probably strictly better than the 32-byte load approach, but it's hard to get compilers to generate it.
The best approach probably depends on the surrounding code.
Related: Fastest way to set __m256 value to all ONE bits
AVX512 compares are only available with a mask register (like k0) as the destination, so compilers currently use vpternlogd zmm0,zmm0,zmm0, 0xff as the 512-bit all-ones idiom. (0xff makes every element of the 3-input truth table a 1.) This is not special-cased as dependency-breaking on KNL or SKL, but it has 2-per-clock throughput on Skylake-AVX512. That beats using a narrower dependency-breaking AVX all-ones and broadcasting or shuffling it.
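A sketch of that 512-bit idiom (NASM syntax); the truth-table immediate is the only part doing real work here.

```asm
; 512-bit all-ones via ternary logic: with imm8 = 0xFF, every combination
; of the three source bits maps to 1, so every destination bit is set
; regardless of the inputs.
vpternlogd zmm0, zmm0, zmm0, 0xff   ; zmm0 = all-ones (not dependency-breaking)
```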
If you need to regenerate all-ones inside a loop, clearly the most efficient way is to use a vmov* to copy an all-ones register. That doesn't even use an execution unit on modern CPUs (though it still takes front-end issue bandwidth). But if you're out of vector registers, loading a constant or [v]pcmpeq[b/w/d] are good choices.
For AVX512, it's worth trying VPMOVM2D zmm0, k0 or maybe VPBROADCASTD zmm0, eax. Each of those has only 1-per-clock throughput, but they should break the dependency on the old value of zmm0 (unlike vpternlogd). They require a mask or integer register that you initialized outside the loop with kxnorw k1,k0,k0 or mov eax, -1.
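A hedged sketch of that loop structure (NASM syntax); the loop label, trip-count register, and the choice of k1 as the all-ones source are placeholders, and the throughput/dependency notes are the ones above.

```asm
; Outside the loop: materialize an all-ones source once.
kxnorw  k1, k0, k0            ; k1 = all 16 mask bits set
; or:  mov eax, -1

.loop:
    ; Re-generate 512-bit all-ones each iteration without depending
    ; on the previous value of zmm0:
    vpmovm2d    zmm0, k1      ; each set mask bit -> an all-ones dword element
    ; or:  vpbroadcastd zmm0, eax
    ; ... loop body that clobbers zmm0 ...
    sub     ecx, 1
    jnz     .loop
```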
For AVX512 mask registers, kxnorw k1,k0,k0 works, but it is not dependency-breaking on current CPUs. Intel's optimization guide suggests using it to generate an all-ones mask before a gather instruction, but recommends avoiding using the same input register as the output. This avoids making an otherwise-independent gather dependent on a previous one in a loop. Since k0 is often unused, it is usually a good choice to read from.
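For illustration, a sketch of the gather pattern that recommendation describes (NASM syntax; the registers and the addressing expression are placeholders):

```asm
; Generate the all-ones mask without making it depend on the gather's own
; output mask: read from k0 (usually unused), write a different register.
kxnorw     k1, k0, k0                    ; k1 = all-ones
vpgatherdd zmm1{k1}, [rdi + zmm2*4]      ; the gather clears k1 bits as
                                         ; elements complete
```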
I think vpcmpeqd k1, zmm0,zmm0 would work, but it is probably not special-cased as a k1 = all-ones idiom with no dependency on zmm0. (To set all 64 mask bits instead of just the low 16, use AVX512BW vpcmpeqb.)
On Skylake-AVX512, k instructions that operate on mask registers only run on a single port, even simple ones like kandw. (Also note that Skylake-AVX512 won't run vector uops on port 1 when there are any 512-bit operations in the pipeline, so execution-unit throughput can be a real bottleneck.)
There is no kmov k0, imm, only moves from an integer register or memory. There are probably no k instructions where same,same is detected as special, so the hardware at the issue/rename stage doesn't look for that for k registers.