First of all, if you're doing other things with your 256b integers (like adding / subtracting / multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of moving them over. If the two numbers are already in registers (using up to 8 full registers), only four xor instructions are needed to get the result (plus 4 mov instructions if you need to avoid overwriting a destination). The destructive version can run at one per 1.33 clocks throughput on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
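A minimal C sketch of that scalar approach (the function name and the four-limb array layout are my assumptions; a compiler keeps the limbs in GP registers when it can):

#include <stdint.h>

/* XOR two 256-bit integers held as four 64-bit limbs: just four xor
   instructions, plus loads/stores only if the values aren't already
   live in registers. */
void xor256_scalar(const uint64_t a[4], const uint64_t b[4], uint64_t result[4])
{
    result[0] = a[0] ^ b[0];
    result[1] = a[1] ^ b[1];
    result[2] = a[2] ^ b[2];
    result[3] = a[3] ^ b[3];
}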
Storing to memory in 64-bit chunks and then doing a 128b or 256b load would cause a store-forwarding stall, adding several more cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: a 256b store -> 64b loads is fine for store forwarding. movq / pextrq would still suck, but would have lower latency (at the cost of more uops).
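To make the pitfall concrete, here is a hypothetical sketch of the round trip to avoid (names are mine; whether a compiler emits exactly this depends on optimization, but it shows the pattern of narrow stores followed by a wide reload):

#include <immintrin.h>
#include <stdint.h>

/* Anti-pattern sketch: build the result in 64-bit GP chunks, store them,
   then reload the whole buffer as one 256-bit vector. The wide load has to
   wait for all four narrow stores, i.e. a store-forwarding stall. */
__m256i widen_result(uint64_t a0, uint64_t a1, uint64_t a2, uint64_t a3,
                     uint64_t b0, uint64_t b1, uint64_t b2, uint64_t b3)
{
    uint64_t tmp[4] = { a0 ^ b0, a1 ^ b1, a2 ^ b2, a3 ^ b3 };  /* four 64-bit stores */
    return _mm256_loadu_si256((const __m256i *)tmp);           /* 256-bit reload stalls */
}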
FP load / store / bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD floating point exceptions
Overflow, Underflow, Invalid, Precision, Denormal.
VMOVAPS
SIMD floating point exceptions
None.
(From the Intel insn ref manual. See the x86 wiki for links to this and other materials.)
On Intel hardware, either flavour of load / store can go to the FP or the integer domain without extra latency. AMD behaves the same whichever flavour of load / store is used, regardless of where the data is going to or coming from.
The different flavours of the vector move instruction do matter for register<->register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On the AMD Bulldozer family, where moves are handled by register renaming rather than by actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about handles movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for a design with separate double and single domains, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction simply has a 66h prefix byte tacked on, or F2 instead of F3 for scalar instructions.)
AMD designs apparently tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think a movaps between two addpd, or even a xorps, would cause that problem, though: only actual FP math does. (FP bitwise boolean ops are integer-domain on the Bulldozer family.)
Theoretical throughput on Intel SnB / IvB (Intel's only CPUs with AVX but not AVX2):
256b operations with AVX xorps
VMOVDQU   ymm0, [A]
VXORPS    ymm0, ymm0, [B]
VMOVDQU   [result], ymm0
These 3 fused-domain uops can issue at one iteration per 0.75 cycles, since the pipeline width is 4 fused-domain uops per clock. (That assumes the addressing modes used for B and result can micro-fuse; otherwise it's 5 fused-domain uops.)
load ports: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but that frees up the AGU on port 2/3 to be used by the store. There is a dedicated store-data port, but the store-address calculation needs an AGU from a load port.
So with only 128b or smaller loads / stores, SnB / IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB / IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible in practice, though.
Haswell has a dedicated store-address port, can sustain two 256b loads and one 256b store per cycle, and has no cache-bank conflicts. So Haswell is much faster when everything fits in L1 cache.
Bottom line: theoretically (with no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port 5 (the only port xorps can run on) is only needed once every two clocks.
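For reference, a hedged C intrinsics sketch of the 3-instruction sequence above (function name and void* arguments are my assumptions). Since 256b VPXOR needs AVX2, the XOR is spelled with the FP-flavoured _mm256_xor_ps, which is safe on integer data because FP bitwise ops never raise exceptions; a compiler will emit vmovups rather than vmovdqu for the loads / stores, which behaves identically:

#include <immintrin.h>

void xor256_avx(const void *a, const void *b, void *result)
{
    __m256 va = _mm256_loadu_ps((const float *)a);             /* 256b load */
    __m256 vb = _mm256_loadu_ps((const float *)b);             /* 256b load */
    _mm256_storeu_ps((float *)result, _mm256_xor_ps(va, vb));  /* vxorps + 256b store */
}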
128b ops
VMOVDQU   xmm0, [A]
VMOVDQU   xmm1, [A+16]
VPXOR     xmm0, xmm0, [B]
VPXOR     xmm1, xmm1, [B+16]
VMOVDQU   [result], xmm0
VMOVDQU   [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It also uses 2x as much space in the uop cache, and more x86 machine-code size. Barring cache-bank conflicts, this should run with a throughput of one 256b XOR per 3 clocks.
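The same 128b-halves approach as intrinsics (a sketch; the names are mine). A compiler will normally fold the two loads of B into the VPXOR memory operands, as in the asm above:

#include <immintrin.h>

void xor256_sse(const void *a, const void *b, void *result)
{
    __m128i a_lo = _mm_loadu_si128((const __m128i *)a);
    __m128i a_hi = _mm_loadu_si128((const __m128i *)a + 1);
    __m128i b_lo = _mm_loadu_si128((const __m128i *)b);
    __m128i b_hi = _mm_loadu_si128((const __m128i *)b + 1);
    _mm_storeu_si128((__m128i *)result,     _mm_xor_si128(a_lo, b_lo));
    _mm_storeu_si128((__m128i *)result + 1, _mm_xor_si128(a_hi, b_hi));
}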
In registers
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock gives the most XOR-ing per cycle. (xorps and pxor do the same thing, but xorps's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to produce an XOR result in the FP domain, so Intel CPUs after Nehalem only run XORPS on one port.)
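In intrinsics, that domain choice is just the choice of intrinsic family; a small sketch (function names are mine):

#include <immintrin.h>

/* _mm_xor_ps maps to XORPS: the result forwards to FP units with no bypass
   delay, but post-Nehalem Intel runs it on only one port.
   _mm_xor_si128 maps to PXOR: on SnB it can run on ports 0/1/5. */
__m128  xor_fp_domain (__m128  a, __m128  b) { return _mm_xor_ps(a, b); }
__m128i xor_int_domain(__m128i a, __m128i b) { return _mm_xor_si128(a, b); }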
Z Boson's hybrid idea:
VMOVDQU      ymm0, [A]
VMOVDQU      ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR        xmm0, xmm0, xmm4
VPXOR        xmm1, xmm1, xmm5
VMOVDQU      [res], xmm0
VMOVDQU      [res+16], xmm1
Even more fused-domain uops (8) than just doing everything with 128b ops.
Load / store: the two 256b loads leave two spare cycles for the two store addresses to be generated, so this can still run at two loads / one 128b store per cycle.
ALU: two port-5 uops (vextractf128), two port-0/1/5 uops (vpxor).
So it still has a throughput of one 256b result per 2 clocks, but it saturates more resources and has no advantage (on Intel) over the 3-instruction 256b version.
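For completeness, a hedged intrinsics sketch of that hybrid (function name is mine; everything here is AVX1-only):

#include <immintrin.h>

void xor256_hybrid(const void *a, const void *b, void *result)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);           /* 256b loads  */
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m128i lo = _mm_xor_si128(_mm256_castsi256_si128(va),         /* low halves  */
                               _mm256_castsi256_si128(vb));
    __m128i hi = _mm_xor_si128(_mm256_extractf128_si256(va, 1),    /* high halves */
                               _mm256_extractf128_si256(vb, 1));
    _mm_storeu_si128((__m128i *)result,     lo);                   /* 128b stores */
    _mm_storeu_si128((__m128i *)result + 1, hi);
}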