Bitwise xor of two 256-bit integers

I have a processor with AVX but not AVX2, and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, I could load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps. Will this produce the expected result?

My main concern is that the memory contents are not valid floating-point numbers. Will _mm256_load_ps load the bits into the register exactly as they are in memory?

Thanks.

Tags: sse avx simd
3 answers

First of all, if you're doing other things with your 256-bit integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have the two numbers in registers (using up 8 registers total), it only takes four xor instructions to get the result (and four mov instructions if you need to avoid overwriting a destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single XOR in between some add/adc or whatever, stick with integers.
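
A minimal scalar sketch of this (the type and function names here are mine, not from the question):

    #include <stdint.h>

    /* A 256-bit integer stored as four 64-bit limbs. */
    typedef struct { uint64_t limb[4]; } u256;

    static inline u256 xor256_scalar(u256 a, u256 b)
    {
        u256 r;
        r.limb[0] = a.limb[0] ^ b.limb[0];  /* four 64-bit XORs total */
        r.limb[1] = a.limb[1] ^ b.limb[1];
        r.limb[2] = a.limb[2] ^ b.limb[2];
        r.limb[3] = a.limb[3] ^ b.limb[3];
        return r;
    }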

Storing to memory in 64-bit chunks and then doing a 128b or 256b load would cause a store-forwarding stall, adding several more cycles of latency. Using movq / pinsrq would cost more execution resources than the xor itself. Going the other way isn't as bad: a 256b store → 64b loads is fine for store forwarding. movq / pextrq would still suck, but would have lower latency (at the cost of more uops).
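
A minimal sketch of the forwarding-friendly direction (the function name is mine): one wide store followed by a narrow reload, rather than narrow stores followed by a wide reload.

    #include <x86intrin.h>
    #include <stdint.h>

    /* 256b store, then a 64b reload: the store forwards to the load fine.
       The reverse (four 64b stores, one 256b reload) would stall. */
    static inline uint64_t low_qword(__m256 v)
    {
        uint64_t tmp[4];
        _mm256_storeu_ps((float*)tmp, v);  /* one 256b store */
        return tmp[0];                     /* 64b reload of the low qword */
    }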


FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:

VADDPS

SIMD floating point exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS

SIMD floating point exceptions
None.

(From the Intel insn ref manual. See the x86 wiki for links to that and other resources.)
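
To illustrate the guarantee, here is a small test of mine (not from the manual): XORing a signalling-NaN bit pattern with _mm256_xor_ps raises no FP exception, because it is a bitwise operation, not FP math.

    #include <x86intrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0x7F800001 is a signalling-NaN bit pattern for a 32-bit float. */
        uint32_t a[8] = { 0x7F800001, 1, 2, 3, 4, 5, 6, 7 };
        uint32_t b[8] = { 0xFFFFFFFF, 1, 2, 3, 4, 5, 6, 7 };
        __m256 va = _mm256_loadu_ps((float*)a);  /* load: no exception */
        __m256 vb = _mm256_loadu_ps((float*)b);
        __m256 vc = _mm256_xor_ps(va, vb);       /* bitwise: no exception */
        uint32_t c[8];
        _mm256_storeu_ps((float*)c, vc);         /* store: no exception */
        printf("%08x\n", (unsigned)c[0]);        /* prints 807ffffe */
        return 0;
    }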

On Intel hardware, any flavour of load/store can forward to the FP or the integer domain with no extra latency. AMD behaves similarly: whichever flavour of load/store is used, it doesn't matter which domain the data is coming from or going to.

The different variants of the vector move instruction do matter for register↔register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On the AMD Bulldozer family, where moves are handled by register renaming rather than by copying the data (as on Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.

No existing design that I've read about handles movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design with a double domain and a single domain that have different forwarding networks). (movapd is movaps with a 66h prefix, just as the double version of every other SSE instruction simply has the 66h prefix byte tacked on, or F2 instead of F3 for the scalar instructions.)

AMD apparently tags FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input of addpd, for example. I don't think a movaps between two addpd, or even a xorps, would cause that problem, though: only actual FP math does. (FP bitwise logical operations run in the integer domain on the Bulldozer family.)


Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):

256b operations with AVX xorps

    VMOVDQU ymm0, [A]
    VXORPS  ymm0, ymm0, [B]
    VMOVDQU [result], ymm0
  • 3 fused-domain uops can issue at one per 0.75 cycles, since the pipeline width is 4 fused-domain uops. (This assumes the addressing modes you use for B and result can micro-fuse; otherwise it's 5 fused-domain uops.)

  • load ports: 256b loads/stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There is a dedicated store-data port, but store-address calculation needs the AGU from a load port.

    So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB could theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts make that impossible in practice, though.

    Haswell has a dedicated store-address port and can sustain two 256b loads and one 256b store per cycle, and it has no cache-bank conflicts. So Haswell is much faster when everything is in L1 cache.

Bottom line: theoretically (with no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port 5 (the only port xorps can run on) is needed once every two clocks.


128b ops

    VMOVDQU xmm0, [A]
    VMOVDQU xmm1, [A+16]
    VPXOR   xmm0, xmm0, [B]
    VPXOR   xmm1, xmm1, [B+16]
    VMOVDQU [result], xmm0
    VMOVDQU [result+16], xmm1

This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use twice as much space in the uop cache, and more x86 machine-code size. Barring cache-bank conflicts, this should run with a throughput of one 256b XOR per 3 clocks.


In registers

Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (xorps and pxor do the same thing, but xorps output can forward to an FP execution unit without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to get an XOR result into the FP domain, which is why Intel CPUs after Nehalem run xorps on only one port.)


Z Boson's hybrid idea:

    VMOVDQU      ymm0, [A]
    VMOVDQU      ymm4, [B]
    VEXTRACTF128 xmm1, ymm0, 1
    VEXTRACTF128 xmm5, ymm4, 1
    VPXOR        xmm0, xmm0, xmm4
    VPXOR        xmm1, xmm1, xmm5
    VMOVDQU      [res], xmm0
    VMOVDQU      [res+16], xmm1

More fused-domain uops (8) than just doing everything with 128b ops.

Load/store: the two 256b loads leave two spare cycles for the two store addresses to be generated, so this can still run at two loads / one store (128b each) per cycle.

ALU: two port-5 uops (vextractf128), two port-0/1/5 uops (vpxor).

So this still has a throughput of one 256b result per 2 cycles, but it saturates more resources and has no advantage (on Intel) over the 3-instruction 256b version.


There is no problem using _mm256_load_ps to load integers. In fact, in this case it's better than using _mm256_load_si256 (which does work with AVX), because you stay in the floating-point domain with _mm256_load_ps.

    #include <x86intrin.h>
    #include <stdio.h>

    int main(void) {
        int a[8] = {1,2,3,4,5,6,7,8};
        int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
        __m256 a8 = _mm256_loadu_ps((float*)a);
        __m256 b8 = _mm256_loadu_ps((float*)b);
        __m256 c8 = _mm256_xor_ps(a8, b8);
        int c[8];
        _mm256_storeu_ps((float*)c, c8);
        printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
    }

If you want to stay in the integer domain, you can do:

    #include <x86intrin.h>
    #include <stdio.h>

    int main(void) {
        int a[8] = {1,2,3,4,5,6,7,8};
        int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};
        __m256i a8 = _mm256_loadu_si256((__m256i*)a);
        __m256i b8 = _mm256_loadu_si256((__m256i*)b);
        __m128i a8lo = _mm256_castsi256_si128(a8);
        __m128i a8hi = _mm256_extractf128_si256(a8, 1);
        __m128i b8lo = _mm256_castsi256_si128(b8);
        __m128i b8hi = _mm256_extractf128_si256(b8, 1);
        __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
        __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
        int c[8];
        _mm_storeu_si128((__m128i*)&c[0], c8lo);
        _mm_storeu_si128((__m128i*)&c[4], c8hi);
        printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
    }

The _mm256_castsi256_si128 intrinsic is free.


You will probably find that there is little or no performance difference compared with just using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SB/IB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
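
A minimal sketch of that pure-SSE version (the function and parameter names are mine):

    #include <x86intrin.h>

    /* XOR two 256-bit values as two 128-bit halves; needs only SSE2. */
    static inline void xor256_sse2(const void *a, const void *b, void *out)
    {
        __m128i alo = _mm_loadu_si128((const __m128i*)a);
        __m128i ahi = _mm_loadu_si128((const __m128i*)a + 1);
        __m128i blo = _mm_loadu_si128((const __m128i*)b);
        __m128i bhi = _mm_loadu_si128((const __m128i*)b + 1);
        _mm_storeu_si128((__m128i*)out,     _mm_xor_si128(alo, blo));
        _mm_storeu_si128((__m128i*)out + 1, _mm_xor_si128(ahi, bhi));
    }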

