The more important question is why you want to do this manually at all. Do you have some ancient compiler that you think you can outwit? The good old days when you had to write SIMD instructions by hand are over. Today, in 99% of cases, the compiler will do the job for you, and most likely it will do it much better. Also, don't forget that new architectures come out all the time, each with ever more extended instruction sets. So ask yourself: do you want to maintain N copies of your implementation, one for each platform? Do you want to constantly test each of them to make sure they are still worth keeping? Most likely the answer is no.
The only thing you need to do is write the simplest code possible and let the compiler do the rest. For example, here is how I would write your function:
    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int len) {
        unsigned int i;
        for (i = 0; i < len; ++i)
            r2[i] = r1[i] ^ r2[i];
    }
A bit simpler, isn't it? And guess what: the compiler generates code that performs a 128-bit XOR using MOVDQU and PXOR. The critical path looks like this:
    4008a0: f3 0f 6f 04 06     movdqu xmm0,XMMWORD PTR [rsi+rax*1]
    4008a5: 41 83 c0 01        add    r8d,0x1
    4008a9: f3 0f 6f 0c 07     movdqu xmm1,XMMWORD PTR [rdi+rax*1]
    4008ae: 66 0f ef c1        pxor   xmm0,xmm1
    4008b2: f3 0f 7f 04 06     movdqu XMMWORD PTR [rsi+rax*1],xmm0
    4008b7: 48 83 c0 10        add    rax,0x10
    4008bb: 45 39 c1           cmp    r9d,r8d
    4008be: 77 e0              ja     4008a0 <region_xor_w64+0x40>
As @Mysticial pointed out, the code above uses instructions that support unaligned access, and those are slower. If, however, the programmer can correctly assume aligned access, then you can tell the compiler about it. For instance:
    void region_xor_w64(unsigned char * restrict r1, unsigned char * restrict r2, unsigned int len) {
        unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
        unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);
        unsigned int i;
        for (i = 0; i < len; ++i)
            p2[i] = p1[i] ^ p2[i];
    }
The compiler generates the following for the above C code (note the movdqa):
    400880: 66 0f 6f 04 06     movdqa xmm0,XMMWORD PTR [rsi+rax*1]
    400885: 41 83 c0 01        add    r8d,0x1
    400889: 66 0f ef 04 07     pxor   xmm0,XMMWORD PTR [rdi+rax*1]
    40088e: 66 0f 7f 04 06     movdqa XMMWORD PTR [rsi+rax*1],xmm0
    400893: 48 83 c0 10        add    rax,0x10
    400897: 45 39 c1           cmp    r9d,r8d
    40089a: 77 e4              ja     400880 <region_xor_w64+0x20>
Tomorrow, when I buy a laptop with a Haswell processor, the compiler will generate, from that very same code, code that uses 256-bit instructions instead of 128-bit ones, giving me twice the vector performance. It will do so even if I have no idea that Haswell is capable of it. You, on the other hand, would have to not only find out about the new capability, but also write yet another version of your code and spend some time testing it.
By the way, it looks like you also have a bug in your implementation: the code can skip up to 3 remaining bytes in the data vector.
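For reference, the usual way to avoid that kind of bug is a scalar tail loop after the word-sized loop. A generic sketch of that shape (this is not the OP's original code, just the common pattern, using 32-bit words):

```c
#include <stdint.h>
#include <string.h>

/* XOR r1 into r2 a word at a time, then finish the leftover bytes
   one by one so nothing is skipped when len % 4 != 0. */
void region_xor_tail_safe(unsigned char *r1, unsigned char *r2, size_t len) {
    size_t i = 0;
    size_t whole = len - (len % sizeof(uint32_t));

    for (; i < whole; i += sizeof(uint32_t)) {
        uint32_t a, b;
        memcpy(&a, r1 + i, sizeof a);   /* memcpy sidesteps alignment/aliasing UB */
        memcpy(&b, r2 + i, sizeof b);
        b ^= a;
        memcpy(r2 + i, &b, sizeof b);
    }
    for (; i < len; ++i)                /* scalar tail: the up-to-3 leftover bytes */
        r2[i] = r1[i] ^ r2[i];
}
```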
Anyway, I would recommend you trust your compiler and learn how to inspect what it generates (i.e. get to know objdump). The next option would be to change the compiler. Only after that should you start thinking about writing vector instructions by hand. Otherwise, you're gonna have a bad time!
Hope this helps. Good luck!