I think the trivial byte-by-byte solution (see @PaulR's answer) is the best approach for small blocks, unless you can satisfy the following additional restrictions:
- The input buffer is allocated with some padding, i.e. reading a few bytes past its end does not crash.
- The output buffer is also allocated with some padding, and it does not matter if a few bytes after the desired end of the result get overwritten. If it does matter, you will need to do extra work to preserve those bytes past the end.
- The input and output ranges do not overlap (including the few padding bytes past the end), just as with memcpy.
If you can satisfy them, then you can increase the granularity of the algorithm. It is very easy to change @PaulR's answer to use uint64_t words instead of uint8_t bytes everywhere, and as a result it will run faster.
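As an illustration, here is a minimal sketch of such a 64-bit-word variant (not taken from the original answer; the name OffsetMemCpy_u64 is made up). It assumes a little-endian target and the padding and no-overlap restrictions listed above:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical 64-bit-word variant of the byte-wise loop.
       Reads and writes whole 8-byte words, so it relies on the
       input/output padding described above. */
    void OffsetMemCpy_u64(uint8_t *pDest, const uint8_t *pSrc,
                          uint8_t srcBitOffset, size_t size)
    {
        for (size_t i = 0; i < size; i += 8) {
            uint64_t lo, hi;
            memcpy(&lo, pSrc + i, 8);      /* current word                */
            memcpy(&hi, pSrc + i + 8, 8);  /* next word (needs padding)   */
            uint64_t out = srcBitOffset
                ? (lo >> srcBitOffset) | (hi << (64 - srcBitOffset))
                : lo;                      /* avoid shifting by 64        */
            memcpy(pDest + i, &out, 8);    /* may write a few extra bytes */
        }
    }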
We can use SSE to increase the word size further. Since there is no way in SSE to shift the entire register by a number of bits, we have to do two shifts of 64-bit integers and then glue the results together. The gluing is done with _mm_shuffle_epi8 from SSSE3, which allows arbitrary byte shuffling within an XMM register. For the shifting we use _mm_srl_epi64, because that is the only way to shift 64-bit integers by a non-immediate number of bits. I added the restrict keyword from C (as a macro) to the pointer arguments, because if they are aliased the algorithm will not work anyway.
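The RESTRICT macro itself is not shown in the snippet below; one possible definition (my assumption, not necessarily the original code) is:

    /* Map to the compiler-specific no-aliasing qualifier; in plain C99
       one could use the restrict keyword directly instead. */
    #ifdef _MSC_VER
        #define RESTRICT __restrict
    #else
        #define RESTRICT __restrict__
    #endif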
Here is the code:
    #include <stdint.h>
    #include <stddef.h>
    #include <immintrin.h>   /* SSE2 + SSSE3 intrinsics */

    void OffsetMemCpy_stgatilov(uint8_t *RESTRICT pDest, const uint8_t *RESTRICT pSrc,
                                const uint8_t srcBitOffset, const size_t size)
    {
        /* Bit shift count, passed to _mm_srl_epi64 via an XMM register. */
        __m128i bits = (sizeof(size_t) == 8 ? _mm_cvtsi64_si128(srcBitOffset)
                                            : _mm_cvtsi32_si128(srcBitOffset));
        const uint8_t *pEnd = pSrc + size;
        while (pSrc < pEnd) {
            __m128i input = _mm_loadu_si128((const __m128i*)pSrc);
            /* Duplicate byte 7 so that each 64-bit half contains the byte
               that supplies its missing high bits after the shift. */
            __m128i reg = _mm_shuffle_epi8(input,
                _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14));
            /* Shift both 64-bit halves right by srcBitOffset bits. */
            __m128i shifted = _mm_srl_epi64(reg, bits);
            /* Glue the 7 valid bytes of each half back together. */
            __m128i comp = _mm_shuffle_epi8(shifted,
                _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, -1, -1));
            _mm_storeu_si128((__m128i*)pDest, comp);
            pSrc += 14;
            pDest += 14;
        }
    }
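For completeness, here is a usage sketch showing how a caller might satisfy the padding restrictions listed above (the buffer names, the 16-byte padding constant, and the bit offset of 3 are my illustration, not part of the original answer):

    #include <stdlib.h>

    /* Hypothetical caller: both buffers get 16 extra bytes so that the
       16-byte unaligned loads and stores past the logical end are harmless. */
    enum { PADDING = 16 };

    int example(size_t size) {
        uint8_t *src  = malloc(size + PADDING);
        uint8_t *dest = malloc(size + PADDING);
        if (!src || !dest) { free(src); free(dest); return -1; }
        /* ... fill src ... */
        OffsetMemCpy_stgatilov(dest, src, 3, size);  /* copy starting at bit 3 */
        free(src);
        free(dest);
        return 0;
    }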
It processes 14 bytes per iteration: the first shuffle duplicates byte 7 so that each 64-bit half holds the extra source byte it needs, after the shift each half contains 7 fully valid bytes, and the second shuffle packs those 14 bytes together. Each iteration is quite simple, and there is also some code before the loop. Here is the assembly code for the whole function body, as generated by MSVC 2013 x64:
    movzx   eax, r8b
    movd    xmm3, rax
    lea     rax, QWORD PTR [rdx+r9]
    cmp     rdx, rax
    jae     SHORT $LN1@OffsetMemC
    movdqa  xmm1, XMMWORD PTR __xmm@0e0d0c0b0a0908070706050403020100
    movdqa  xmm2, XMMWORD PTR __xmm@ffff0e0d0c0b0a090806050403020100
    sub     rcx, rdx
    npad    11
    $LL2@OffsetMemC:
    movdqu  xmm0, XMMWORD PTR [rdx]
    add     rdx, 14
    pshufb  xmm0, xmm1
    psrlq   xmm0, xmm3
    pshufb  xmm0, xmm2
    movdqu  XMMWORD PTR [rcx+rdx-14], xmm0
    cmp     rdx, rax
    jb      SHORT $LL2@OffsetMemC
    $LN1@OffsetMemC:
    ret     0
IACA says the whole function has a throughput of 4.5 cycles and a latency of 13 cycles on Ivy Bridge, assuming the loop body runs once and there are no issues with caches/branch prediction/decoding. In the benchmark, however, an average of 7.5 cycles is spent per call.
Below are brief throughput benchmark results on Ivy Bridge at 3.4 GHz (see more results in the code):
    (billions of calls per second)
    size = 4:   0.132  (Paul R)   0.248  (Paul R x64)   0.45   (stgatilov)
    size = 8:   0.0782 (Paul R)   0.249  (Paul R x64)   0.45   (stgatilov)
    size = 12:  0.0559 (Paul R)   0.191  (Paul R x64)   0.453  (stgatilov)
Note, however, that real-world performance can differ significantly from these benchmark results.
The full benchmarking code and more detailed results can be found here.