I think the trivial byte-by-byte solution (see @PaulR's answer) is the best approach for small blocks, unless you can satisfy the following additional restrictions:
- The input buffer is allocated with some padding, i.e. reading a few bytes past its end does not crash.
- The output buffer is also allocated with some padding, and it does not matter if a few bytes after the desired end of the result get overwritten. If it does matter, you will need to do extra work to preserve those bytes past the end.
- The input and output ranges do not overlap (including the few padding bytes past the end), just as with memcpy.
If you can satisfy them, then you can increase the granularity of the algorithm. It is very easy to change @PaulR's answer to use uint64_t words instead of uint8_t bytes everywhere, and as a result it will run faster.
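As an illustration, here is a minimal sketch of such a 64-bit-word variant (not taken from the original answer; the name OffsetMemCpy_u64 is made up). It assumes a little-endian target and the padding and no-overlap restrictions listed above:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical 64-bit-word variant of the byte-wise loop.
       Reads and writes whole 8-byte words, so it relies on the
       input/output padding described above. */
    void OffsetMemCpy_u64(uint8_t *pDest, const uint8_t *pSrc,
                          uint8_t srcBitOffset, size_t size)
    {
        for (size_t i = 0; i < size; i += 8) {
            uint64_t lo, hi;
            memcpy(&lo, pSrc + i, 8);      /* current word                */
            memcpy(&hi, pSrc + i + 8, 8);  /* next word (needs padding)   */
            uint64_t out = srcBitOffset
                ? (lo >> srcBitOffset) | (hi << (64 - srcBitOffset))
                : lo;                      /* avoid shifting by 64        */
            memcpy(pDest + i, &out, 8);    /* may write a few extra bytes */
        }
    }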
We can use SSE to increase the word size further. Since there is no way in SSE to shift the entire register by a number of bits, we have to do two shifts of 64-bit integers and then glue the results together. The gluing is done with _mm_shuffle_epi8 from SSSE3, which allows arbitrary byte shuffling within an XMM register. For the shifting we use _mm_srl_epi64, because that is the only way to shift 64-bit integers by a non-immediate number of bits. I added the restrict keyword from C (as a macro) to the pointer arguments, because if they are aliased the algorithm will not work anyway.
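The RESTRICT macro itself is not shown in the snippet below; one possible definition (my assumption, not necessarily the original code) is:

    /* Map to the compiler-specific no-aliasing qualifier; in plain C99
       one could use the restrict keyword directly instead. */
    #ifdef _MSC_VER
        #define RESTRICT __restrict
    #else
        #define RESTRICT __restrict__
    #endif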
Here is the code:
    #include <stdint.h>
    #include <stddef.h>
    #include <immintrin.h>   /* SSE2 + SSSE3 intrinsics */

    void OffsetMemCpy_stgatilov(uint8_t *RESTRICT pDest, const uint8_t *RESTRICT pSrc,
                                const uint8_t srcBitOffset, const size_t size)
    {
        /* Bit shift count, passed to _mm_srl_epi64 via an XMM register. */
        __m128i bits = (sizeof(size_t) == 8 ? _mm_cvtsi64_si128(srcBitOffset)
                                            : _mm_cvtsi32_si128(srcBitOffset));
        const uint8_t *pEnd = pSrc + size;
        while (pSrc < pEnd) {
            __m128i input = _mm_loadu_si128((const __m128i*)pSrc);
            /* Duplicate byte 7 so that each 64-bit half contains the byte
               that supplies its missing high bits after the shift. */
            __m128i reg = _mm_shuffle_epi8(input,
                _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14));
            /* Shift both 64-bit halves right by srcBitOffset bits. */
            __m128i shifted = _mm_srl_epi64(reg, bits);
            /* Glue the 7 valid bytes of each half back together. */
            __m128i comp = _mm_shuffle_epi8(shifted,
                _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, -1, -1));
            _mm_storeu_si128((__m128i*)pDest, comp);
            pSrc += 14;
            pDest += 14;
        }
    }
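For completeness, here is a usage sketch showing how a caller might satisfy the padding restrictions listed above (the buffer names, the 16-byte padding constant, and the bit offset of 3 are my illustration, not part of the original answer):

    #include <stdlib.h>

    /* Hypothetical caller: both buffers get 16 extra bytes so that the
       16-byte unaligned loads and stores past the logical end are harmless. */
    enum { PADDING = 16 };

    int example(size_t size) {
        uint8_t *src  = malloc(size + PADDING);
        uint8_t *dest = malloc(size + PADDING);
        if (!src || !dest) { free(src); free(dest); return -1; }
        /* ... fill src ... */
        OffsetMemCpy_stgatilov(dest, src, 3, size);  /* copy starting at bit 3 */
        free(src);
        free(dest);
        return 0;
    }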
It processes 14 bytes per iteration: the first shuffle duplicates byte 7 so that each 64-bit half holds the extra source byte it needs, after the shift each half contains 7 fully valid bytes, and the second shuffle packs those 14 bytes together. Each iteration is quite simple, and there is also some code before the loop. Here is the assembly code for the whole function body, as generated by MSVC 2013 x64:
    movzx   eax, r8b
    movd    xmm3, rax
    lea     rax, QWORD PTR [rdx+r9]
    cmp     rdx, rax
    jae     SHORT $LN1@OffsetMemC
    movdqa  xmm1, XMMWORD PTR __xmm@0e0d0c0b0a0908070706050403020100
    movdqa  xmm2, XMMWORD PTR __xmm@ffff0e0d0c0b0a090806050403020100
    sub     rcx, rdx
    npad    11
    $LL2@OffsetMemC:
    movdqu  xmm0, XMMWORD PTR [rdx]
    add     rdx, 14
    pshufb  xmm0, xmm1
    psrlq   xmm0, xmm3
    pshufb  xmm0, xmm2
    movdqu  XMMWORD PTR [rcx+rdx-14], xmm0
    cmp     rdx, rax
    jb      SHORT $LL2@OffsetMemC
    $LN1@OffsetMemC:
    ret     0
IACA says the whole function has a throughput of 4.5 cycles and a latency of 13 cycles on Ivy Bridge, assuming the loop body runs once and there are no issues with caches/branch prediction/decoding. In the benchmark, however, an average of 7.5 cycles is spent per call.
Below are brief throughput benchmark results on Ivy Bridge at 3.4 GHz (see more results in the code):
    (billions of calls per second)
    size = 4:   0.132  (Paul R)   0.248  (Paul R x64)   0.45   (stgatilov)
    size = 8:   0.0782 (Paul R)   0.249  (Paul R x64)   0.45   (stgatilov)
    size = 12:  0.0559 (Paul R)   0.191  (Paul R x64)   0.453  (stgatilov)
Note, however, that real-world performance can differ significantly from these benchmark results.
The full benchmarking code and more detailed results can be found here.