I am exploring performance hotspots in an application that spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays and uses memmove to shift the data "to the right" in order to make room for the inserted value.
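For context, the insertion pattern looks roughly like this (a minimal sketch; the function and variable names are mine, not the actual application code):

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Insert |value| into the sorted range data[0..size), shifting the tail one
    // slot to the right with memmove to make room for the new element.
    void sorted_insert(uint32_t *data, size_t &size, size_t capacity, uint32_t value) {
        assert(size < capacity);
        // binary search for the insertion position
        size_t lo = 0, hi = size;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (data[mid] < value)
                lo = mid + 1;
            else
                hi = mid;
        }
        // shift data[lo..size) one element to the right, then write the new value
        memmove(&data[lo + 1], &data[lo], (size - lo) * sizeof(uint32_t));
        data[lo] = value;
        ++size;
    }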
My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it has to handle overlapping regions, and therefore has to be implemented as a tight loop instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there is a performance difference between memcpy and memmove, expecting memcpy to win hands down.
I ran my benchmark on two machines (Core i5, Core i7) and saw that memmove is actually faster than memcpy; on the older Core i7 it is even nearly twice as fast! Now I am looking for an explanation.
Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination overlap. Various "distances" between source and destination are tried. Each test is run 10 times, and the average time is printed.
https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
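In condensed form the benchmark does roughly the following (a sketch of the idea, not the exact gist code; BUFFERSIZE and the timing details are placeholders):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <initializer_list>

    static const size_t BUFFERSIZE = 100 * 1024 * 1024;  // ~100 MB

    int main() {
        char *b1 = static_cast<char*>(malloc(BUFFERSIZE));
        char *b2 = static_cast<char*>(malloc(BUFFERSIZE));
        memset(b1, 1, BUFFERSIZE);  // fill the source buffer; b2 is left untouched

        // memcpy: copy into a separate, non-overlapping buffer
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 10; i++)
            memcpy(b2, b1, BUFFERSIZE);
        auto t1 = std::chrono::steady_clock::now();
        printf("memcpy        %f\n", std::chrono::duration<double>(t1 - t0).count() / 10);

        // memmove: source and destination overlap, shifted by a small "distance"
        for (size_t dist : {2, 4, 8, 16, 32, 64, 128}) {
            auto t2 = std::chrono::steady_clock::now();
            for (int i = 0; i < 10; i++)
                memmove(b1 + dist, b1, BUFFERSIZE - dist);
            auto t3 = std::chrono::steady_clock::now();
            printf("memmove (%03zu) %f\n", dist,
                   std::chrono::duration<double>(t3 - t2).count() / 10);
        }
        free(b1);
        free(b2);
    }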
Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number in brackets is the distance (gap size) between source and destination:
    memcpy        0.0140074
    memmove (002) 0.0106168
    memmove (004) 0.01065
    memmove (008) 0.0107917
    memmove (016) 0.0107319
    memmove (032) 0.0106724
    memmove (064) 0.0106821
    memmove (128) 0.0110633
memmove is implemented as SSE-optimized assembler code, copying from back to front. It uses hardware prefetch hints to load the data into the cache, and copies 128 bytes at a time into XMM registers before storing them at the destination.
(memcpy-ssse3-back.S, lines 1650 ff.)
    L(gobble_ll_loop):
        prefetchnta -0x1c0(%rsi)
        prefetchnta -0x280(%rsi)
        prefetchnta -0x1c0(%rdi)
        prefetchnta -0x280(%rdi)
        sub $0x80, %rdx
        movdqu -0x10(%rsi), %xmm1
        movdqu -0x20(%rsi), %xmm2
        movdqu -0x30(%rsi), %xmm3
        movdqu -0x40(%rsi), %xmm4
        movdqu -0x50(%rsi), %xmm5
        movdqu -0x60(%rsi), %xmm6
        movdqu -0x70(%rsi), %xmm7
        movdqu -0x80(%rsi), %xmm8
        movdqa %xmm1, -0x10(%rdi)
        movdqa %xmm2, -0x20(%rdi)
        movdqa %xmm3, -0x30(%rdi)
        movdqa %xmm4, -0x40(%rdi)
        movdqa %xmm5, -0x50(%rdi)
        movdqa %xmm6, -0x60(%rdi)
        movdqa %xmm7, -0x70(%rdi)
        movdqa %xmm8, -0x80(%rdi)
        lea -0x80(%rsi), %rsi
        lea -0x80(%rdi), %rdi
        jae L(gobble_ll_loop)
Why is memmove faster than memcpy? I would expect memcpy to copy pages of memory, which should be much faster than a loop. In the worst case scenario, I would expect memcpy to be as fast as memmove.
PS: I know that I cannot simply replace memmove with memcpy in my code. I know the sample code mixes C and C++. This question is really just for academic purposes.
UPDATE 1
I did some test variations based on different answers.
- When running memcpy twice, the second run is faster than the first one.
- When "touching" the destination buffer of memcpy (memset(b2, 0, BUFFERSIZE...)), the first memcpy run is also faster (see the sketch after this list).
- memcpy is still a little slower than memmove.
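For reference, the "touch" variation is essentially the following fragment, reusing the b1/b2/BUFFERSIZE names from the benchmark sketch above (not the exact gist code):

    // Touch the destination buffer before the timed runs, so the OS has already
    // committed physical pages for b2.
    memset(b2, 0, BUFFERSIZE);

    // Run the memcpy test twice; without the memset above, the first run is
    // noticeably slower than the second because it pays the page-fault cost.
    for (int run = 0; run < 2; run++) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 10; i++)
            memcpy(b2, b1, BUFFERSIZE);
        auto t1 = std::chrono::steady_clock::now();
        printf("memcpy        %f\n", std::chrono::duration<double>(t1 - t0).count() / 10);
    }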
Here are the results:
    memcpy        0.0118526
    memcpy        0.0119105
    memmove (002) 0.0108151
    memmove (004) 0.0107122
    memmove (008) 0.0107262
    memmove (016) 0.0108555
    memmove (032) 0.0107171
    memmove (064) 0.0106437
    memmove (128) 0.0106648
My conclusion: based on a comment from @Oliver Charlesworth, the operating system commits the physical memory as soon as the memcpy destination buffer is accessed for the very first time (if anyone knows how to "prove" this, please add an answer!). Also, as @Mats Petersson said, memmove is more cache friendly than memcpy.
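One way to make the page-commit effect visible (my own suggestion, not taken from the answers) is to count minor page faults around the first and second write to a freshly allocated buffer, e.g. with getrusage(2):

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <sys/resource.h>

    static long minor_faults() {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main() {
        const size_t N = 100 * 1024 * 1024;
        char *buf = static_cast<char*>(malloc(N));  // virtual memory only, pages not committed yet

        long before = minor_faults();
        memset(buf, 1, N);                  // first write: the kernel commits one page per fault
        long after_first = minor_faults();
        memset(buf, 2, N);                  // second write: pages are already committed
        long after_second = minor_faults();

        printf("faults during first  write: %ld\n", after_first - before);
        printf("faults during second write: %ld\n", after_second - after_first);
        free(buf);
    }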
Thanks for the great answers and comments!