Why is memmove faster than memcpy?

I am exploring performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make room for the inserted value.

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it moves overlapping regions, and therefore might have to be implemented in a tight loop instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there is a performance difference between memcpy and memmove, expecting memcpy to win hands down.

I ran my benchmark on two machines (Core i5, Core i7) and saw that memmove is actually faster than memcpy; on the older Core i7 it is even nearly twice as fast! Now I am looking for an explanation.

Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination overlap. Various "distances" between source and destination are tried. Each test is run 10 times, and the average time is printed.

https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
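For reference, the benchmark is shaped roughly like this (a condensed sketch, not the gist verbatim; BUFFERSIZE, ROUNDS and the helper functions are my names):

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <chrono>

    // Condensed sketch of the benchmark; the real code is in the gist above.
    static const size_t BUFFERSIZE = 100 * 1024 * 1024;  // ~100 MB
    static const int ROUNDS = 10;

    static double run_memcpy(char *dst, const char *src) {
      auto start = std::chrono::high_resolution_clock::now();
      memcpy(dst, src, BUFFERSIZE);
      std::chrono::duration<double> d =
          std::chrono::high_resolution_clock::now() - start;
      return d.count();
    }

    static double run_memmove(char *buf, size_t gap) {
      // Overlapping move: shift the block "to the right" by |gap| bytes.
      auto start = std::chrono::high_resolution_clock::now();
      memmove(buf + gap, buf, BUFFERSIZE - gap);
      std::chrono::duration<double> d =
          std::chrono::high_resolution_clock::now() - start;
      return d.count();
    }

    int main() {
      char *b1 = static_cast<char *>(malloc(BUFFERSIZE));
      char *b2 = static_cast<char *>(malloc(BUFFERSIZE));

      double t = 0;
      for (int i = 0; i < ROUNDS; i++)
        t += run_memcpy(b2, b1);
      printf("memcpy        %g\n", t / ROUNDS);

      for (size_t gap = 2; gap <= 128; gap *= 2) {
        t = 0;
        for (int i = 0; i < ROUNDS; i++)
          t += run_memmove(b1, gap);
        printf("memmove (%03zu) %g\n", gap, t / ROUNDS);
      }

      free(b1);
      free(b2);
      return 0;
    }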

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number in brackets is the distance (gap size) between source and destination:

    memcpy        0.0140074
    memmove (002) 0.0106168
    memmove (004) 0.01065
    memmove (008) 0.0107917
    memmove (016) 0.0107319
    memmove (032) 0.0106724
    memmove (064) 0.0106821
    memmove (128) 0.0110633

Memmove is implemented as SSE-optimized assembler code, copying from back to front. It uses hardware prefetch to load the data into the cache, copies 128 bytes at a time into XMM registers, and then stores them at the destination.

(memcpy-ssse3-back.S, lines 1650 ff.)

    L(gobble_ll_loop):
        prefetchnta -0x1c0(%rsi)    # prefetch source and destination a few
        prefetchnta -0x280(%rsi)    # cache lines ahead (downwards, since
        prefetchnta -0x1c0(%rdi)    # the copy runs back to front)
        prefetchnta -0x280(%rdi)
        sub    $0x80, %rdx          # 128 bytes fewer left to copy
        movdqu -0x10(%rsi), %xmm1   # load 8 x 16 bytes from the source...
        movdqu -0x20(%rsi), %xmm2
        movdqu -0x30(%rsi), %xmm3
        movdqu -0x40(%rsi), %xmm4
        movdqu -0x50(%rsi), %xmm5
        movdqu -0x60(%rsi), %xmm6
        movdqu -0x70(%rsi), %xmm7
        movdqu -0x80(%rsi), %xmm8
        movdqa %xmm1, -0x10(%rdi)   # ...and store them to the destination
        movdqa %xmm2, -0x20(%rdi)
        movdqa %xmm3, -0x30(%rdi)
        movdqa %xmm4, -0x40(%rdi)
        movdqa %xmm5, -0x50(%rdi)
        movdqa %xmm6, -0x60(%rdi)
        movdqa %xmm7, -0x70(%rdi)
        movdqa %xmm8, -0x80(%rdi)
        lea    -0x80(%rsi), %rsi    # step both pointers 128 bytes back
        lea    -0x80(%rdi), %rdi
        jae    L(gobble_ll_loop)    # loop while at least 128 bytes remain

Why is memmove faster than memcpy? I would expect memcpy to copy whole memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.

PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.

UPDATE 1

I ran some variations of the test, based on the different answers.

  • When memcpy is run twice, the second run is faster than the first one.
  • When "touching" the destination buffer of memcpy first (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster (see the sketch after this list).
  • memcpy is still a little bit slower than memmove.
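The "touching" variation from the second bullet looks roughly like this (a sketch reusing b1 and BUFFERSIZE from the benchmark sketch above):

    // Variation: commit the destination's physical pages before timing.
    char *b2 = static_cast<char *>(malloc(BUFFERSIZE));
    memset(b2, 0, BUFFERSIZE);  // touch every page once; the soft page
                                // faults happen here, outside the timing

    // The first timed memcpy now runs at steady-state speed, because the
    // kernel no longer has to fault in and zero-fill b2's pages during it.
    memcpy(b2, b1, BUFFERSIZE);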

Here are the results:

    memcpy        0.0118526
    memcpy        0.0119105
    memmove (002) 0.0108151
    memmove (004) 0.0107122
    memmove (008) 0.0107262
    memmove (016) 0.0108555
    memmove (032) 0.0107171
    memmove (064) 0.0106437
    memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system commits the physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this then please add an answer!). In addition, as @Mats Petersson said, memmove is more cache friendly than memcpy.
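One way to "prove" it that comes to mind (a sketch; it assumes the effect shows up as soft page faults, which getrusage on Linux reports as ru_minflt):

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <sys/resource.h>

    static long minor_faults() {
      struct rusage ru;
      getrusage(RUSAGE_SELF, &ru);
      return ru.ru_minflt;  // soft page faults of this process so far
    }

    int main() {
      const size_t BUFFERSIZE = 100 * 1024 * 1024;
      char *b1 = static_cast<char *>(malloc(BUFFERSIZE));
      char *b2 = static_cast<char *>(malloc(BUFFERSIZE));

      long before = minor_faults();
      memcpy(b2, b1, BUFFERSIZE);  // first run: faults in both buffers
      long first = minor_faults();
      memcpy(b2, b1, BUFFERSIZE);  // second run: pages already committed
      long second = minor_faults();

      // If the commit-on-first-touch theory holds, the first delta is huge
      // (about 2 x 25600 pages at 4 KB) and the second is close to zero.
      printf("faults during 1st memcpy: %ld\n", first - before);
      printf("faults during 2nd memcpy: %ld\n", second - first);

      free(b1);
      free(b2);
      return 0;
    }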

Thanks for the great answers and comments!

+76
c++ performance c memory
Feb 20 '15 at 7:45
4 answers

Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that accounts for the performance difference: if you copy to the same place, you will see that memcpy ends up possibly a shade faster, e.g. on ideone.com:

    memmove (002) 0.0610362
    memmove (004) 0.0554264
    memmove (008) 0.0575859
    memmove (016) 0.057326
    memmove (032) 0.0583542
    memmove (064) 0.0561934
    memmove (128) 0.0549391
    memcpy        0.0537919

Hardly anything in it though - no evidence that writing back to an already faulted-in memory page has much impact, and we are certainly not seeing a halving of the time... but it does show that there is nothing wrong with making memcpy unnecessarily slower when comparing apples with apples.
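For completeness, an apples-to-apples setup can also be forced directly (a sketch; overlapping memcpy would be undefined behaviour, so the buffers stay disjoint and are simply warmed up front):

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <chrono>

    template <typename F>
    static double time_it(F f) {
      auto t0 = std::chrono::high_resolution_clock::now();
      f();
      std::chrono::duration<double> d =
          std::chrono::high_resolution_clock::now() - t0;
      return d.count();
    }

    int main() {
      const size_t BUFFERSIZE = 100 * 1024 * 1024;
      char *b1 = static_cast<char *>(malloc(BUFFERSIZE));
      char *b2 = static_cast<char *>(malloc(BUFFERSIZE));

      // Warm both buffers so neither call pays for first-touch page faults.
      memset(b1, 1, BUFFERSIZE);
      memset(b2, 1, BUFFERSIZE);

      // Both calls now see the same page and cache state.
      printf("memcpy  %g\n", time_it([&] { memcpy(b2, b1, BUFFERSIZE); }));
      printf("memmove %g\n", time_it([&] { memmove(b2, b1, BUFFERSIZE); }));

      free(b1);
      free(b2);
      return 0;
    }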

+51
Feb 20 '15 at 7:56

When you use memcpy, the writes need to go into the cache. When you use memmove and you copy a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes away (> 4 × the cache size), and I suspect (but can't be bothered to test) that you will get similar results.
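That experiment would look something like this (a sketch; the 8 MB gap is my stand-in for "> 4 × the cache size"):

    #include <cstdlib>
    #include <cstring>

    int main() {
      const size_t BUFFERSIZE = 100 * 1024 * 1024;
      const size_t GAP = 8 * 1024 * 1024;  // well past 4 x a typical cache

      char *buf = static_cast<char *>(malloc(BUFFERSIZE + GAP));
      memset(buf, 1, BUFFERSIZE + GAP);  // commit all pages up front

      // With the destination this far away, the lines just read from the
      // source no longer double as the destination's cached contents, so
      // the prediction is memcpy-like timings.
      memmove(buf + GAP, buf, BUFFERSIZE);

      free(buf);
      return 0;
    }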

I guarantee that ALL of it is about cache maintenance when you do large memory operations.

+17
Feb 20 '15 at 8:53

Historically, memmove and memcpy were the same function. They worked the same way and had the same implementation. It was then realized that memcpy need not be (and frequently was not) defined to handle overlapping regions in any particular way.

The end result is that memmove was defined to handle overlapping regions in a particular way even if that impacts performance, while memcpy is supposed to use the best available algorithm for non-overlapping regions. The implementations are usually almost identical.
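The asker's insertion shift is exactly the case that overlap guarantee exists for (a minimal sketch):

    #include <cstdio>
    #include <cstring>

    int main() {
      int a[8] = {1, 2, 3, 5, 8, 13, 0, 0};  // 6 live elements, room to grow
      int live = 6, pos = 3, value = 4;

      // Shift a[pos..live) one slot to the right to make room. Source and
      // destination overlap, so memmove is required; memcpy is undefined here.
      memmove(&a[pos + 1], &a[pos], (live - pos) * sizeof(int));
      a[pos] = value;
      live++;

      for (int i = 0; i < live; i++)
        printf("%d ", a[i]);  // prints: 1 2 3 4 5 8 13
      printf("\n");
      return 0;
    }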

The problem you have run into is that there are so many variations of x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result under one set of circumstances, something as simple as a different "stride" in the memory layout can cause vastly different cache performance.

You can either benchmark what you are actually doing, or ignore the problem and rely on the benchmarks done for the C library.

Edit: Oh, and one last thing; shifting large amounts of memory around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay.)

Edit2: To summarize my expansion in the comments: the microbenchmark is the problem here; it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy, the end results will not depend on which memory-shifting function you use, UNLESS the regions overlap.

+13
Feb 20 '15 at 8:06

"memcpy is more efficient than memmove." In your case, you are most likely not doing the exact same thing while you run the two functions.

In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions overlap.

Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g Dr. Jerry Cain (Stanford Intro Systems Lecture 7), time 36:00

0
Dec 10 '16 at 2:42


