Fast ARM NEON memcpy

I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy once per row:

    for (i = 0; i < h; i++) {
        memcpy(d, s, w);
        s += sp;
        d += dp;
    }

I know that d, dp, s, sp and w are all 32-byte aligned, so my next (still fairly naive) implementation was along these lines:

    for (int i = 0; i < h; i++) {
        uint8_t* dst = d;
        const uint8_t* src = s;
        int remaining = w;                                    // assumes w is a nonzero multiple of 32
        asm volatile (
            "1:                                              \n"
            "    subs    %[rem], %[rem], #32                 \n"  // 32 bytes per iteration
            "    vld1.u8 {d0, d1, d2, d3}, [%[src],:256]!    \n"  // load 32 bytes (256-bit aligned)
            "    vst1.u8 {d0, d1, d2, d3}, [%[dst],:256]!    \n"  // store 32 bytes (256-bit aligned)
            "    bgt     1b                                  \n"  // loop while bytes remain
            : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
            :
            : "d0", "d1", "d2", "d3", "cc", "memory"
        );
        d += dp;
        s += sp;
    }

This was about 150% faster than memcpy over a large number of iterations (on different images each time, so the cache isn't helping). I feel it should still be nowhere near optimal, because I haven't used preloading yet, but whenever I try to add it, I only seem to make performance worse. Does anyone have experience with this?
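
For illustration, the kind of preloading variant I mean looks roughly like this (a rough sketch, not the exact code I benchmarked; the pld offset of 192 bytes is an arbitrary guess rather than a tuned value):

    for (int i = 0; i < h; i++) {
        uint8_t* dst = d;
        const uint8_t* src = s;
        int remaining = w;
        asm volatile (
            "1:                                              \n"
            "    pld     [%[src], #192]                      \n"  // hint: prefetch a few cache lines ahead (distance is a guess)
            "    subs    %[rem], %[rem], #32                 \n"
            "    vld1.u8 {d0, d1, d2, d3}, [%[src],:256]!    \n"
            "    vst1.u8 {d0, d1, d2, d3}, [%[dst],:256]!    \n"
            "    bgt     1b                                  \n"
            : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
            :
            : "d0", "d1", "d2", "d3", "cc", "memory"
        );
        d += dp;
        s += sp;
    }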

1 answer

ARM has a great technical note.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Your performance will certainly vary depending on the microarchitecture; the ARM note targets the Cortex-A8, but I think it will still give you a decent idea. The summary in the note is also a great discussion of the various pros and cons that go beyond the raw numbers, for example which methods use the fewest registers, and so on.
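
As a rough illustration of the kind of non-NEON alternative such a comparison typically covers (this sketch is mine, not code from the note), here is a plain LDM/STM version of the inner loop. It moves 16 bytes per iteration through four integer registers; widening the register list moves more bytes per pass but ties up more registers, which is exactly the register-usage trade-off mentioned above. It assumes w is a nonzero multiple of 16:

    for (int i = 0; i < h; i++) {
        uint8_t* dst = d;
        const uint8_t* src = s;
        int remaining = w;
        asm volatile (
            "1:                                      \n"
            "    subs    %[rem], %[rem], #16         \n"  // 16 bytes per iteration
            "    ldmia   %[src]!, {r4, r5, r6, r8}   \n"  // load four words
            "    stmia   %[dst]!, {r4, r5, r6, r8}   \n"  // store four words
            "    bgt     1b                          \n"
            : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
            :
            : "r4", "r5", "r6", "r8", "cc", "memory"
        );
        d += dp;
        s += sp;
    }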

And yes, as another commenter mentioned, prefetching is very hard to get right and will behave differently on different microarchitectures, depending on how big the caches are, how long each cache line is, and a bunch of other details of the cache design. If you are not careful, you can end up evicting lines you still need. I would recommend avoiding it in portable code.
