I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy once per line.
for(i = 0; i < h; i++) { memcpy(d, s, w); s += sp; d += dp; }
I know that d, dp, s, sp, and w are all 32-byte aligned, so my next (still quite naive) implementation was along these lines:
for (int i = 0; i < h; i++) {
    uint8_t* dst = d;
    const uint8_t* src = s;
    int remaining = w;
    asm volatile (
        "1:                                               \n"
        "subs     %[rem], %[rem], #32                     \n"
        "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"
        "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"
        "bgt      1b                                      \n"
        : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
        :
        : "d0", "d1", "d2", "d3", "cc", "memory"
    );
    d += dp;
    s += sp;
}
This was about 150% faster than memcpy over a large number of iterations (on different images, so not benefiting from caching). I feel it should be nowhere near optimal, because I have not used preloading yet, but whenever I add preloading I only seem to make performance substantially worse. Does anyone have any insight here?
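For concreteness, the kind of preloading meant here can be sketched in portable C. This is only an illustration under stated assumptions: the helper name copy_image_prefetch, the 64-byte PREFETCH_STEP, and the choice to hint the whole next source row are all hypothetical and not taken from the code above. __builtin_prefetch is a GCC/Clang builtin that typically lowers to PLD on ARMv7. Nothing here is claimed to be faster than the NEON loop; the right prefetch distance has to be found by measurement on the actual core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache line size in bytes; this varies between ARMv7 cores. */
#define PREFETCH_STEP 64

/* Hypothetical helper: copies h rows of w bytes each, with source stride sp
   and destination stride dp, hinting the next source row before each copy. */
static void copy_image_prefetch(uint8_t *d, ptrdiff_t dp,
                                const uint8_t *s, ptrdiff_t sp,
                                size_t w, size_t h)
{
    /* Rows must not overlap within a buffer. */
    assert(sp >= (ptrdiff_t)w && dp >= (ptrdiff_t)w);

    for (size_t i = 0; i < h; i++) {
        if (i + 1 < h) {
            /* Read hint (rw = 0) with low temporal locality (0), since each
               byte of the image is touched only once. */
            for (size_t off = 0; off < w; off += PREFETCH_STEP)
                __builtin_prefetch(s + sp + off, 0, 0);
        }
        memcpy(d, s, w);   /* the per-row copy itself is unchanged */
        s += sp;
        d += dp;
    }
}
```

The prefetch is only a hint, so the copied result is the same with or without it; only the timing can change. Hinting a full row ahead may be too far or too much for wide images, in which case a fixed distance of a few cache lines inside the row copy is the usual alternative to try.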