To really get the "fastest" ARM asm code possible, you will need to test different approaches on your system. As for the ldm / stm loop, this one seems to work better for me:
// Use non-conflicting register r12 to avoid waiting for r6 in pld
pld [r6,
The block above assumes that r6, r10, and r11 are already set up, and that the loop count in r11 is in words, not bytes. I tested this on a Cortex-A9 (iPad2), and it seems to work pretty well on that processor. Be careful, though: on the Cortex-A8 (iPhone4), a NEON loop is apparently faster than ldm / stm, at least for large copies.