Proper use of the ARM PLD instruction (ARM11)

The ARM ARM doesn't actually describe the intended way to use this instruction; I only found out from its use elsewhere that it takes an address as a hint about where a future read will occur.

My question is: with a 256-byte copy loop of unrolled ldm/stm instructions, say r4-r11 eight times, would it be better to PLD each upcoming cache line ahead of the copy, between each pair of instructions, or not at all, given that the memcpy in question is both reading and writing memory? I'm fairly sure my cache line size is 64 bytes, but it may be 32 — I'm waiting for confirmation before writing the final code here.

3 answers

From the Cortex-A Series Programmer's Guide, Chapter 17.4 (NB: some details may differ for ARM11):

The best performance for memcpy() is achieved by doing whole-cache-line reads with LDM and then writing those values with whole-cache-line STMs. Aligning stores is more important than aligning loads. Use the PLD instruction where possible. There are four PLD slots in the load/store unit. A PLD instruction takes priority over the automatic prefetcher and costs nothing in terms of integer pipeline use. The exact timing of PLD instructions for best memcpy() performance may differ slightly between systems, but a PLD of the address three cache lines ahead of the line currently being copied is a useful starting point.
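The guide's recipe can be sketched in C (my sketch, not from the guide): `__builtin_prefetch` is GCC/Clang's portable spelling of PLD, the 64-byte line size and the `memcpy_pld` name are assumptions, and a `memcpy` per line stands in for the LDM/STM pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed line size; some ARM11 parts use 32 */

/* Copy in cache-line-sized blocks, prefetching three lines ahead of the
 * line currently being read, per the guide's starting point.
 * __builtin_prefetch compiles to PLD on ARM with GCC/Clang; elsewhere it
 * is a harmless hint.  Assumes n is a multiple of CACHE_LINE. */
static void memcpy_pld(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i += CACHE_LINE) {
        __builtin_prefetch(s + i + 3 * CACHE_LINE, 0 /* for read */);
        memcpy(d + i, s + i, CACHE_LINE);  /* stands in for LDM/STM */
    }
}
```

Prefetching past the end of the source is fine: PLD of an unmapped address is architecturally a no-op, so no tail check is needed for the hint itself.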


An example of a fairly generic copy loop that uses cache-line-sized LDM/STM blocks and PLD where available can be found in the Linux kernel, in arch/arm/lib/copy_page.S. It implements what Igor mentions above regarding preloads and illustrates the blocking.

Note that on ARMv7 (where the cache line is usually 64 bytes) it is not possible to cover a whole cache line with a single LDM operand list — there are only 14 registers you could use, since SP and PC cannot be touched. Therefore you may need two or four LDM/STM pairs per line.
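The arithmetic behind that remark, as a sketch (the constants and the helper name are mine): a 64-byte line holds 16 words, so with 8-register transfers you need two LDM/STM pairs per line.

```c
#include <assert.h>

/* A 64-byte cache line is 16 32-bit words, but an LDM register list in a
 * copy loop can realistically name at most 13 general-purpose registers
 * (r0-r12; sp and pc are off limits, lr is usually needed).  So one LDM
 * cannot cover a 64-byte line. */
enum { LINE_BYTES = 64, WORD_BYTES = 4 };

static int ldm_pairs_per_line(int regs_per_ldm)
{
    int words_per_line = LINE_BYTES / WORD_BYTES;  /* 16 */
    return words_per_line / regs_per_ldm;          /* pairs needed */
}
```

With 8 registers per transfer (as in the loop below in the thread) this gives two pairs per 64-byte line; with 4 registers, four pairs.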


To really get the fastest possible ARM asm code, you will need to test different approaches on your system. As for an ldm/stm loop, this one seems to work best for me:

        // Use a non-conflicting register (r12) so the pld need not wait on r6
        pld     [r6, #0]
        add     r12, r6, #32
    1:  ldm     r6!,  {r0, r1, r2, r3, r4, r5, r8, r9}
        pld     [r12, #32]
        stm     r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
        subs    r11, r11, #16
        ldm     r6!,  {r0, r1, r2, r3, r4, r5, r8, r9}
        pld     [r12, #64]
        stm     r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
        add     r12, r6, #32
        bne     1b

The above block assumes that you already have r6, r10, and r11 set up, and that the count in r11 is in words, not bytes. I tested this on a Cortex-A9 (iPad 2), and it seems to work quite well on that CPU. But beware: on the Cortex-A8 (iPhone 4), a NEON loop is apparently faster than ldm/stm, at least for large copies.
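To make the register roles explicit, here is a C rendering of what the loop above does (my translation, not from the answer): `src` plays r6, `dst` plays r10, and `nwords` plays r11, decremented by 16 words (64 bytes) per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C equivalent of the ldm/stm loop: copy 16 32-bit words (two 8-register
 * ldm/stm pairs, one 64-byte line) per iteration, prefetching ahead of
 * the source pointer.  Assumes nwords is a multiple of 16. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords != 0) {
        __builtin_prefetch(src + 16, 0);  /* the pld [r12, #32/#64] role */
        for (int i = 0; i < 16; i++)      /* the two ldm/stm pairs */
            dst[i] = src[i];
        dst += 16;
        src += 16;
        nwords -= 16;                     /* subs r11, r11, #16 */
    }
}
```

This is only for reading the asm, of course — a compiler will not necessarily emit the same instruction schedule.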

