From the Cortex-A8 TRM:
"You can configure the processor to connect to a 64-bit or 128-bit AXI connection for flexibility in system design."
As for NEON, maybe you are comparing apples to oranges? Instead of ldrb/strb you can use ldrd/strd or ldm/stm to get 64-bit transfers. The ARM/AXI interface may be smart enough to look ahead and group smaller transfers into larger ones, say two 32-bit transfers into one 64-bit transfer, but I would not rely on it. I only mention it in case you find that switching to ldr/str or ldrd/strd does not give you a performance boost.
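To make the comparison concrete, here is a minimal C sketch of the two access patterns. The function names copy_bytes and copy_dwords are made up for illustration, and which instructions the compiler actually emits (ldrb/strb versus ldrd/strd or ldm/stm) depends on the compiler and its flags, so check the disassembly rather than assuming.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time copy: the natural source form for ldrb/strb pairs.
 * (An optimizing compiler may still widen or vectorize this loop.) */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes; i++)
        dst[i] = src[i];
}

/* 64-bit copy: gives the compiler the chance to use ldrd/strd or
 * ldm/stm. Assumes both pointers are 8-byte aligned and the length
 * is given in 64-bit units. */
void copy_dwords(uint64_t *dst, const uint64_t *src, size_t n_dwords)
{
    assert((uintptr_t)dst % 8 == 0 && (uintptr_t)src % 8 == 0);
    for (size_t i = 0; i < n_dwords; i++)
        dst[i] = src[i];
}
```

If the byte version is already as fast as the 64-bit version, something (the compiler, the bus interface, or a cache) is probably merging the accesses for you.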
Have you isolated just a read or write loop (without the data processing) and compared bytes against words against double words? Perhaps the code that extracts the bytes from the words outweighs the bus savings.
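An isolated read test could look roughly like the sketch below. The names sum_bytes, sum_words and sum_dwords are hypothetical, and the timing itself is left to whatever timer or cycle counter you have available. The volatile qualifier is there so the compiler performs each access at the stated width instead of merging them; the running sum only keeps the loads from being discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read n single bytes; no processing beyond a running sum. */
uint64_t sum_bytes(const volatile uint8_t *p, size_t n)
{
    assert(p != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Read n 32-bit words. */
uint64_t sum_words(const volatile uint32_t *p, size_t n)
{
    assert(p != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Read n 64-bit double words. */
uint64_t sum_dwords(const volatile uint64_t *p, size_t n)
{
    assert(p != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```

Time each function over the same number of bytes (n, n/4 and n/8 elements respectively). If the three timings are about the same, the access width is not your bottleneck.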
What type of memory is this? Is it on-chip or off-chip, and how fast is it relative to the AXI (ARM) clock frequency?
Do you have the data cache enabled for this region? If so, this may be a moot point: the first byte read fills the cache line using the optimal data bus size, and subsequent reads within that cache line never reach the AXI bus, much less the target memory. Likewise, writes should only go as far as the cache and then go out to the target in a wider, bus-optimized size. It depends on how the cache and write buffer are configured.
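A rough back-of-the-envelope model of why the cache can make the access width irrelevant, written as C so you can plug in your own numbers. This is a simplification, not a description of the actual AXI behaviour: it assumes the buffer starts on a line boundary, every touched line is filled exactly once, and nothing merges uncached accesses. The function names are made up, and the line size you pass in is your assumption to verify against the TRM.

```c
#include <assert.h>
#include <stddef.h>

/* Uncached region: every load is its own bus transfer. */
size_t transfers_uncached(size_t n_bytes, size_t access_bytes)
{
    assert(access_bytes > 0);
    return (n_bytes + access_bytes - 1) / access_bytes;
}

/* Cached region: each touched line is filled once, taking
 * line_bytes / bus_bytes data beats, whatever the load width is. */
size_t beats_cached(size_t n_bytes, size_t line_bytes, size_t bus_bytes)
{
    assert(line_bytes > 0 && bus_bytes > 0);
    assert(line_bytes % bus_bytes == 0);
    size_t lines = (n_bytes + line_bytes - 1) / line_bytes;
    return lines * (line_bytes / bus_bytes);
}
```

For example, reading 1024 bytes uncached costs 1024 transfers with byte loads and 128 with 64-bit loads. With the cache on, an assumed 64-byte line and a 64-bit bus, it costs 16 line fills of 8 beats each, 128 beats in total, no matter which load instruction you use.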