ARM Cortex-A8: How many bytes are fetched in one memory read?

I am trying to speed up an image processing project running on an ARM Cortex-A8 processor.

I am reading 8-bit grayscale images from memory. Right now, my function fetches each pixel value individually, byte by byte.

I thought that with NEON I could improve this by fetching 128/8 = 16 bytes per load from memory and then using them in my function. But after running the modified version, I see that it actually takes MORE time than the byte-by-byte access. I think the NEON fetch has become a bottleneck, taking more time than the computation itself.

How wide is the ARM Cortex-A8 data bus? How many bytes are fetched from memory in a single memory access?

2 answers

From the Cortex-A8 TRM:

"You can configure the processor to connect to a 64-bit or 128-bit AXI connection for flexibility in system design."

NEON aside, maybe you are comparing apples to oranges? Instead of ldrb / strb you can use ldrd / strd or ldm / stm to get 64-bit transfers. The ARM / AXI may be smart enough to look ahead and group smaller transfers into larger ones, say, two 32-bit transfers into one 64-bit transfer. But I would not rely on it. I only mention this in case you find that switching to ldr / str or ldrd / strd does not give you a performance boost.

Have you isolated just the read or write loop (without the data processing) and compared byte accesses against word or doubleword accesses? Maybe the code for extracting bytes from words outweighs the bus savings.
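A minimal sketch of that kind of isolated comparison, in portable C: the same pixels are summed once with one byte load per pixel and once with one 64-bit load per 8 pixels followed by shift-based extraction. The function names sum_bytes and sum_dwords are hypothetical, and whether the compiler actually emits ldrb or ldrd / ldm for these loops is an assumption you would need to confirm in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one byte load per pixel. */
uint32_t sum_bytes(const uint8_t *px, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += px[i];
    return sum;
}

/* One 64-bit load per 8 pixels, then extract each byte with shifts.
 * The extraction work is exactly the overhead that can eat the
 * savings from the wider load. */
uint32_t sum_dwords(const uint8_t *px, size_t n)
{
    assert(n % 8 == 0);                 /* sketch only: no tail handling */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t w;
        memcpy(&w, px + i, sizeof w);   /* alignment-safe 64-bit load */
        for (int b = 0; b < 8; b++) {
            sum += (uint32_t)(w & 0xFF);
            w >>= 8;
        }
    }
    return sum;
}
```

Timing each function separately over the same image buffer (for example with clock() around many repetitions) would show whether the wider loads help before any real processing is added.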

What type of memory is this? Is it on-chip or off-chip, and how fast is it relative to the clock frequency of the AXI (ARM) bus?

Do you have the data cache enabled for this region? If so, then this may all be a moot point: the first byte read will fill the cache line using the optimal data bus width, and subsequent reads within that cache line will not reach the AXI bus, much less the target memory. Similarly, writes should only reach the cache and then go out to the target in a wider, bus-optimized size. It depends on how the cache / write buffer is configured.
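To put rough numbers on the cache point, here is a small sketch that counts how many cache lines a sequential read of n bytes touches. It assumes a cold cache, no evictions, and no prefetching; the helper name line_fills is hypothetical, and the line size is a parameter (64 bytes is, as far as I know, the Cortex-A8 L1 line length, but check the TRM for your configuration).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of distinct cache lines covered by the byte range
 * [addr, addr + n). Under the assumptions above, this is also the
 * number of line fills that reach the bus, no matter whether the
 * core reads the range with byte, word, or doubleword loads. */
size_t line_fills(uintptr_t addr, size_t n, size_t line)
{
    assert(line != 0 && n != 0);
    uintptr_t first = addr / line;
    uintptr_t last  = (addr + n - 1) / line;
    return (size_t)(last - first + 1);
}
```

For a 640-pixel row starting on a line boundary, line_fills(0, 640, 64) gives 10, so only 10 of the 640 byte reads would miss in this simplified model; the rest are served from the cache.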


You may be running into pipeline stalls. If you read through NEON, there is some latency before you can use that data in the ARM core.
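One way to limit that cost is to keep the whole computation in NEON registers and move a result back to the ARM core only once, at the end. The sketch below sums pixels that way; it is an illustration under assumptions, not the asker's actual function. The name sum_pixels is hypothetical, and a scalar fallback is included so the code also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#endif

uint32_t sum_pixels(const uint8_t *px, size_t n)
{
    assert(n % 16 == 0);                    /* sketch only: no tail handling */
#ifdef HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t i = 0; i < n; i += 16) {
        uint8x16_t v   = vld1q_u8(px + i);  /* 16 pixels per load */
        uint16x8_t s16 = vpaddlq_u8(v);     /* pairwise add, widen to 16 bits */
        acc = vpadalq_u16(acc, s16);        /* accumulate into 32-bit lanes */
    }
    /* Reduce inside NEON, then do a single NEON-to-ARM transfer. */
    uint32x2_t s = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    s = vpadd_u32(s, s);
    return vget_lane_u32(s, 0);
#else
    /* Scalar fallback for targets without NEON. */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += px[i];
    return sum;
#endif
}
```

If instead every loaded vector is immediately moved to ARM registers for per-pixel scalar work, each of those transfers can stall the core, which would be consistent with the NEON version running slower than plain byte loads.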
