Optimizing the use of the ARM cache for different arrays

Question

I want to port a small piece of code to an ARM Cortex-A8 processor. Both the L1 and L2 caches are very limited. There are three arrays in my program. Two of them are accessed sequentially (array A: 6 MB, array B: 3 MB), and the access pattern for the third (array C: 3 MB) is unpredictable. Although the calculations are not very demanding, accesses to array C cause a huge number of cache misses. One solution I thought of is to allocate more L2 cache space to array C and less to arrays A and B, but I cannot find any way to achieve this. I went through the ARM preload engine, but could not find anything useful.

arm cpu-cache

user285999 Mar 04 '10 at 6:09

2 answers

Nils Pipenbrinck · Answer 1 · 2010-03-04T21:11:08+0000

It would be nice to split the cache and assign each array to its own part.

Unfortunately, this is not possible. The Cortex-A8 caches are simply not that flexible. The good old StrongARM had an additional cache for exactly this kind of separation, but it is no longer available. Instead, we have L1 and L2 caches (overall a good change, IMHO).

However, you can do the following:

The NEON SIMD unit of the Cortex-A8 lags behind the general-purpose processing unit by about 10 processor cycles. With clever programming, you can issue cache prefetches from the general-purpose unit but perform the actual accesses through NEON. The delay between the two pipelines gives the cache some time to complete the prefetches, so the average cache miss time will be lower.

The downside is that you must never move the result of a calculation from NEON back to the ARM unit. Because NEON lags behind, doing so stalls the entire ARM pipeline, which is almost as expensive as a cache miss, if not more so.

The difference in performance can be significant. Off the top of my head, I would expect a 20% to 30% speed improvement.
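
As a rough illustration of the prefetch half of this idea, the sketch below issues a prefetch hint for array C a fixed number of iterations before the element is actually loaded. It assumes that the indices into C are known ahead of time (here via a hypothetical `idx` array), which may or may not hold for the code in the question. It uses the GCC/Clang builtin `__builtin_prefetch`, which typically compiles to a `PLD` instruction on ARM, and it leaves out the NEON part entirely. The names `sum_indexed` and `PREFETCH_DISTANCE` are illustrative and not from the original post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * The best value depends on memory latency and the work done per iteration. */
#define PREFETCH_DISTANCE 8

/* Sums c[idx[i]] for i in [0, n), hinting the cache about future accesses.
 * Assumes every idx[i] is a valid index into c. */
uint32_t sum_indexed(const uint32_t *c, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Read-only hint (rw = 0) with low temporal locality (locality = 0),
             * since each element of c is assumed to be touched only briefly. */
            __builtin_prefetch(&c[idx[i + PREFETCH_DISTANCE]], 0, 0);
        }
        sum += c[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing changes. The prefetch distance would need to be tuned by measurement on the target hardware.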

Brooks Moses · Answer 2 · 2010-03-05T02:07:22+0000

From what I can find through Google, it looks like ARMv7 (the ISA version that the Cortex-A8 supports) has cache-flush capabilities, although I could not find a clear reference on how to use them. You may do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into the search box and reading the results.

In any case, you should be able to approximate what you want by periodically issuing flush instructions to evict the parts of A and B that you know you no longer need.
