It would be nice to split the cache and allocate each array in a different part.
Unfortunately this is not possible. The Cortex-A8 caches are simply not that flexible. The good old StrongARM had a secondary cache for exactly this kind of splitting, but it is no longer available. Instead, we have L1 and L2 caches (overall a good change, imho).
However, you can do the following:
The NEON SIMD unit of the Cortex-A8 lags behind the general-purpose processing unit by about 10 processor cycles. With clever programming, you can issue cache prefetches from the general-purpose unit but do the actual accesses through NEON. The delay between the two pipelines gives the cache some time to complete the prefetch, so the average cache-miss time will be lower.
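To make the idea concrete, here is a minimal sketch rather than a tuned implementation. It assumes GCC or Clang (for `__builtin_prefetch`) and the standard `arm_neon.h` intrinsics. The function name `add_arrays_prefetched` and the distance `PREFETCH_AHEAD` are made up for illustration, and the best distance has to be found by measurement. On a non-NEON target it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical tuning value: how many floats ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* dst[i] = a[i] + b[i]; prefetches are issued by the ARM core,
 * the loads and stores themselves go through NEON. */
void add_arrays_prefetched(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        /* Prefetch hints from the general-purpose pipeline. Issuing one
         * per iteration is redundant within a cache line but harmless. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(a + i + PREFETCH_AHEAD);
            __builtin_prefetch(b + i + PREFETCH_AHEAD);
        }
        /* Data access and arithmetic on the NEON pipeline. */
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Note that the result is stored straight from NEON to memory, so nothing flows back into an ARM register inside the loop.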
The drawback is that you must never move the result of a calculation from NEON back to the ARM unit. Because NEON lags behind, doing so stalls the entire processor pipeline until NEON has caught up. That is almost as expensive as a cache miss, if not more so.
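One way to respect this rule in a reduction is to keep the running value in a NEON register for the whole loop and move it to the ARM side exactly once, at the very end. The following is only a sketch under that assumption. `sum_floats` is a hypothetical name, and the scalar fallback is there so the code also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sums n floats. The accumulator lives in a NEON register inside the
 * loop; only one NEON-to-ARM transfer happens, after the loop. */
float sum_floats(const float *x, size_t n)
{
    size_t i = 0;
    float total = 0.0f;
    assert(x != NULL || n == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(x + i));   /* stays on NEON */
    /* Horizontal add of the four lanes, still on NEON. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    /* The single, unavoidable transfer back to the ARM side. */
    total = vget_lane_f32(half, 0);
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

The version to avoid would extract a lane (or compare and branch on a NEON result) inside the loop, which pays the stall on every iteration.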
The difference in performance can be significant. Off the top of my head, I would expect a speed improvement of somewhere between 20% and 30%.
Nils Pipenbrinck