Improving cache performance when iterating through a simple 2D array?

I am trying to come up with a way to rewrite the code below to improve cache performance (by decreasing cache misses) in the array.

I know that the array is stored in memory line by line (sequentially), so ary [0] [0], ary [0] [1], ary [0] [2], .... ary [1] [0] , ary [1] [1], ary [1] [2] ... ary [50] [0], ary [50] [1] ... ary [50] [50], However, I not sure how to use this information to help me figure out how to change the loop to improve cache performance.

for (c = 0; c < 50; c++) for (d = 0; d < 50; d++) ary[d][c] = ary[d][c] + 1; 
+4
source share
4 answers

If you want to access all cells in a row at once, just invert the two loops:

 for (d = 0; d < 50; d++) for (c = 0; c < 50; c++) ary[d][c] = ary[d][c] + 1; 

Or even

 for (d = 0; d < 50; d++) int[] array = ary[d]; for (c = 0; c < 50; c++) array[c] = array[c] + 1; 

But I doubt that it has any significant effect or even any effect whatsoever, especially on such a tiny array. Make your code simple and readable. Do not pre-optimize.

+4
source

Change the cycle order. You get access to arr[1][0] immediately after arr[0][0] . arr[1][0] is much further away, and arr[0][1] is located at the following address.

+3
source

You want to minimize the number of cache misses to improve performance. Each cache error leads to memory access and loading of a new block into the cache. This block contains not only the required value, but also additional adjacent values ​​from memory. You need to use the principle of locality, i.e. Use as many values ​​as possible from each memory access. As you mentioned in your observation, the array is stored line by line in memory, so interleaving the array in a consistent manner minimizes the number of misses in the cache. Returning to your code, or change the order of the loop:

 for (d = 0; d < 50; d++) for (c = 0; c < 50; c++) ary[d][c] = ary[d][c] + 1; 

or change the indices in the calculation:

 for (c = 0; c < 50; c++) for (d = 0; d < 50; d++) ary[c][d] = ary[c][d] + 1; 

You can even consider a 2D array as a 1 * 50 * 50 array and just use a single loop to scan from start to finish.

+1
source

You probably do not need to do anything except replace the loop, because the caches are designed to use the locality of the link in the code by themselves, which means that it will cache the first element along with several of the following elements (spatial locality) from the array and for some time it will Store them in a cache (temporary locality).

However, some compilers allow you to control the caching, for example, gcc has __ builtin_prefetch , which allows you to control which data should be preprogrammed and whether it should be left in the cache or not.

- Built-in function: void __builtin_prefetch (const void * addr, rw, locality)

This feature is used to minimize cache latency by moving data to the cache before it is available. You can embed __builtin_prefetch calls in code for which you know the data addresses in memory that are likely to be available soon. If the target supports, they generate data prefetch instructions. If the prefetch made early enough before access, then the data will be in the cache at the time of access.

This guide provides an example:

 for (i = 0; i < n; i++) { a[i] = a[i] + b[i]; __builtin_prefetch (&a[i+j], 1, 1); __builtin_prefetch (&b[i+j], 0, 1); /* ... */ } 
0
source

All Articles