C ++: improving cache performance in a 3d array

I do not know how to optimize cache performance at a really low level, thinking about the size or associativity of the cache line. This is not something you can learn in one night. Given that my program will work on many different systems and architectures, I donโ€™t think that itโ€™s the same. However, there may be some steps that I can take to reduce cache misses in general.

Here is a description of my problem:

I have a three-dimensional array of integers representing values โ€‹โ€‹at points in space, for example [x] [y] [z]. Each dimension is the same size as the cube. From this I need to make another 3D array, where each value in this new array is a function of 7 parameters: the corresponding value in the original three-dimensional array plus 6 indices that โ€œtouchโ€ it in space. I'm not worried about the edges and corners of the cube right now.

Here is what I mean in C ++ code:

void process3DArray (int input[LENGTH][LENGTH][LENGTH], int output[LENGTH][LENGTH][LENGTH]) { for(int i = 1; i < LENGTH-1; i++) for (int j = 1; j < LENGTH-1; j++) for (int k = 1; k < LENGTH-1; k++) //The for loops start at 1 and stop before LENGTH-1 //or other-wise I'll get out-of-bounds errors //I'm not concerned with the edges and corners of the //3d array "cube" at the moment. { int value = input[i][j][k]; //I am expecting crazy cache misses here: int posX = input[i+1] [j] [k]; int negX = input[i-1] [j] [k]; int posY = input[i] [j+1] [k]; int negY = input[i] [j-1] [k]; int posZ = input[i] [j] [k+1]; int negZ = input[i] [j] [k-1]; output [i][j][k] = process(value, posX, negX, posY, negY, posZ, negZ); } } 

However, it seems that if LENGTH is big enough, I get tons of cache misses when I get the parameters for the process . Is there a way to make it easier to use the cache, or is there a better way to represent my data besides a 3D array?

And if you have time to answer these additional questions, do I need to consider the value of LENGTH? I like that LENGTH is 20 versus 100 versus 10,000. Also, would I need to do something else if I used something other than integers, for example, maybe a 64-byte structure?

@ildjarn:

Sorry, I did not think that the code that generates the arrays that I pass to process3DArray matters. But if so, I would like to know why.

 int main() { int data[LENGTH][LENGTH][LENGTH]; for(int i = 0; i < LENGTH; i++) for (int j = 0; j < LENGTH; j++) for (int k = 0; k < LENGTH; k++) data[i][j][k] = rand() * (i + j + k); int result[LENGTH][LENGTH][LENGTH]; process3DArray(data, result); } 
+7
source share
2 answers

Most importantly, what you already have. If you use Fortran, you will be doing it exactly wrong, but that's a different story. You are correct in that you are processing in the inner loop in the direction where the memory addresses are located closer to each other. One memory sample (outside the cache) will pull several values โ€‹โ€‹corresponding to a number of neighboring values โ€‹โ€‹of k. Inside the loop, the cache will contain a number of values โ€‹โ€‹from i, j; a similar number from i +/- 1, j and from i, j +/- 1. Thus, you basically have five disjoint memory sections. At low LENGTH values, this will be only 1 or three sections of memory. It is in the way that caches are created that you can have more than these many disjoint memory partitions in your active set.

I hope the process () will be small and inline. Otherwise, this may be negligible. In addition, this will affect whether your code fits in the command cache.

Since you're interested in performance, it's almost always best to initialize five pointers (you only need one for the value, posZ and negZ), and then take * (p ++) inside the loop.

 input[i+1] [j] [k]; 

asks the compiler to generate 3 additions and two multiplications if you do not have a very good optimizer. If your compiler is especially lazy about register allocation, you also get four memory accesses; otherwise.

 *inputIplusOneJK++ 

requests one addition and a link to memory.

+3
source

There is an answer to a similar question there: https://stackoverflow.com/a/3129603/

The main goal of optimizing multidimensional array traversal is to make sure that you visit the array in such a way that, as a rule, you reuse the cache lines received from the previous iteration step. To visit each element of the array once and only once, you can do this simply by visiting in memory order (as you do in your loop).

Since you are doing something more complex than simply traversing an element (visiting an element plus 6 neighbors), you need to break your traversal so that you do not have access to too many cache lines at once. Since j and k movements prevail during cache overflows, you just need to change the traversal so that you visit blocks at the same time, and not lines at a time.

eg:.

 const int CACHE_LINE_STEP= 8; void process3DArray (int input[LENGTH][LENGTH][LENGTH], int output[LENGTH][LENGTH][LENGTH]) { for(int i = 1; i < LENGTH-1; i++) for (int k_start = 1, k_next= CACHE_LINE_STEP; k_start < LENGTH-1; k_start= k_next; k_next+= CACHE_LINE_STEP) { int k_end= min(k_next, LENGTH - 1); for (int j = 1; j < LENGTH-1; j++) //The for loops start at 1 and stop before LENGTH-1 //or other-wise I'll get out-of-bounds errors //I'm not concerned with the edges and corners of the //3d array "cube" at the moment. { for (int k= k_start; k<k_end; ++k) { int value = input[i][j][k]; //I am expecting crazy cache misses here: int posX = input[i+1] [j] [k]; int negX = input[i-1] [j] [k]; int posY = input[i] [j+1] [k]; int negY = input[i] [j-1] [k]; int posZ = input[i] [j] [k+1]; int negZ = input[i] [j] [k-1]; output [i][j][k] = process(value, posX, negX, posY, negY, posZ, negZ); } } } } 

What this does is, make sure that you don't break the cache by visiting the grid in a block-oriented fashion (in fact, more like a thick, column-oriented style, limited by the size of the cache line). This is not ideal since there are overlaps that intersect the cache lines between columns, but you can tweak it to make it better.

+7
source

All Articles