Performance test between two directions of a 2D array

This code (A) runs much faster (about 10 times) than the second one:

    for(int w=0; w<width; w++) {
        for(int h=1; h<height; h++) {
            image[h][w] = (1-a)*image[h][w] + a*image[h-1][w];
        }
    }

Second:

    for(int h=0; h<height; h++) {
        for(int w=1; w<width; w++) {
            image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
        }
    }

Why? It is the same operation applied to every pixel of the image, just in the horizontal or vertical direction.

Is there a way to speed up the second?

Thanks in advance.

2 answers

This is due to locality of reference. If you access the elements in the same order they are stored in memory, it will be much faster than accessing them in a strided pattern, because the memory cache and the memory bandwidth are used much more efficiently.

The above explains why the second version should be faster than the first, and that is exactly what happens on my box:

    aix@aix:~$ time ./ver1
    real    0m29.421s
    aix@aix:~$ time ./ver2
    real    0m2.198s

Here is the code I'm using to allocate the array:

    double a = 0.5;
    int width = 2048;
    int height = 2048;

    double* data = new double[height * width];
    double** image = new double*[height];
    for (int i = 0; i < height; i++) {
        image[i] = data + i * width;
    }

Version 1 is timed with the following loop:

    for (int iter = 0; iter < 100; iter++) {
        for(int w=0; w<width; w++) {
            for(int h=1; h<height; h++) {
                image[h][w] = (1-a)*image[h][w] + a*image[h-1][w];
            }
        }
    }

Version 2:

    for (int iter = 0; iter < 100; iter++) {
        for(int h=0; h<height; h++) {
            for(int w=1; w<width; w++) {
                image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
            }
        }
    }

Compiled with g++ 4.4.3 with -O3 and run on a Xeon box of some description (64-bit Ubuntu).

If you are still 100% sure that you are seeing the opposite effect, there must be something fundamentally different about what you are doing compared to what I am doing. It would help if you told us the size of your image and how it is allocated (to help establish the memory layout).


aix is right about locality of reference. To be more explicit, it comes down to the memory hierarchy.

When you first access an element, you may get a cache miss. The entire cache line is loaded, and then the read/write is performed.

Depending on which direction you traverse the array, the next access will be either at location i + 1 or at i + N. Location i + 1 will probably be in the same cache line, but i + N will usually be in a different cache line, requiring another fetch.

For small N, everything ends up in the cache and the direction does not really matter. For sufficiently large N, the whole array can no longer fit in the fastest (and smallest) level of the cache, so the cache line containing element i may already be evicted by the time you reach i + M*N, and must be reloaded when you come back for i + 1.

To make this as fast as possible you need detailed knowledge of the processor architecture. Some architectures are more sensitive to alignment than others. Some prefer that you touch each cache line only once (up to capacity) and then move on. And of course the exact timings and trade-offs vary from processor to processor.

