Why is this for loop not faster using OpenMP?

I extracted this simple member function from a larger 2D program; all it does is loop over three different arrays and perform a mathematical operation (a 1D convolution). I tried using OpenMP to speed up this particular function:

 void Image::convolve_lines()
 {
   const int *ptr0 = tmp_bufs[0];
   const int *ptr1 = tmp_bufs[1];
   const int *ptr2 = tmp_bufs[2];
   const int width = Width;
 #pragma omp parallel for
   for ( int x = 0; x < width; ++x )
   {
     const int sum = 0 + 1 * ptr0[x] + 2 * ptr1[x] + 1 * ptr2[x];
     output[x] = sum;
   }
 }

With gcc 4.7 on Debian/wheezy amd64, the overall program runs much slower on a machine with 8 processors. With gcc 4.9 on Debian/jessie amd64 (only 4 processors on that machine), the overall program shows very little difference.

Using time to compare, single-core run:

 $ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total 

multi-core run:

 $ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total 

Where:

 $ head -3 black.pgm
 P5
 65536 65536
 255

So Width is set to 65536 at runtime.

In case it matters, I use CMake to compile:

 add_executable(test test.cxx)
 set_target_properties(test PROPERTIES
   COMPILE_FLAGS "-fopenmp"
   LINK_FLAGS "-fopenmp")

And the CMAKE_BUILD_TYPE parameter is set to:

 CMAKE_BUILD_TYPE:STRING=Release 

which implies -O3 -DNDEBUG.

My question is: why is this for loop not faster when using multiple cores? There is no overlap between the arrays, so OpenMP should simply split the iteration range between threads. I don't see where the bottleneck comes from.

EDIT: as suggested in the comments, I changed my input file to:

 $ head -3 black2.pgm
 P5
 33554432 128
 255

So Width is now set to 33554432 at runtime (which should be large enough). Now time shows:

single-core run:

 $ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total 

multi-core run (for some reason, cpu% was always below 100%, which would suggest no threading at all):

 $ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total 
1 answer

I have some general comments:

1. Before optimizing the code, make sure the data is 16-byte aligned. This is extremely important for any optimization you want to apply. If the data is split into three parts, it is better to add some dummy padding elements so that the starting address of each of the three parts is 16-byte aligned. That way the CPU can load your data into cache lines easily (see the allocation sketch after this list).

2. Before bringing in OpenMP, make sure the plain function is vectorized. In most cases, using the AVX/SSE instruction sets should give you a decent single-thread improvement of 2x to 8x. And it is very simple in your case: create a constant mm256 register set to 2, and load 8 integers into three mm256 registers. A Haswell processor can issue one addition and one multiplication together. In theory, the loop should speed up by a factor of 12 if the AVX pipeline can be kept full! (See the AVX sketch after this list.)

3. Sometimes parallelization can degrade performance. A modern CPU needs several hundred to thousands of clock cycles to warm up, entering high-performance states and scaling up its frequency. If the task is not large enough, it is very likely to finish before the CPU has warmed up, so you gain nothing by going parallel. And don't forget that OpenMP has overhead as well: thread creation, synchronization, and destruction. Another case is poor memory management: data accesses are so scattered that all CPU cores sit idle waiting for data from RAM.
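For point 1, here is a minimal sketch of how each line buffer could be allocated so that its start address is 32-byte aligned (enough for both 16-byte SSE and 32-byte AVX loads). The helper alloc_aligned_line is hypothetical, not part of the original program; std::aligned_alloc is C++17.

   #include <cstddef>
   #include <cstdlib>   // std::aligned_alloc, std::free (C++17)

   // Hypothetical helper: allocate one line buffer starting on a 32-byte boundary.
   // The byte count is rounded up to a multiple of the alignment, which also adds
   // the "dummy" padding elements mentioned in point 1.
   int *alloc_aligned_line(std::size_t width)
   {
     const std::size_t bytes = ((width * sizeof(int) + 31) / 32) * 32;
     return static_cast<int *>(std::aligned_alloc(32, bytes));
   }
   // Buffers allocated this way must later be released with std::free().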
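For point 2, here is a minimal sketch of the vectorized inner loop using AVX2 integer intrinsics (available on Haswell), assuming width is a multiple of 8 and the buffers are 32-byte aligned as above. It only illustrates the idea and is not the author's code; compile with -mavx2.

   #include <immintrin.h>

   // sum = 1*ptr0[x] + 2*ptr1[x] + 1*ptr2[x], computed eight lanes at a time.
   void convolve_lines_avx2(const int *ptr0, const int *ptr1, const int *ptr2,
                            int *output, int width)
   {
     const __m256i two = _mm256_set1_epi32(2);   // constant register set to 2
     for (int x = 0; x < width; x += 8)
     {
       const __m256i a = _mm256_load_si256((const __m256i *)(ptr0 + x)); // 8 ints
       const __m256i b = _mm256_load_si256((const __m256i *)(ptr1 + x));
       const __m256i c = _mm256_load_si256((const __m256i *)(ptr2 + x));
       const __m256i sum =
         _mm256_add_epi32(a, _mm256_add_epi32(_mm256_mullo_epi32(b, two), c));
       _mm256_store_si256((__m256i *)(output + x), sum);
     }
   }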

My suggestion:

You might want to try Intel MKL; don't reinvent the wheel. The library is optimized to the extreme, with no clock cycle wasted. You can link against either the serial library or the parallel version; a speed-up is guaranteed if going parallel is possible at all. A hedged sketch of a possible MKL call follows below.
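As an illustration only, this sketch assumes MKL's VSL convolution task API (vslsConvNewTask1D / vslsConvExec1D) for a full 1D convolution of one image line with the {1, 2, 1} kernel; note that VSL works on float/double, so the int buffers from the question would have to be converted first. Whether the serial or the threaded MKL runs is decided by which MKL libraries you link against.

   #include <mkl_vsl.h>   // Intel MKL convolution/correlation (VSL) API

   // Hedged sketch, not the author's code: full 1D convolution of one line with
   // the {1, 2, 1} kernel. 'out' must hold width + 2 elements.
   void convolve_line_mkl(const float *line, int width, float *out)
   {
     const float kernel[3] = { 1.0f, 2.0f, 1.0f };

     VSLConvTaskPtr task;
     vslsConvNewTask1D(&task, VSL_CONV_MODE_AUTO, 3, width, width + 2);
     vslsConvExec1D(task, kernel, 1, line, 1, out, 1);
     vslConvDeleteTask(&task);
   }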
