I extracted this simple member function from a larger 2D program. All it does is loop over three arrays and perform a mathematical operation (a 1D convolution). I tried OpenMP to speed up this particular function:
    void Image::convolve_lines()
    {
      const int *ptr0 = tmp_bufs[0];
      const int *ptr1 = tmp_bufs[1];
      const int *ptr2 = tmp_bufs[2];
      const int width = Width;
    #pragma omp parallel for
      for ( int x = 0; x < width; ++x )
        {
        const int sum = 0 + 1 * ptr0[x] + 2 * ptr1[x] + 1 * ptr2[x];
        output[x] = sum;
        }
    }
If I use gcc 4.7 on Debian wheezy amd64, the overall program runs much slower on a machine with 8 processors. If I use gcc 4.9 on Debian jessie amd64 (only 4 processors on this machine), the overall program shows very little difference.
Using time to compare, single-core launch:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi-core launch:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
    $ head -3 black.pgm
    P5
    65536 65536
    255
So, Width is set to 65536 at runtime.
If this is important, I use cmake to compile:
    add_executable(test test.cxx)
    set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
And the CMAKE_BUILD_TYPE parameter is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG.
My question is: why does this for loop not get any faster on a multi-core machine? There is no overlap between the arrays, and OpenMP should simply split the iteration range across the cores. I don't see where the bottleneck comes from.
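For reference, here is a minimal standalone sketch of how this loop could be timed in isolation (this is not my real program: the buffer contents and the outer loop count are made up to mimic one convolve_lines() call per image line; compile with g++ -O3 -fopenmp):

    // Standalone sketch: time the inner loop alone, with and without OpenMP.
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main()
    {
        const int width = 65536;   // same Width as black.pgm (assumption)
        const int lines = 65536;   // pretend one call per image line (assumption)
        std::vector<int> ptr0(width, 1), ptr1(width, 2), ptr2(width, 3);
        std::vector<int> output(width, 0);

        const double t0 = omp_get_wtime();
        for (int line = 0; line < lines; ++line)
        {
    #pragma omp parallel for
            for (int x = 0; x < width; ++x)
            {
                output[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
            }
        }
        const double t1 = omp_get_wtime();
        // Print a value from output so the loop cannot be optimized away.
        std::printf("checksum %d, elapsed %f s\n", output[0], t1 - t0);
        return 0;
    }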
EDIT: as suggested in the comments, I changed my input file to:
    $ head -3 black2.pgm
    P5
    33554432 128
    255
So, Width is now set to 33554432 at runtime (which should be large enough). Now time shows:
single-core launch:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi-core launch (for some reason, %cpu always stayed below 100%, which suggests no extra threads were running at all; see the check below):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total
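Since %cpu never went above 100%, a quick sanity check (a sketch, independent of my program) would be to print how many threads an OpenMP parallel region actually gets with this build; compile with g++ -fopenmp:

    // Sanity-check sketch: report the team size of a parallel region once.
    #include <cstdio>
    #include <omp.h>

    int main()
    {
    #pragma omp parallel
        {
    #pragma omp single
            std::printf("parallel region uses %d thread(s), omp_get_max_threads() = %d\n",
                        omp_get_num_threads(), omp_get_max_threads());
        }
        return 0;
    }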