OpenMP Parallelization Prevents Vectorization

I am new to OpenMP and I am trying to parallelize the following code using OpenMP:

#pragma omp parallel for
for (int k = 0; k < m; k++) {
    for (int j = n - 1; j >= 0; j--) {
        outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];
        for (int i = 0; i < j; i++) {
            inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
        }
    }
}

Parallelizing the outer loop is pretty straightforward, but to optimize further I would like to parallelize the innermost loop as well (the one iterating over i). But when I try to do it like this:

#pragma omp parallel for
for (int i = 0; i < j; i++) {
    inB2[k*n + i] -= inA2[i + n*j] * outX[k + m*j];
}

the compiler does not vectorize the inner loop ("loop versioned for vectorization because of possible aliasing"), which makes it run more slowly. I compiled it with gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp

Thanks for any advice!

EDIT: I use the __restrict keyword for the arrays.

EDIT2: Interestingly, when I keep only the pragma on the inner loop and remove it from the outer one, gcc vectorizes it. So the problem occurs only when I try to parallelize both loops.

EDIT3: The compiler vectorizes the loop when I use #pragma omp parallel for simd. But it is still slower than without a parallel inner loop.

3 answers

Thanks to everyone for the answers. I managed to vectorize the inner loop using #pragma omp parallel for simd, but the program was slower than without parallelization. In the end, I found a slightly different algorithm to solve the problem, which is much faster. Thanks for the help, guys!


I assume that after you parallelized the inner loop, your compiler lost track of the fact that inA2, inB2 and outX do not alias. By default, the compiler assumes that the memory locations pointed to by any two pointers may overlap. In C, the C99 standard introduced the restrict keyword, which tells the compiler that the pointer refers to a block of memory that no other pointer refers to. C++ has no such keyword but, fortunately, g++ provides an equivalent extension. So try adding __restrict to the declarations of the pointers used in the loop. For example, replace

 double* outX; 

with

 double* __restrict outX; 
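
To make the effect of __restrict concrete, here is a minimal sketch of the inner update written as a standalone function. The function name and its parameters are hypothetical and do not come from the question; the point is only that both pointer parameters carry __restrict, so the compiler may assume they never overlap. Moving the inner loop into such a function may also help when the qualifier gets lost inside an OpenMP region, but that is an assumption to check with -fopt-info-vec, not a guarantee.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: y[i] -= a[i] * s for i in [0, count).
// __restrict is a GCC/Clang extension (not standard C++) promising that
// y and a never overlap, so no runtime aliasing check is required.
void subtract_scaled(double* __restrict y, const double* __restrict a,
                     double s, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        y[i] -= a[i] * s;
    }
}
```

In the question's loop nest, the call would look roughly like subtract_scaled(inB2 + k*n, inA2 + n*j, outX[k + m*j], j).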

Have you tried vectorizing the inner loop first, and only then adding the parallel part (which can reduce performance, depending on cache misses)?

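#pragma omp parallel for
for (int k = 0; k < m; k++) {
    for (int j = n - 1; j >= 0; j--) {
        outX[k + j*m] = inB2[j + n*k] / inA2[j*n + j];
        // Hoist the loop-invariant index arithmetic out of the inner loop.
        const int Q1 = k * n;
        const int Q2 = n * j;
        const int Q3 = m * j + k;
        // "declare simd" applies only to functions; a loop takes "omp simd".
        // The uniform and shared clauses are not valid on "omp simd".
        #pragma omp simd
        for (int i = 0; i < j; i++) {
            inB2[Q1 + i] -= inA2[Q2 + i] * outX[Q3];
        }
    }
}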

It always takes me a while to get the clauses (shared, private, etc.) right, and I have not tested this.
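
As a rough sketch of the idea above (threads on the outer loop, SIMD only on the inner loop), the whole solve can be wrapped in one function. The function name solve_upper is hypothetical; the array layout is taken from the question's indexing, and the __restrict qualifiers assume the three arrays really do not overlap. Note that a second parallel for inside an already parallel region opens a nested parallel region, which many OpenMP runtimes run with a single thread by default, so it tends to add overhead without adding parallelism; omp simd avoids that.

```cpp
#include <cassert>
#include <vector>

// Sketch: solve U * X = B by back substitution for m right-hand sides.
// Layout as in the question: U(i, j) is inA2[i + n*j], right-hand side k
// occupies inB2[k*n .. k*n + n - 1], and X(j, k) is stored at outX[k + j*m].
// inB2 is overwritten. Assumes the three arrays do not overlap.
void solve_upper(const double* __restrict inA2, double* __restrict inB2,
                 double* __restrict outX, int n, int m) {
    assert(n >= 0 && m >= 0);
    // Threads split the independent right-hand sides.
    #pragma omp parallel for
    for (int k = 0; k < m; ++k) {
        double* b = inB2 + k * n;            // right-hand side k
        for (int j = n - 1; j >= 0; --j) {
            const double* col = inA2 + n * j; // column j of U
            const double x = b[j] / col[j];
            outX[k + j * m] = x;
            // SIMD only: no new thread team is created here.
            #pragma omp simd
            for (int i = 0; i < j; ++i) {
                b[i] -= col[i] * x;
            }
        }
    }
}
```

Keeping the just-computed value in the local x means the inner loop reads no element of outX at all, which removes one possible source of aliasing between outX and inB2. Compile with -fopenmp as in the question; without it the pragmas are ignored and the code still runs serially.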
