OpenMP GCC GOMP wasteful barrier

I have the following program. nv is about 100, and each dgemm is roughly 20x100, so there is plenty of work to go around:

    #pragma omp parallel for schedule(dynamic,1)
    for (int c = 0; c < int(nv); ++c) {
        omp::thread thread;
        matrix &t3_c = vv_.at(omp::num_threads() + thread);
        if (terms.first) {
            blas::gemm(1, t2_, vvvo_, 1, t3_c);
            blas::gemm(1, vvvo_, t2_, 1, t3_c);
        }
        matrix &t3_b = vv_[thread];
        if (terms.second) {
            matrix &t2_ci = vo_[thread];
            blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
            blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
        }
    }

However, with GCC 4.4 (GOMP v1), gomp_barrier_wait_end accounts for almost 50% of the execution time in the profile. Changing GOMP_SPINCOUNT reduces that overhead, but then only about 60% of the cores are used. The same happens with OMP_WAIT_POLICY=passive. The system is Linux with 8 cores.

How can I get full use of all cores without the spinning/waiting overhead?

2 answers

The barrier is a symptom, not the problem. The reason there is so much waiting at the end of the loop is that some of the threads finish long before the rest, and they all then sit at the implicit barrier at the end of the for loop until everyone is done.

This is the classic load-imbalance problem, which is odd here because these are just matrix multiplications. Are the matrices of different sizes? How are they laid out in memory, NUMA-wise: are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply, are there only 9 matrix multiplies, so the remaining 8 threads are doomed to wait for whoever gets the last one?

When this happens in a larger parallel block of code, you can sometimes move on to the next block of code while some loop iterations are still unfinished; there you can add the nowait clause to the for directive, which overrides the default behaviour and gets rid of the implied barrier. Here, however, since the parallel region is exactly the size of the for loop, that probably won't help.


Maybe your BLAS implementation also calls OpenMP internally? That would explain it, unless you see only one call site for gomp_barrier_wait_end in your profile.

