I have the following program. nv is about 100 and each dgemm is 20x100 or so, so there is plenty of work to go around:
    #pragma omp parallel for schedule(dynamic,1)
    for (int c = 0; c < int(nv); ++c) {
        omp::thread thread;
        matrix &t3_c = vv_.at(omp::num_threads()+thread);
        if (terms.first) {
            blas::gemm(1, t2_, vvvo_, 1, t3_c);
            blas::gemm(1, vvvo_, t2_, 1, t3_c);
        }
        matrix &t3_b = vv_[thread];
        if (terms.second) {
            matrix &t2_ci = vo_[thread];
            blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
            blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
        }
    }
However, with GCC 4.4 (GOMP v1), gomp_barrier_wait_end accounts for almost 50% of the execution time. Changing GOMP_SPINCOUNT reduces the overhead, but then only 60% of the cores are used. The same happens with OMP_WAIT_POLICY=passive. The system is Linux with 8 cores.
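For anyone who wants to try the environment settings without my matrix classes, here is a minimal sketch of the same loop shape. It is not my real code: `naive_gemm`, `run_loop` and `LoopResult` are hypothetical names, the triple loop is only a stand-in for `blas::gemm`, and the square all-ones matrices are an assumption made so the result is easy to check.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for blas::gemm: C += A * B for square n x n
// row-major matrices.
static void naive_gemm(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

struct LoopResult {
    double seconds;   // wall-clock time of the parallel loop
    double checksum;  // sum over all per-thread scratch buffers
};

// Same structure as the loop in the question: nv iterations handed out
// one at a time, each accumulating into a per-thread scratch matrix.
LoopResult run_loop(int nv, std::size_t n) {
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    const std::vector<double> a(n * n, 1.0), b(n * n, 1.0);
    std::vector<std::vector<double> > scratch(
        nthreads, std::vector<double>(n * n, 0.0));

    const auto start = std::chrono::steady_clock::now();
#pragma omp parallel for schedule(dynamic, 1)
    for (int c = 0; c < nv; ++c) {
        int tid = 0;  // stays 0 when built without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        naive_gemm(a, b, scratch[tid], n);
    }
    const auto stop = std::chrono::steady_clock::now();

    double checksum = 0.0;
    for (std::size_t t = 0; t < scratch.size(); ++t)
        for (std::size_t i = 0; i < scratch[t].size(); ++i)
            checksum += scratch[t][i];

    LoopResult result;
    result.seconds = std::chrono::duration<double>(stop - start).count();
    result.checksum = checksum;
    return result;
}
```

Calling `run_loop(100, 64)` from a `main` built with `g++ -O2 -fopenmp` and run under different `GOMP_SPINCOUNT` or `OMP_WAIT_POLICY` values gives a timing to compare. With all-ones inputs the checksum should be nv * n^3 no matter how the iterations are distributed over threads.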
How can I get full utilization of the cores without the spinning/waiting overhead?
Anycorn