I have what I believe is a relatively simple OpenMP construct. The problem is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post().
The program gives correct results, but I wonder whether there is a flaw in the code that causes a resource conflict, or whether the thread-creation overhead is simply not worth it for a loop with chunk size 4. p varies from 17 to 1000, depending on the size of the molecule we are modeling.
My numbers refer to the worst case, where p is 17 and the chunk size is 4. Performance is the same whether I use static, dynamic, or guided scheduling. With p=150 and chunk size 75, the program is still 75x-100x slower than serial.
...
double e_t_sum = 0.0;
double e_in_sum = 0.0;
int nthreads, tid;

for (i = 0; i < p; i++){
    if (i != c){
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        d_x = V_in[i].x - t_x;
        d_y = V_in[i].y - t_y;
        d_z = V_in[i].z - t_z;
        rr = d_x * d_x + d_y * d_y + d_z * d_z;
        if (i < c){
            ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[i][c];
            e_in_sum += ee_in[i][c];
        }
        else{
            ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[c][i];
            e_in_sum += ee_in[c][i];
        }
        // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}
    }
}//end parallel for
e_t += e_t_sum;
e_t -= e_in_sum;
...
Mason E. Kramer