Horrible performance - a simple overhead problem, or is there a software flaw?

I have what I understand to be a fairly simple OpenMP construct. The problem is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the runtime is spent in gomp_sem_wait() and another 9.5% in gomp_sem_post().

The program gives the correct results, but I'm wondering whether there is a flaw in the code that causes some resource contention, or whether the overhead of creating the threads is simply drastically not worth it for a loop with chunk size 4. p varies from 17 to 1000, depending on the size of the molecule we are simulating.

My numbers are for the worst case, where p is 17 and the chunk size is 4. The performance is the same whether I use static, dynamic, or guided scheduling. With p=150 and a chunk size of 75, the program is still 75x-100x slower than the serial version.

...
    double e_t_sum=0.0;
    double e_in_sum=0.0;

    int nthreads,tid;

    #pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate( V_in, t_x, t_y, t_z) lastprivate(nthreads)
    for (i = 0; i < p; i++){
        if (i != c){
            nthreads = omp_get_num_threads();               
            tid = omp_get_thread_num();

            d_x = V_in[i].x - t_x; 
            d_y = V_in[i].y - t_y;
            d_z = V_in[i].z - t_z;


            rr = d_x * d_x + d_y * d_y + d_z * d_z;

            if (i < c){

                ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[i][c]; 
                e_in_sum += ee_in[i][c];    
            }
            else{

                ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[c][i]; 
                e_in_sum += ee_in[c][i];    
            }

            // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}

        }
    }//end parallel for 


        e_t += e_t_sum;
        e_t -= e_in_sum;            

...
+5
6 answers

First of all, I don't think that optimizing your serial code will help answer your OpenMP dilemma here. Don't worry about it.

IMO there are three possible explanations for the slowdown:

  • False sharing. The elements of ee_t are likely landing in the same cache lines, so the cores keep invalidating each other's lines even though they never actually share data (which is why it is called false sharing). If the term is unfamiliar, google it. Aligning the ee_t elements to cache-line boundaries may help a lot.

  • Not enough parallelism to amortize the overhead. How many cores are you running on, 8? What happens with 2?

  • The total iteration count is small: take 17, your worst case. Split across 8 cores it suffers load imbalance, especially since some iterations do practically no work (when i == c): some threads must do 3 iterations while the others do 2. This does not explain a slowdown by itself, but it is one reason the speedup cannot be as high as you might expect. Since your iterations vary in length, I would use a dynamic schedule with a chunk size of 1, or OpenMP guided. Experiment with the chunk size; too small a chunk will also slow things down.

Let me know how it goes.

+6

Whatever the cause turns out to be, guessing will not find it: you need to measure where the time actually goes before changing anything.

Note 1: pick your profiling tool carefully. Some profilers (e.g., the ones bundled with the OS, the compiler vendor, etc.) distort the timing of threaded programs. Some compilers (e.g., Sun C) can instrument the build itself and attribute time per thread much more reliably (cf. -xprofile).

Once you have real numbers, it becomes clear whether the time goes to synchronization, to the loop body, or to something else entirely.

Then you can reason sensibly about the serial/parallel split.

See also: OpenMP on Solaris.

+2

I suggest moving the branches (i.e., the ifs) out of the loop and splitting it into two separate loops, one for i < c and one for i > c. This would benefit even the single-threaded code, and it gives you more parallelism, though as you said the thread-creation overhead may still outweigh the benefit for small n.

+1

It's hard to say more without measurements. Keep in mind that an OpenMP parallel region has real startup and synchronization costs; for a loop body this small they can easily dominate. Profile before you trust any explanation.

+1

First, with a loop this small I doubt schedule(static, 4) is buying you anything; but that is probably not the slowdown either.

Hypothesis 1: GOMP (which you appear to be using) may implement the reduction and lastprivate with locking, which would serialize the threads. Try replacing the reduction variables e_t_sum and e_in_sum with arrays of nthreads elements, have each thread accumulate into e_t_sum[tid], and sum the partials after the loop.

If that fixes it, you know the reduction implementation was the culprit; if not, move on to the next hypothesis.

Hypothesis 2: false sharing on ee_t. Neighbouring iterations write neighbouring memory: when i > c every thread stores into the same row, ee_t[c][i], so adjacent iterations hit the same cache line and the line ping-pongs between cores, whereas the i < c writes to ee_t[i][c] land a full row apart.

+1

It looks like you are using the GNU implementation of OpenMP. Try a different one: Intel's compiler is available on Linux, and its OpenMP runtime behaves quite differently.

Beyond that, the usual suspect for a slowdown like this is false sharing: threads writing to data that happens to share a cache line, for example neighbouring elements of V_in, forces the line to bounce between cores.

I would say that one of those two is your problem.

0