Parallel Acceleration with OpenMP

I have two scenarios for measuring metrics such as computation time and parallel speedup (sequential_time / parallel_time).

Scenario 1:

Sequential time measurement:

```c
startTime = omp_get_wtime();
for (blah blah) {        /* loop computation */
    computation;
}
endTime = omp_get_wtime();
seq_time = endTime - startTime;
```

Parallel time measurement:

```c
startTime = omp_get_wtime();
#pragma omp parallel for reduction(+:pi) private(i)
for (blah blah) {        /* loop computation */
    computation;
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;
speedup = seq_time / paralleltime;
```

Scenario 2:

Sequential time measurement:

```c
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}
```

Parallel time measurement:

```c
#pragma omp parallel for reduction(+:pi, paralleltime) private(i, startTime, endTime)
for (blah blah) {
    startTime = omp_get_wtime();
    computation;
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}
speedup = seq_time / paralleltime;
```

I know that scenario 2 is NOT what you would write in production code, but I think it measures the actual theoretical performance, because it OVERLOOKS the overhead OpenMP incurs in creating and managing (context-switching between) multiple threads. So it should give us linear speedup. Scenario 1, on the other hand, includes the overhead of thread spawning and management.

My doubt is this: with scenario 1, I get a speedup that starts out linear but tails off as I go to more iterations. With scenario 2, I get full linear speedup regardless of the number of iterations. I was told that scenario 1 should actually give me linear speedup regardless of the number of iterations, but I think it doesn't, because of the high overhead of thread management. Can someone explain why I'm wrong?

Thanks! And sorry for the rather long post.

4 answers

There are many situations where scenario 2 will not give you linear speedup: false sharing between threads (or, for that matter, true sharing of shared variables that get modified), memory-bandwidth limits, etc. The sublinear speedup is usually real, not an artifact of the measurement.

More generally, once you get to the point of putting timers inside for loops, you are looking at more fine-grained timing information than such timers can really measure accurately. You may well want to separate the thread-management cost from the actual work being done, for various reasons, but here you are trying to do that by inserting N extra calls to omp_get_wtime(), plus the arithmetic and the reduction operation, all of which add non-negligible overhead of their own.

If you really want an accurate accounting of how much time is spent in the computation, you really want to use sampling rather than manual instrumentation (we talk a little about the distinction here). Using gprof or scalasca or openspeedshop (all free software) or Intel VTune (a commercial package) will tell you how much time is being spent on which lines, often even per thread, with much lower overhead.


First of all, by the definition of speedup, you should use scenario 1, which includes the parallel overhead.


In scenario 2, the code measuring paralleltime is wrong. To meet your goal in scenario 2, you need a per-thread paralleltime: allocate double paralleltime[NUM_THREADS] and index it with omp_get_thread_num() (note that this layout will suffer false sharing, so you had better pad each slot out to a 64-byte struct). Then measure the computation time per thread, and finally take the longest one to compute a different kind of metric (I would call it something like parallelism).


No: you can see sublinear speedup even in scenario 2, and you can even get superlinear speedup. The potential reasons (i.e., excluding parallel overhead) are:

  • Load imbalance: the amount of work in computation varies from iteration to iteration. This is the most common reason for sublinear speedup (though you say load imbalance is not the case here).
  • Synchronization cost: if there is any synchronization (e.g., a mutex, event, or barrier), you may spend time waiting or blocked.
  • Cache and memory cost: when computation demands high bandwidth or has a large working set, parallel code can suffer from bandwidth limits (though this is rare in practice) and cache conflicts. Also, false sharing can be a significant cause, but it is easy to avoid. A superlinear effect can be observed too, since using multiple cores gives you more total cache (i.e., each core's private L1/L2 caches).

Scenario 1, in contrast, will include the overhead of the parallel library:

  • Thread fork/join: although most parallel-library implementations will not actually create and destroy physical threads on every parallel construct.
  • Dispatching/joining logical tasks: even if the physical threads are already created, you need to dispatch a logical task to each thread (in general, M tasks onto N threads), and also perform some join operation at the end (e.g., an implicit barrier).
  • Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's static scheduling), the overhead is minimal. You can safely ignore it when the workload is large enough (say, 0.1 seconds). Dynamic scheduling (such as work stealing in TBB) has some overhead, but it too is insignificant once your workload is sufficient.

I don't think your code (a single-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless this code is called millions of times per second. So the other reasons I mentioned above are more likely.

Keep in mind that many factors determine speedup, from the inherent parallelism of the algorithm (i.e., load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).


What exactly do you want to measure? The overhead due to parallelism is part of the real execution time, so IMHO scenario 1 is better.

Also, according to your OpenMP directives, you are doing a reduction. In scenario 1 you account for it; in scenario 2 you do not, since the reduction happens outside your timed regions. So you are basically measuring less than in scenario 1, which probably also affects your measurements.

Otherwise, Jonathan Dursi's answer is excellent.


OpenMP has several options for how work is distributed among the threads. This can affect the linearity of your measurement method 1. Measurement method 2 does not seem useful. What are you trying to measure? If you want single-thread performance, run a single thread. If you want parallel performance, you have to include the overhead.

