What does "overhead time" mean in the concurrency analysis output of the profiler?

I would be very grateful if someone with solid Intel VTune Amplifier experience could shed light on this.

I recently received a performance analysis report from colleagues who ran Intel VTune Amplifier against my program. They report a high overhead time in the concurrency analysis results.

What does overhead time mean? They do not know (they asked me), and I do not have access to Intel VTune Amplifier myself.

I have a vague idea. The program has many sleep calls because pthread condition variables were unstable on the target platform (or I used them badly), so I changed many routines to do their work in a polling loop, as shown below:

 while (true) {
     mutex.lock();
     if (event changed) {
         mutex.unlock();
         // do something
         break;
     } else {
         mutex.unlock();
         usleep(3 * 1000);
     }
 }

Could this loop be what gets marked as overhead time?

Any tips?


I found help documentation on overhead time on the Intel website: http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/win/ug_docs/olh/common/overhead_time.html#overhead_time

Excerpts:

Overhead time is the duration that begins when a shared resource is released and ends when that resource is next acquired. Ideally, the overhead duration is very short, since it is time a thread spends waiting to receive a resource rather than doing real work. However, not all CPU time in a parallel application is spent on real payload work. In cases where parallel runtimes (Intel® Threading Building Blocks, OpenMP*) are used inefficiently, a significant amount of time can be spent inside the parallel runtime itself, wasting CPU time at high concurrency levels. For example, this can be caused by too fine a granularity of work splitting in recursive parallel algorithms: when the chunk of work becomes too small, the overhead of splitting and scheduling the work becomes significant.

Still confusing.. Could this mean that I am locking unnecessarily / too frequently?

c++ c multithreading profiling
3 answers

I am not very good at this either, although I have used pthread a bit.

To demonstrate my understanding of overhead, let's take the example of a simple single-threaded program that computes the sum of an array:

 for (i = 0; i < NUM; i++) {
     sum += array[i];
 }

In a simple, reasonably written multithreaded version of this code, the array can be split into one part per thread, each thread keeps its own partial sum, and after the threads finish, the partial sums are added together.

In a very poorly written multithreaded version, the array can be split as before, but each thread atomically adds each element to a single global sum.

In this case, the atomic addition can be performed by only one thread at a time. I believe overhead is an indicator of how long all the other threads spend waiting for their turn to do the atomic add (you could write this program yourself to check, if you want to be sure).

Of course, it also counts the time spent operating semaphores and mutexes. In your case, it probably means that a significant amount of time is spent inside the internals of mutex.lock and mutex.unlock.

I parallelized a piece of software a while ago (using pthread_barrier) and ran into a case where executing the barriers took more time than just using a single thread. It turned out that the loop, which had 4 barriers in it, ran quickly enough that the overhead was not worth it.


Sorry, I'm not an expert in pthread or Intel VTune Amplifier, but yes, locking and unlocking the mutex will probably be counted as overhead time.

Locking and unlocking a mutex may be implemented as system calls, which the profiler would probably just classify as threading overhead.


I am not familiar with VTune, but the OS switches between threads. Each time one thread is stopped and another is loaded onto the processor, the current thread's context must be saved so it can be restored the next time that thread runs, and then the new thread's context must be restored so it can continue processing.

The problem may be that you have too many threads, so the processor spends most of its time switching between them. Multithreaded applications work most efficiently when the number of threads equals the number of processors.

