I'm not very good at this either, although I have experimented a bit with pthreads.
To illustrate what overhead means, take a simple single-threaded program that computes the sum of an array:
for (i = 0; i < NUM; i++) { sum += array[i]; }
In a simple [reasonably written] multi-threaded version of this code, the array is split into one chunk per thread, each thread keeps its own partial sum, and once the threads finish, the partial sums are added together.
In a very poorly written multithreaded version, the array can be split the same way, but each thread does an atomicAdd into a single global sum.
In that case, the atomic add can only be performed by one thread at a time. I believe the overhead here is a measure of how long all the other threads spend waiting for their turn at the atomicAdd (you can try writing this program yourself if you want to be sure).
Of course, it also includes the time it takes to acquire and release semaphores and mutexes. In your case, this probably means a significant amount of time is wasted inside mutex.lock and mutex.unlock.
I parallelized a piece of software some time ago (using pthread_barrier) and ran into a case where executing the barriers took more time than just running single-threaded. It turned out that the loop, which had 4 barriers in it, executed quickly enough that the synchronization overhead wasn't worth it.
zebediah49