The parallel version of the loop is no faster than the serial version

I am writing a C++ program to simulate a particular system. For each timestep, most of the execution time is spent in a single loop. Fortunately, this loop is embarrassingly parallel, so I decided to use Boost Threads to parallelize it (I am running on a dual-core machine). I would expect a speedup close to 2x over the serial version, since there is no locking. However, I am finding that there is no speedup at all.

I implemented a parallel version of the loop as follows:

  • Wake up the two threads (they are blocked on a barrier).
  • Each thread then does the following:

    • Atomically fetch and increment a global counter.
    • Retrieve the particle with that index.
    • Perform the computation on that particle, storing the result in a separate array.
    • Wait on a job-finished barrier.
  • The main thread waits on the job-finished barrier.
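
For readers who want something concrete, the scheme in the list above can be sketched roughly as follows. This is only a minimal illustration, not the code from the question: it uses `std::thread` and `std::atomic` instead of Boost Threads to stay self-contained, it spawns and joins the threads on every call instead of keeping them parked on a barrier, and `compute` and `process_all` are hypothetical names with a placeholder computation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-particle computation.
double compute(double particle) {
    return particle * particle;
}

// Each worker claims the next unprocessed index from a shared atomic
// counter, so a thread that finishes early simply picks up more particles.
std::vector<double> process_all(const std::vector<double>& particles,
                                unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> results(particles.size());
    std::atomic<std::size_t> next{0};  // the global counter

    auto worker = [&] {
        for (;;) {
            // Atomically fetch and increment the global counter.
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= particles.size()) break;
            // Results go to a separate array, one slot per particle.
            results[i] = compute(particles[i]);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    // Joining here plays the role of the job-finished barrier.
    for (auto& th : pool) th.join();
    return results;
}
```

Calling `process_all(particles, 2)` matches the dual-core setup described above. On most toolchains this needs thread support enabled when linking (for example `-pthread` with GCC or Clang).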

I used this approach because it should provide good load balancing (each computation may take a different amount of time). I am really curious about what could cause this slowdown. I have always read that atomic variables are fast, but now I am starting to wonder whether they have performance costs of their own.

If anyone has ideas about what to look for, or any hints, I would really appreciate it. I have been banging my head against this for a week, and profiling has not revealed much.

: ! , . gprof, (-O3). , , : , .

. , , vtable voila. 2! .

, , , - !

. , .

+5
5

,

?

  • , , .
  • , - (.. ).
  • , (, 100 ..).
  • , .

... , , .

+2

profiling has not revealed much

. HP-UX, , . , , , . pthread_mutex_unlock(). , .

, / . .

( ) , . , - .

+1

:

, xxx

.


, - , . , .
, . , .

+1

, , ( ) .

:

  • .

  • , , . , (, ) , , .

  • 1- . , , , , .

0

OpenMP parallelism? , , , , OMP .

.

0