How to parallelize a small pure function?

I have a D2 program that, in its current form, is single-threaded and calls the same pure function 10 to 100 times in the inner loop for each iteration of its outer loop. There is no data dependency between calls, i.e., no call uses the result of any other call. In total, this function is called millions of times and is the main bottleneck in my program. The arguments are unique almost every time, so caching will not help.

At first glance, this seems like an ideal candidate for parallelization. The only problem is that the function takes only about 3 microseconds per call, which is far less than the latency of creating a new thread and not much more than the overhead of adding a task to a task pool (acquiring a mutex, allocating memory to hold information about the task, dealing with possible contention for the task pool's queue, etc.). Is there a good way to exploit parallelism that is this fine-grained?

+5
8 answers

, ? , .

+3

, .

, , ( - ). , , , get queue ...

, , .

, , , , .

"", , , . , , .

, , , .

: , ThreadPoolExecutor. (, ), , , , ( )

+3

, , , "" , . - :

void OuterFunction( Thingy inputData[N] )
{
  for ( int i = 0 ; i < N ; ++i )
    InnerFunction( inputData[i] );
}

(, ):

void JobFunc( Thingy inputData[], int start, int stop )
{
  for ( int i = start ; i < stop ; ++i )
    InnerFunction( inputData[i] );  
}
void OuterFunction( Thingy inputData[N], int numCores )
{
   int perCore = N / numCores; // assuming N%numCores=0 
                               // (omitting edge case for clarity)
   for ( int c = 0 ; c < numCores ; ++c )
     QueueJob( JobFunc, inputData, c * perCore, (c + 1) * perCore );
}
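The chunking idea in the pseudocode above can be sketched as runnable C++ (chosen to match the C-style pseudocode; the question itself concerns D). This is a minimal sketch under stated assumptions, not a tuned implementation: `InnerFunction` here is a hypothetical stand-in for the real function, `OuterFunctionChunked` is an illustrative name, and the chunk size is rounded up so the case where N is not divisible by the worker count is covered.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real ~3 microsecond pure function.
double InnerFunction(double x) { return x * x + 1.0; }

// Splits the input into one contiguous chunk per worker, so the cost of
// starting a thread is paid once per chunk instead of once per call.
std::vector<double> OuterFunctionChunked(const std::vector<double>& input,
                                         unsigned numWorkers)
{
    const std::size_t n = input.size();
    std::vector<double> output(n);
    if (n == 0) return output;

    numWorkers = std::max(1u, numWorkers);
    // Round up so the last chunk absorbs any remainder.
    const std::size_t perWorker = (n + numWorkers - 1) / numWorkers;

    std::vector<std::thread> workers;
    for (std::size_t start = 0; start < n; start += perWorker) {
        const std::size_t stop = std::min(n, start + perWorker);
        // Each worker writes only to its own slice of output, so no
        // locking is needed.
        workers.emplace_back([&input, &output, start, stop] {
            for (std::size_t i = start; i < stop; ++i)
                output[i] = InnerFunction(input[i]);
        });
    }
    for (auto& w : workers) w.join();
    return output;
}
```

Starting fresh threads on every outer iteration would likely cost more than the 10 to 100 calls it distributes, so in practice the same slicing would be handed to long-lived worker threads or an existing task pool; the sketch only shows how the index range is divided.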

, , ; , , .

, : , . .

SIMD, , . 4- SIMD 16- , , InnerFunction , SSE/VMX.

+2

... , , . , , ... :

, , , , .

, concurrency, , . ? , , . , Amdahl, , - , , , . , , , ( ) , .

, , , , , , . , , , , , , , , . , , at , .

, , , .

+2

. 50 , .

+1

-, SIMD. , 4 , SSE. . , SSE, , .

+1

Compare-and-Swap, :

void OuterFunction()
{
  for(int i = 0; i < N; i++)
    InnerFunction(i);
}

:

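void OuterFunction()
{
   int i = 0, j = 0; // i: next index to claim, j: number of finished calls

   void Go()
   {
      int k;
      // atomicInc is assumed to return the value before the increment,
      // so each index in 0..N-1 is claimed by exactly one thread.
      while((k = atomicInc(&i)) < N)
      {
         InnerFunction(k);

         atomicInc(&j);
      }
   }

   for(int t = 0; t < ThreadCount - 1; t++) Thread.Start(&Go);

   Go(); // join in

   while(j < N) Wait(); // let everyone else catch up.
}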
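For comparison, here is a hedged C++ rendering of the atomic-counter loop above, using `std::atomic::fetch_add`, which returns the value held before the addition. `InnerFunction`, `OuterFunctionAtomic`, and the `batch` parameter are illustrative assumptions, not part of the original answer: claiming several indices per atomic operation is one way to keep contention on the shared counter small relative to a 3 microsecond call.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real pure function.
double InnerFunction(double x) { return x * x + 1.0; }

// Threads claim work by atomically advancing a shared index; no mutex or
// task queue is involved.
std::vector<double> OuterFunctionAtomic(const std::vector<double>& input,
                                        unsigned threadCount,
                                        std::size_t batch = 8)
{
    const std::size_t n = input.size();
    std::vector<double> output(n);
    std::atomic<std::size_t> next{0};  // next unclaimed index
    batch = std::max<std::size_t>(1, batch);

    auto go = [&] {
        for (;;) {
            // Claim the range [start, start + batch) in one atomic step.
            const std::size_t start = next.fetch_add(batch);
            if (start >= n) break;
            const std::size_t stop = std::min(n, start + batch);
            for (std::size_t i = start; i < stop; ++i)
                output[i] = InnerFunction(input[i]);
        }
    };

    std::vector<std::thread> helpers;
    for (unsigned t = 1; t < threadCount; ++t) helpers.emplace_back(go);
    go();                              // the calling thread joins in
    for (auto& h : helpers) h.join();  // stands in for the spin-wait on j
    return output;
}
```

With `batch` set to 1 this behaves like the one-index-at-a-time loop; larger values trade some load balancing for fewer atomic operations.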

: , ,

0

, .. .

, , . , -? IO, - , ?

If the answer to these questions is yes, then the previous suggestions are fine; just try to make the work as coarse-grained as possible by giving each thread as many executions of the function as you can.

Even so, you probably won't gain much from massive parallelism, though a more modest speedup may be possible ...

0