Getting much less speedup from threading than expected - why?

I have a function evaluation that is somewhat slow, and I am trying to speed it up with threads, since there are three parts that can be computed in parallel. The single-threaded version is:

return dEdx_short(E) + dEdx_long(E) + dEdx_quantum(E); 

where one evaluation of these functions takes ~250 µs, ~250 µs, and ~100 µs, respectively. So I implemented a three-threaded version:

    double ret_short, ret_long, ret_quantum; // return values for the terms

    auto shortF = [this, &E, &ret_short] () { ret_short = this->dEdx_short(E); };
    std::thread t1(shortF);

    auto longF = [this, &E, &ret_long] () { ret_long = this->dEdx_long(E); };
    std::thread t2(longF);

    auto quantumF = [this, &E, &ret_quantum] () { ret_quantum = this->dEdx_quantum(E); };
    std::thread t3(quantumF);

    t1.join();
    t2.join();
    t3.join();

    return ret_short + ret_long + ret_quantum;

I expected this to take ~300 µs, but it actually takes ~600 µs - basically the same as the single-threaded version! All three functions are thread-safe, so no locking is needed. I measured thread creation time on my system at ~25 µs, and I'm not using all of my cores, so I'm puzzled as to why the parallel version is so slow. Could this have something to do with creating the lambdas?
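A minimal sketch of one way to take such a thread-creation measurement (the empty worker, clock, and iteration count here are assumptions, not the exact harness used):

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main()
    {
        const int iterations = 1000; // assumed sample size
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            std::thread t([] {}); // empty worker: measures create + join cost only
            t.join();
        }
        auto stop = std::chrono::steady_clock::now();
        auto total = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
        std::cout << "avg thread create+join: "
                  << total.count() / static_cast<double>(iterations) << " us\n";
    }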

I tried to avoid the lambdas, for example:

 std::thread t1(&StopPow_BPS::dEdx_short, this, E, ret_short); 

after rewriting the called function accordingly, but that gave me a compile error: attempt to use a deleted function ...
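For reference, one lambda-free pattern that does compile is std::async, which hands each result back through a std::future instead of a by-reference output variable. A minimal sketch, assuming the member functions take a double and return a double (the wrapper name dEdx is illustrative):

    #include <future>

    // member of the questioner's StopPow_BPS class
    double StopPow_BPS::dEdx(double E)
    {
        // Each term runs on its own thread; the results come back through
        // futures, so no shared output variables are needed.
        auto fs = std::async(std::launch::async, &StopPow_BPS::dEdx_short,   this, E);
        auto fl = std::async(std::launch::async, &StopPow_BPS::dEdx_long,    this, E);
        auto fq = std::async(std::launch::async, &StopPow_BPS::dEdx_quantum, this, E);
        return fs.get() + fl.get() + fq.get();
    }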

c++ multithreading lambda c++11
1 answer

You may be experiencing false sharing: the three doubles are adjacent in memory, so they likely sit on the same cache line, and each thread's write invalidates the other cores' copies of that line. To check, store each return value in a type that occupies an entire cache line (the size depends on the CPU):

    const int cacheLineSize = 64; // bytes

    union CacheFriendly {
        double value;
        char dummy[cacheLineSize]; // pad each variable to a full cache line
    } ret_short, ret_long, ret_quantum; // return values for the terms
    // ...
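Since the question is tagged c++11, alignas can express the same padding more directly. A sketch, with the 64-byte line size still an assumption to verify for the target CPU:

    struct alignas(64) PaddedDouble {
        double value; // one slot per thread, each on its own cache line
    };

    PaddedDouble ret_short, ret_long, ret_quantum;
    // the threads then write ret_short.value, ret_long.value, and
    // ret_quantum.value without invalidating each other's cache lines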
