Background
I have a four-thread EP (embarrassingly parallel) C application on my laptop, which has an Intel i5 M 480 running at 2.67 GHz. This processor has two hyper-threaded cores.
The four threads execute the same code on different subsets of the data. The code and data fit comfortably in a few cache lines (entirely within L1, with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers, and performs a few memory accesses (outside L1) to write out the results at the end of each sequence.
The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level turns out to be -O1, which results in four threads finishing faster than two. -O2 and higher are slower (two threads complete faster than four, but slower than with -O1), as is -Os. Each thread averages 3.37 million sequences per second, which works out to roughly 780 clock cycles each. On average, each sequence performs 25.5 sub-operations, or one per 30.6 cycles.
So the work that two hyper-threads do in parallel in 30.6 cycles would take a single thread about 35-40 cycles to execute sequentially, i.e. 17.5-20 cycles per sub-operation.
Where I am
I suspect I need code that is less dense/efficient, so that the two hyper-threads don't constantly collide over the shared resources of their CPU core.
These switches work quite well (compiling each module separately):
-O1 -m64 -mthreads -g -Wall -c -fschedule-insns
as do these, when compiling one module that includes all the others:
-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program
There is no performance difference between the two approaches.
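For concreteness, the two build strategies might look like this. The file names (part1.c, part2.c, all.c, app) are hypothetical placeholders; the question does not name the actual source files.

```shell
# Per-module build: each translation unit compiled separately, then linked.
gcc -O1 -m64 -mthreads -g -Wall -c -fschedule-insns part1.c part2.c
gcc -o app part1.o part2.o

# Whole-program build: all.c #includes every other module, so the
# compiler sees the entire program in one translation unit.
gcc -O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program all.c
gcc -o app all.o
```

-fwhole-program only helps when the compiler really does see the whole program at once, which is why it is paired with the single-module build here.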
Question
Has anyone experimented with this and achieved good results?