Best gcc optimization flags for hyperthreading

Background

I have a four-thread EP (Embarrassingly Parallel) C application running on my laptop, which has an Intel i5 M 480 at 2.67 GHz. This processor has two hyper-threaded cores.

The four threads execute the same code on different subsets of the data. The code and data fit comfortably in a few cache lines (entirely in L1, with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers, and makes a few memory accesses (outside L1) to write results at the end of each sequence.

I compile with mingw64 4.8.1, which is fairly recent. The best base optimization level is -O1, with which four threads finish faster than two. -O2 and higher are slower (two threads complete faster than four, but slower than with -O1), as is -Os. Each thread averages 3.37 million sequences per second, which works out to roughly 780 clock cycles each. On average, each sequence performs 25.5 sub-operations, or one per 30.6 cycles.

So what the two hyper-threads execute in parallel every 30.6 cycles, a single thread would execute sequentially in 35-40 cycles, i.e. 17.5-20 cycles each.

Where I am

I think I need code that is less dense/efficient, so that the two hyper-threads are not constantly colliding over local CPU resources.

These switches work quite well (when compiling module by module):

-O1 -m64 -mthreads -g -Wall -c -fschedule-insns 

as does this, when compiling one module that includes all the others:

 -O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program 

There is no performance difference between the two.

Question

Has anyone experimented with this and achieved good results?

+6
4 answers

You say: "I think I need code that is less dense / efficient, so that the two hyper-threads don't constantly collide over local CPU resources." That is pretty much wrong.

Your CPU has a certain amount of resources. Code can make use of some of those resources, but usually not all of them. Hyperthreading means you have two threads drawing on the resources, so a higher percentage of them gets used.

What you want is to raise the percentage of resources used. Efficient code uses those resources more effectively in the first place, and adding hyperthreading on top can only help. You won't see as large a speedup *from hyperthreading*, but that is because you already captured that speedup in the single-threaded code by making it efficient. If you want bragging rights that hyperthreading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.

Now, if your code was latency-bound, it could execute quite a few useless instructions without penalty. With hyperthreading, those useless instructions suddenly have a real cost, because they compete with the other thread. So for hyperthreading you want to minimize the instruction count, especially instructions that were hidden behind latencies and had no apparent cost in single-threaded code.

+1

You could try locking each thread to a core using processor affinity. I have heard this can give you a 15%-50% efficiency gain with some code. The rationale is that when a thread stays on one processor, there is less cache churn from context switches, etc. This will work best on a machine that is only running your application.

+1

It is possible that hyperthreading is counterproductive; it often turns out to be under intense computational load.

I would try:

  • disabling it at the BIOS level and running two threads
  • trying to optimize for the SSE / AVX vector extensions, possibly even by hand

Explanation: HT is useful because hardware threads let the CPU schedule the execution of software threads more efficiently. However, both come with overhead. Scheduling 2 threads is lighter than scheduling 4, and if your code is already "tight", I would aim for even "denser" execution, optimizing as much as possible for 2 pipelines.

Obviously, less optimized code scales better, but well-optimized code is faster. So if you are after scalability, this answer is not for you... but if you are after speed, try it.

As already mentioned, there is no universal recipe for optimization, otherwise it would already be built into the compilers.

0

You could download the OpenCL or CUDA toolkit and implement a version for your graphics card... you might be able to speed it up 100x with minimal effort.

0
