SSE and hyperthreading

Are SSE registers shared or duplicated between logical processors (hyperthreading)? Can we expect the same speedup from parallelizing an SSE-heavy program as from parallelizing an ordinary program (Intel claims about 30% for processors with hyperthreading)?

2 answers

It is not clear to me from the Intel documentation whether Hyperthreading processors share one register file between threads or have two separate ones (I would assume they really are separate, because otherwise the context-switch time between HT threads would be quite high, but this is purely a guess).

As for the speedup, it will depend on your instruction mix and scheduling. Remember that an HT processor does not have additional execution resources (ALUs, load/store units, etc.). The performance improvement comes from making better use of those resources, since typical code, especially on a modern processor, spends a fair amount of time stalled, waiting for memory loads to complete before execution can continue. HT lets these loads and stores overlap with useful work: while one thread is stalled on a read, the other can switch in and start using execution resources that would otherwise sit idle.
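To put rough numbers on the "idle resources" argument above, here is a back-of-the-envelope model in Python. It is only a sketch under strong assumptions, not a description of how any Intel processor actually schedules work: each thread is assumed to be stalled for a fixed fraction of cycles, and one thread's stalls are assumed to overlap perfectly with the other thread's work. The names `ht_speedup` and `stall_fraction` are made up for illustration.

```python
def ht_speedup(stall_fraction: float) -> float:
    """Idealized throughput gain from running two threads on one HT core.

    Toy model (an assumption, not taken from Intel documentation):
    a single thread keeps the execution units busy for
    (1 - stall_fraction) of the cycles, and two threads together
    can never use more than 100% of the cycles.
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    busy_one = 1.0 - stall_fraction      # utilization with one thread
    busy_two = min(1.0, 2.0 * busy_one)  # utilization with two threads, capped
    return busy_two / busy_one


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 0.75):
        print(f"stalled {s:.0%} of the time -> speedup x{ht_speedup(s):.2f}")
```

In this model a thread that never stalls gains nothing from HT, and the gain can never exceed 2x because the second logical processor adds no execution units. Real speedups will be lower, since the two threads also compete for caches and other shared structures.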

I would guess that the performance increase you see from multithreading an SSE program depends on the ratio of memory operations to arithmetic operations. If, for example, your SSE program loads 4 SSE registers from memory, performs 10,000 SSE operations on them, and then writes the 4 registers back, you are unlikely to see much of the benefit HT gets from hiding memory stalls, because 99% of your program's execution time will be spent in the SIMD ALUs, not in memory access.
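The arithmetic behind that example can be checked with a few lines of Python. This is a minimal sketch that counts instructions only; it assumes every load, store and SSE operation costs about the same, which is not true when a load misses the cache. The function name `arithmetic_fraction` is hypothetical.

```python
def arithmetic_fraction(memory_ops: int, arithmetic_ops: int) -> float:
    """Share of instructions that are arithmetic rather than loads/stores."""
    total = memory_ops + arithmetic_ops
    if total <= 0:
        raise ValueError("need at least one instruction")
    return arithmetic_ops / total


if __name__ == "__main__":
    # Numbers from the example: 4 loads, 10,000 SSE operations, 4 stores.
    loads, stores, simd_ops = 4, 4, 10_000
    share = arithmetic_fraction(loads + stores, simd_ops)
    print(f"{share:.2%} of instructions are SIMD arithmetic")
```

With so few memory operations there is almost nothing for the second hardware thread to hide, so both threads end up competing for the same SIMD units.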

On the other hand, if your program is very compute-intensive, then multithreading it can significantly increase performance on multi-core processors and can give you much more than a 30% improvement, since in that case your code has access to the full execution resources of several cores at once.


They are logically duplicated - each thread gets its own state. Physically, they may be shared - it depends on the hyperthreading implementation.
