The answer you are quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, caches, and physical register file.
Running two threads at once on the same core lets the core find more parallelism to keep those execution units fed with work, instead of sitting idle waiting for cache misses, instruction latency, and branch mispredicts.
Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and most of them are in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows that Sandybridge statically partitioned the IDQ (the decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one large queue when only one thread is active. It also describes how the caches are competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers, which is why register renaming onto a big physical register file is done in the first place.)
Modern Intel CPUs have so many execution units that you can barely saturate them even with carefully tuned code that has no stalls and sustains 4 fused-domain uops per clock. That's very rare in practice, outside of something like a matrix multiply in a hand-tuned BLAS library.
Most code gets a benefit from HT because it can't saturate a full core on its own, so the existing resources of one core can run two threads at faster than half speed each (usually significantly faster than half).
But when only one thread is running, the full power of a big core is available to that thread. This is what you lose out on if you design a multicore CPU with lots of small cores instead. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps a few single-threaded workloads, but helps a lot more with HT. So you could argue it's a case of replicating ALUs because the design supports HT, but it isn't essential.
Pentium 4 really didn't have enough execution resources to run two full threads without losing more than you gained. Part of that may have been the trace cache, but it also had nowhere near as many execution units. P4 with HT made it useful to run prefetch threads that do nothing but prefetch data from an array the main thread loops over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
HT doesn't help at all for code that's already bottlenecked on back-end throughput, e.g. sustaining peak FMA throughput by keeping 10 FMAs in flight with 10 vector accumulators. It can even hurt code that ends up slowed down by extra cache misses caused by competitively sharing the L1D and L2 caches with another thread (and the uop cache and L1I cache as well).
Agner Fog's microarch pdf says much the same thing.
Paul Clayton's comments on the question also make some good points about SMT designs in general.