Can I use FMA instead of ADD for XMM / YMM FP operations on Intel Haswell?

This question is about packed, single-precision floating-point operations with XMM / YMM registers on Haswell.

So, according to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with a reciprocal throughput of 0.5), while ADD is done only on port p1 (with a reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with a reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to p1 only, when FMA can use either p0 or p1 and does both an ADD and a MUL. Am I misunderstanding the table? Or can someone explain why this would be?

That is, if my reading is correct, why wouldn't Intel just use the FMA op as the basis for both plain MUL and plain ADD, and thereby increase the throughput of ADD as well as MUL? Alternatively, what would stop me from using two simultaneous, independent FMA ops to emulate two simultaneous, independent ADD ops? What are the penalties associated with doing ADD-by-FMA? Obviously, more registers are used (2 regs for ADD vs 3 regs for ADD-by-FMA), but other than that?

1 answer

You are not the only one confused as to why Intel did this. Agner Fog writes about Haswell in his microarchitecture manual:

It is strange that there is only one port for floating point addition, but two ports for floating point multiplication.

On Agner's message board, he also writes:

There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal, since floating point code typically contains more additions than multiplications.

That thread continues with more information on the subject, which I suggest you read, but I won't quote it here.

He also discusses this in his answer here: flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2

The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going (5 cycles of latency × 2 issued per cycle) to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it into ten parts and use ten accumulator registers.

It is indeed possible, but who would do such a strange optimization for one particular processor?

His answer there basically answers your question. You can use FMA to double the throughput of addition. In fact, I do this in my throughput tests for addition, and I do see that it doubles.

In summary, for addition: if your calculation is latency bound, then don't use FMA, use ADD. But if it is throughput bound, you can try using FMA (by setting the multiplier to 1.0), though you will probably have to use many AVX registers to do this.
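To make the "multiplier of 1.0" trick concrete, here is a minimal sketch in C with AVX/FMA intrinsics. The function names (`add8_vaddps`, `add8_vfmadd`, `add_matches_fma`) are made up for illustration, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions. Since `a * 1.0` is exact and FMA rounds only once, the FMA form should give bit-identical results to a plain add. Note that a compiler may simplify a multiply by a constant 1.0 back into a plain add, so check the generated assembly if you rely on this.

```c
#include <immintrin.h>
#include <string.h>

/* Plain packed add of 8 floats (VADDPS): port 1 only on Haswell. */
__attribute__((target("avx,fma")))
static void add8_vaddps(const float *a, const float *b, float *out)
{
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}

/* ADD-by-FMA: computes a * 1.0 + b with a single rounding.
 * Costs one extra register to hold the constant 1.0. */
__attribute__((target("avx,fma")))
static void add8_vfmadd(const float *a, const float *b, float *out)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(_mm256_loadu_ps(a), one, _mm256_loadu_ps(b)));
}

/* Hypothetical helper: returns 1 if both variants produce bit-identical
 * results for the 8 floats at a and b, 0 if they differ, and -1 if the
 * CPU does not support AVX and FMA (in which case nothing is executed). */
int add_matches_fma(const float *a, const float *b)
{
    float r_add[8], r_fma[8];

    if (!__builtin_cpu_supports("avx") || !__builtin_cpu_supports("fma"))
        return -1;

    add8_vaddps(a, b, r_add);
    add8_vfmadd(a, b, r_fma);
    return memcmp(r_add, r_fma, sizeof r_add) == 0;
}
```

The only visible cost in the source is the extra register for the constant; the higher latency of FMA compared with ADD only matters when the additions depend on each other.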

I unrolled the loop 10 times to get maximum throughput here: loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell
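As a rough illustration of what such an unrolled sum could look like (this is not the code from the linked question), here is a sketch in C that keeps ten independent accumulators and adds to each one with an FMA whose multiplier is 1.0. The names `sum_fma_unrolled`, `sum_floats`, and `NACC` are made up for this example, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions. The sketch relies on the compiler unrolling the fixed-count inner loop and keeping the accumulators in registers; a hand-tuned version would write the ten FMAs out explicitly.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* FMA latency (5 cycles) x FMA throughput (2 per cycle) on Haswell. */
#define NACC 10

/* Sum n floats using NACC independent YMM accumulators, each updated
 * with acc = x * 1.0 + acc so the adds can issue on both FMA ports. */
__attribute__((target("avx,fma")))
static float sum_fma_unrolled(const float *x, size_t n)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 acc[NACC];
    float lanes[8];
    float s = 0.0f;
    size_t i = 0;

    for (int k = 0; k < NACC; k++)
        acc[k] = _mm256_setzero_ps();

    /* Main loop: 10 vectors (80 floats) per iteration, with no
     * dependency between the ten accumulator chains. */
    for (; i + 8 * NACC <= n; i += 8 * NACC)
        for (int k = 0; k < NACC; k++)
            acc[k] = _mm256_fmadd_ps(_mm256_loadu_ps(x + i + 8 * k), one, acc[k]);

    /* Combine the accumulators, then the 8 lanes, then the scalar tail. */
    for (int k = 1; k < NACC; k++)
        acc[0] = _mm256_add_ps(acc[0], acc[k]);
    _mm256_storeu_ps(lanes, acc[0]);
    for (int j = 0; j < 8; j++)
        s += lanes[j];
    for (; i < n; i++)
        s += x[i];
    return s;
}

/* Dispatcher: uses the FMA version when the CPU supports it,
 * otherwise falls back to a plain scalar loop. */
float sum_floats(const float *x, size_t n)
{
    float s = 0.0f;

    assert(x != NULL || n == 0);
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        return sum_fma_unrolled(x, n);
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

Ten accumulators plus the constant 1.0 already take 11 of the 16 YMM registers, which is the register pressure mentioned above. Splitting the sum this way also changes the order of the additions, so the result can differ in the last bits from a simple sequential sum.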


Source: https://habr.com/ru/post/1214684/

