You are not the only one confused about why Intel did this. Agner Fog writes about Haswell in his microarchitecture manual:
It is strange that there is only one port for floating point addition, but two ports for floating point multiplication.
On his bulletin board, Agner also writes:
There are two execution units for floating point multiplication and fused multiply-and-add, but only one execution unit for floating point addition. This design seems suboptimal, since floating point code typically contains more additions than multiplications.
That thread continues with further discussion of the topic, which I suggest you read but will not quote here.
He also discusses this in his answer here: flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2
The FMA latency on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 operations in flight to get the maximum throughput. If, for example, you want to add a very long list of floating point numbers, you would have to split it into ten parts and use ten accumulator registers.
It is indeed possible, but who would do such a strange optimization for one particular processor?
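To make the ten-accumulator point concrete, here is a minimal sketch of my own (not Agner's code) that splits a long sum into ten independent partial sums; the function name and the interleaved stride-10 layout are arbitrary, and the compiler still has to map these additions onto the FP add/FMA ports for the full benefit:

```c
#include <stddef.h>

/* Illustrative sketch: ten independent accumulators mean ten additions
   can be in flight at once, hiding the latency of each add.
   Assumes n is a multiple of 10 for brevity. */
double sum_ten_accumulators(const double *a, size_t n)
{
    double acc[10] = {0};

    for (size_t i = 0; i < n; i += 10)
        for (int k = 0; k < 10; k++)
            acc[k] += a[i + k];   /* each acc[k] is its own dependency chain */

    /* Combine the ten partial sums at the end. */
    double total = 0.0;
    for (int k = 0; k < 10; k++)
        total += acc[k];
    return total;
}
```

The point is that consecutive additions to the same accumulator must wait for each other, so a single running sum can never exceed one result per latency; ten independent chains are needed to saturate two FMA ports with a 5-cycle latency.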
His answer there essentially answers your question: you can use FMA to double the addition throughput. In fact, I do this in my throughput tests for addition and see that it does indeed double.
To conclude on addition: if your computation is latency bound, then do not do addition with FMA. But if it is throughput bound, you can try doing addition with FMA (setting the multiplier to 1.0), though you will probably have to use many AVX registers to do so.
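As a hedged sketch (my own illustration, not code from the linked answers) of what "addition with FMA, multiplier set to 1.0" might look like with AVX2/FMA intrinsics; the function name and the choice of ten accumulators are mine, following Agner's latency-times-throughput figure, and it assumes compilation with -mavx2 -mfma:

```c
#include <stddef.h>
#include <immintrin.h>   /* AVX2 + FMA intrinsics */

/* Illustrative sketch: sum a large array of doubles through the FMA ports
   by computing a[i] * 1.0 + acc, so both FMA ports (ports 0 and 1 on
   Haswell) can be used for what is effectively addition.
   Assumes n is a multiple of 40 (10 accumulators * 4 doubles per __m256d). */
double sum_via_fma(const double *a, size_t n)
{
    const __m256d one = _mm256_set1_pd(1.0);
    __m256d acc[10];
    for (int i = 0; i < 10; i++)
        acc[i] = _mm256_setzero_pd();

    for (size_t j = 0; j < n; j += 40)
        for (int i = 0; i < 10; i++)
            /* acc[i] += a[j + 4*i .. j + 4*i + 3] * 1.0 */
            acc[i] = _mm256_fmadd_pd(_mm256_loadu_pd(a + j + 4 * i), one, acc[i]);

    /* Reduce the ten vector accumulators to a single scalar. */
    __m256d total = acc[0];
    for (int i = 1; i < 10; i++)
        total = _mm256_add_pd(total, acc[i]);
    double tmp[4];
    _mm256_storeu_pd(tmp, total);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```

Each _mm256_fmadd_pd here is effectively a vector addition, but because it is an FMA instruction it can issue on either of Haswell's two FMA ports, whereas plain vaddpd is limited to the single FP add port.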
I unrolled 10 times to get the maximum throughput here: loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell