Be very careful with division, and avoid it whenever you can. For example, hoist `float inverse = 1.0f / divisor;` out of a loop and multiply by `inverse` inside the loop. (If the rounding error in `inverse` is acceptable.)
Usually `1.0/x` will not be exactly representable as a `float` or `double`. It will be exact when `x` is a power of 2. This lets compilers optimize `x / 2.0f` to `x * 0.5f` without any change in the result.
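A minimal sketch of that distinction (the values here are illustrative):

```c++
#include <cassert>

int main() {
    float x = 123.456f;
    // 0.125f (= 1/8) is exactly representable, so the compiler may rewrite
    // x / 8.0f as x * 0.125f: both round to the identical result.
    assert(x / 8.0f == x * 0.125f);

    // 1/3 is not exactly representable as a float, so multiplying by the
    // rounded reciprocal can differ from a true divide in the last bit.
    float inv3 = 1.0f / 3.0f;
    float a = x / 3.0f;
    float b = x * inv3;   // may be one ulp away from a, depending on x
    return (a == b) ? 0 : 1;
}
```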
To let the compiler do this optimization for you even when the result isn't exact (or with a runtime-variable divisor), you need options like `gcc -O3 -ffast-math`. Specifically, `-freciprocal-math` (enabled by `-funsafe-math-optimizations`, which is enabled by `-ffast-math`) lets the compiler replace `x / y` with `x * (1/y)` when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimizations by default (I think it does, but I forget).
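For example, a loop like this (a hypothetical function, shown only for illustration) is one the compiler can transform that way:

```c++
// With g++ -O3 -ffast-math (implying -freciprocal-math), the compiler is
// allowed to compute 1.0f/y once outside the loop and turn the per-element
// divide into a multiply. Without those options it must keep the divide.
void scale_all(float* a, const float* b, float y, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] / y;
}
```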
`-ffast-math` is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is non-associative. See also: Why doesn't GCC optimize `a*a*a*a*a*a` to `(a*a*a)*(a*a*a)`?
Also note that C++ compilers can contract `+` and `*` into an FMA in some cases (when compiling for a target that supports it, e.g. `-march=haswell`), but they can't do that with `/`.
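For instance (hypothetical functions, just to illustrate the contraction rule):

```c++
// Compiled for an FMA-capable target (e.g. g++ -O2 -march=haswell), the
// compiler may fuse the multiply and add into a single vfmadd instruction.
float mul_add(float a, float b, float c) {
    return a * b + c;   // contraction into FMA is allowed
}

// There is no fused divide-add, so the divide always stays a real divide.
float div_add(float a, float b, float c) {
    return a / b + c;
}
```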
Division has worse latency than multiply or add (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 40 (see footnote 1 below; this is for a tight loop doing only division instead of only multiplication).
The divide / sqrt unit is not fully pipelined, for the reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have a hardware performance counter, `arith.divider_active`, to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution-port bottlenecks. (Or more often, memory bottlenecks or long latency chains that limit instruction-level parallelism, causing instruction throughput to be less than ~4 per clock.)
However, FP division and sqrt on Intel and AMD CPUs (other than KNL) are implemented as a single uop, so they don't necessarily have a big throughput impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.
(Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code than integer multiply does. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like `idiv` can cause alignment-sensitive front-end bottlenecks.)
So, for example, this would be really bad:

```c++
for (int i = 0; i < n; ++i)
    a[i] = b[i] / scale;   // division throughput bottleneck

// Instead, use this:
float inv = 1.0f / scale;  // rounds once, outside the loop
for (int i = 0; i < n; ++i)
    a[i] = b[i] * inv;     // multiply (or store) throughput bottleneck
```
All you're doing in the loop is load / divide / store, and the iterations are independent, so throughput matters, not latency.
A reduction like `accumulator /= b[i]` would bottleneck on divide latency rather than throughput. But with multiple accumulators that you divide or multiply together at the end, you can hide the latency and still saturate the throughput. Note that `sum += a[i] / b[i]` bottlenecks on `add` latency or `div` throughput, but not `div` latency, because the division isn't on the critical path (the loop-carried dependency chain).
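A minimal sketch of the multiple-accumulator idea (the function name and the unroll factor of 4 are illustrative, not tuned):

```c++
// Four independent dependency chains let out-of-order execution overlap the
// long-latency divides instead of serializing them on one accumulator.
float divide_reduce(const float* b, int n) {
    float acc0 = 1.0f, acc1 = 1.0f, acc2 = 1.0f, acc3 = 1.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 /= b[i];
        acc1 /= b[i + 1];
        acc2 /= b[i + 2];
        acc3 /= b[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        acc0 /= b[i];
    // Combining at the end changes the rounding order slightly, which is
    // exactly the reassociation a compiler won't do without -ffast-math.
    return (acc0 * acc1) * (acc2 * acc3);
}
```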
But in something like this (approximating a function such as `log(x)` with the ratio of two polynomials), the divide can be pretty cheap:
```c++
for (int i = 0; i < n; ++i) {
    // not shown: extracting the exponent / mantissa of b[i]
    float p = polynomial_p(b[i]);  // one FMA chain (placeholder name)
    float q = polynomial_q(b[i]);  // a second, independent FMA chain
    a[i] = p / q;                  // a single divide per loop body
}
```
For `log()` over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating the two in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dependency chain, which makes things a lot easier for out-of-order execution.
In this case we don't bottleneck on divide latency, because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.
We don't bottleneck on divide throughput as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real `log()` use case there's a bunch of work extracting the exponent / mantissa and combining things back together, so there's even more work to do between divides.)
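As a sketch, the placeholder `polynomial_p` above could be a Horner-rule chain of FMAs (the coefficients here are dummies, not a real `log()` fit):

```c++
#include <cmath>

// Horner's rule: one fused multiply-add per coefficient, forming a single
// dependency chain of length N. Evaluating p and q this way gives two
// independent chains per loop body.
inline float polynomial_p(float x) {
    const float c[] = {0.25f, -0.5f, 1.0f, 0.0f};  // dummy coefficients
    float r = c[0];
    for (int k = 1; k < 4; ++k)
        r = std::fma(r, x, c[k]);  // r = r*x + c[k]
    return r;
}
```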
When you do need to divide, it's usually best to just divide instead of using `rcpps`
x86 has an approximate-reciprocal instruction (`rcpps`) which gives you only 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)
You can use this to compute `x/y = x * approx_recip(y)` without using an actual divide instruction. (`rcpps` itself is fairly fast, usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table as a starting point.)
For most purposes, `x * rcpps(y)` is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If all you're doing is divides, it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, perhaps by doing the division as part of another loop that does other work.)
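A minimal SSE/FMA sketch of that recipe (a single refinement step; exact instruction counts vary with the variant chosen, and this assumes an FMA-capable target, e.g. `-mfma`):

```c++
#include <immintrin.h>

// x/y via rcpps plus one Newton-Raphson step: r' = r * (2 - y*r).
// rcpps gives ~12 bits; one iteration roughly doubles that to ~24 bits,
// close to (but not always bit-identical to) a correctly-rounded divps.
__m128 div_approx(__m128 x, __m128 y) {
    __m128 r = _mm_rcp_ps(y);                           // r ~ 1/y, 12 bits
    __m128 e = _mm_fnmadd_ps(y, r, _mm_set1_ps(2.0f));  // e = 2 - y*r
    r = _mm_mul_ps(r, e);                               // refined 1/y
    return _mm_mul_ps(x, r);                            // x * (1/y)
}
```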
But if you're using division as part of a more complex function, then the `rcpps` itself plus the extra mul + FMA usually make it faster to just divide with a `divps` instruction, except on CPUs with very low `divps` throughput.
(For example Knight's Landing, see below. KNL supports AVX512ER, so for `float` vectors the `VRCP28PS` result is already accurate enough to just multiply without a Newton-Raphson iteration; the `float` mantissa is only 24 bits.)
Specific numbers from Agner Fog's instruction tables:
Unlike any other ALU operation, divide latency/throughput is data-dependent on some CPUs. Again, this is because division is so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids writeback conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3-cycle instruction and then two 1-cycle operations).
As a rule, the fastest cases are when the divisor is a "round" number like `2.0` or `0.5` (i.e. when the base2 `float` representation has lots of trailing zeros in the mantissa).
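A crude sketch for observing this yourself on a CPU with data-dependent divide latency (e.g. Haswell; timing methodology kept deliberately simple, and compile without `-ffast-math` so the divide survives):

```c++
#include <chrono>
#include <cstdio>

// Serial divide/multiply chain, so we measure latency, not throughput.
// Multiplying back by d keeps x in a normal range for the whole run.
static double time_chain(double d, long iters) {
    double x = 1.234;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        x = (x / d) * d;   // each divide depends on the previous iteration
    auto t1 = std::chrono::steady_clock::now();
    std::printf("  (x = %f)\n", x);   // keep the loop from being deleted
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const long n = 100000000;
    std::printf("divisor 2.0: %.3f s\n", time_chain(2.0, n));  // "round"
    std::printf("divisor 2.1: %.3f s\n", time_chain(2.1, n));  // not round
}
```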
`float` latency (cycles) / throughput (cycles per instruction, running just that instruction back-to-back with independent inputs):
```
                  scalar & 128b vector         256b AVX vector
                  divss       | mulss
                  divps xmm   | mulps           vdivps ymm | vmulps ymm

Nehalem            7-14 /  7-14 | 5 / 1         (No AVX)
Sandybridge       10-14 / 10-14 | 5 / 1         21-29 / 20-28 (3 uops) | 5 / 1
Haswell           10-13 / 7     | 5 / 0.5       18-21 / 14    (3 uops) | 5 / 0.5
Skylake              11 / 3     | 4 / 0.5          11 / 5     (1 uop)  | 4 / 0.5

Piledriver         9-24 / 5-10  | 5-6 / 0.5      9-24 / 9-20  (2 uops) | 5-6 / 1 (2 uops)
Ryzen                10 / 3     | 3 / 0.5          10 / 6     (2 uops) | 3 / 1   (2 uops)

Low-power CPUs:
Jaguar (scalar)      14 / 14    | 2 / 1
Jaguar               19 / 19    | 2 / 1            38 / 38    (2 uops) | 2 / 2   (2 uops)

Silvermont (scalar)  19 / 17    | 4 / 1
Silvermont        39 / 39 (6 uops)  | 5 / 2      (No AVX)

KNL (scalar)      27 / 17 (3 uops)  | 6 / 0.5
KNL               32 / 20 (18 uops) | 6 / 0.5      32 / 32 (18 uops)   | 6 / 0.5 (AVX and AVX512)
```
`double` latency (cycles) / throughput (cycles per instruction):
```
                  scalar & 128b vector         256b AVX vector
                  divsd       | mulsd
                  divpd xmm   | mulpd           vdivpd ymm | vmulpd ymm

Nehalem            7-22 /  7-22 | 5 / 1         (No AVX)
Sandybridge       10-22 / 10-22 | 5 / 1         21-45 / 20-44 (3 uops) | 5 / 1
Haswell           10-20 /  8-14 | 5 / 0.5       19-35 / 16-28 (3 uops) | 5 / 0.5
Skylake           13-14 / 4     | 4 / 0.5       13-14 / 8     (1 uop)  | 4 / 0.5

Piledriver         9-27 / 5-10  | 5-6 / 1        9-27 / 9-18  (2 uops) | 5-6 / 1 (2 uops)
Ryzen              8-13 / 4-5   | 4 / 0.5        8-13 / 8-9   (2 uops) | 4 / 1   (2 uops)

Low-power CPUs:
Jaguar               19 / 19    | 4 / 2            38 / 38    (2 uops) | 4 / 2   (2 uops)

Silvermont (scalar)  34 / 32    | 5 / 2
Silvermont        69 / 69 (6 uops)  | 5 / 2      (No AVX)

KNL (scalar)      42 / 42 (3 uops)  | 6 / 0.5   (Yes, Agner really lists scalar as slower than packed, but fewer uops)
KNL               32 / 20 (18 uops) | 6 / 0.5      32 / 32 (18 uops)   | 6 / 0.5 (AVX and AVX512)
```
IvyBridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) had better divider performance, but its max clock speeds were lower.)
Atom, Silvermont, and even Knight's Landing (the Xeon Phi based on Silvermont) have exceptionally low divide performance, and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has a low power budget per core, and packing lots of cores on a die gives it tighter die-area constraints than Skylake-AVX512. It seems AVX512ER `rcp28ps` / `pd` is what you're "supposed" to use on KNL.
(See this InstLatx64 result for Skylake-X, aka Skylake-AVX512. The numbers for `vdivps zmm`: 18c / 10c, so half the throughput of `ymm`.)
Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.
Footnote 1: how I came up with those div vs. mul performance ratios:
FP divide vs. multiply performance ratios are even worse on low-power CPUs like Silvermont and Jaguar, and even on Xeon Phi (KNL, where you should use AVX512ER).
Actual divide vs. multiply throughput ratios for scalar (non-vectorized) `double`: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely towards the 28-cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, yet their 2-per-clock multiply throughput blows division away. (Even more so when your code can auto-vectorize with 256b AVX vectors.) Also note that with the right compiler options, those multiply throughputs also apply to FMA.
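For instance, reading those ratios straight off the scalar `double` columns of the tables above:

```
ratio = divsd throughput / mulsd throughput   (cycles per instruction)

Skylake:  4 / 0.5       =  8
Haswell:  (8-14) / 0.5  = 16-28
```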
Numbers from the http://agner.org/optimize/ instruction tables for Intel Haswell / Skylake and AMD Ryzen, for SSE scalar (not including x87 `fmul` / `fdiv`) and for 256b AVX SIMD vectors of `float` or `double`. See also the x86 tag wiki.