Floating point division versus floating point multiplication

Is there any (non-microoptimization) performance gain from coding

    float f1 = 200.0f / 2;

in comparison with

    float f2 = 200.0f * 0.5f;

One of my professors told me a few years ago that floating point divisions were slower than floating point multiplications, without elaborating why.

Does this statement hold for modern PC architectures?

Update 1

Regarding the comment, please also consider this case:

    float f1;
    float f2 = 2;
    float f3 = 3;
    for (int i = 0; i < 1e8; i++) {
        f1 = (i * f2 + i / f3) * 0.5; // or divide by 2.0f, respectively
    }

Update 2

Quote from the comments:

[I want] to know what the algorithmic / architectural requirements are that cause division to be vastly more complicated in hardware than multiplication

+57
c++ floating-point micro-optimization
Nov 08 '10
7 answers

Yes, many processors can perform multiplication in 1 or 2 clock cycles, but division always takes longer (although FP division is sometimes faster than integer division).

If you look at this answer, you will see that division can exceed 24 cycles.

Why does division take so much longer than multiplication? If you remember back to high school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously, so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate, but is somewhat faster.

+66
Nov 08 '10

Division is inherently much slower than multiplication.

And it may well be something that the compiler cannot (and you may not want it to) optimize in many cases, due to floating point inaccuracy. These two statements:

    double d1 = 7 / 10.;
    double d2 = 7 * 0.1;

are not semantically identical - 0.1 cannot be represented exactly as a double, so a slightly different value ends up being used - replacing the division with a multiplication in this case would yield a different result!
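A quick way to see the difference for yourself (my addition, not part of the original answer; a minimal check, assuming IEEE 754 doubles):

    #include <cstdio>

    int main() {
        double d1 = 7 / 10.;  // division, correctly rounded to the nearest double
        double d2 = 7 * 0.1;  // multiplication by the (inexact) double nearest 0.1
        std::printf("d1 = %.17g\nd2 = %.17g\nequal: %d\n", d1, d2, d1 == d2);
        // prints equal: 0 -- the two results differ in the last bit
    }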

+16
Nov 08 '10

Yes. Every FPU that I know about performs multiplications much faster than divisions.

However, modern PCs are very fast. They also have pipelined architectures that can render the difference negligible in many circumstances. On top of that, any decent compiler will perform the division operation you showed at compile time with optimizations turned on. For your updated example, any decent compiler would perform that transformation itself.

Generally, you should worry about making your code readable, and let the compiler worry about making it fast. Only if you have a measured speed problem with that line should you worry about perverting your code for the sake of speed. Compilers generally know quite well what is faster on their target processors, and are usually much better optimizers than you can ever hope to be.

+9
Nov 08 '10

Think about what it takes to multiply two n-bit numbers. With the simplest method, you take one number x and repeatedly shift it, conditionally adding it to an accumulator (based on the bits of the other number y). After n additions you are done. Your result fits in 2n bits.

For division, you start with x of 2n bits and y of n bits, and you want to compute x / y. The simplest method is long division, but in binary. At each step you do a comparison and a subtraction to get one more bit of the quotient. This takes n steps.

Some differences: each step of the multiplication only needs to look at 1 bit, while each step of the division needs to look at n bits during the comparison. Each step of the multiplication is independent of all the other steps (the order in which you add the partial products doesn't matter); for division, each step depends on the previous step. This is a big deal in hardware: things that can be done independently can happen in the same clock cycle, as the sketch below illustrates.
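To make the contrast concrete, here is a minimal C++ sketch of the two textbook algorithms for unsigned integers (my illustration, not from the original answer): the partial products in mul are mutually independent, while every quotient bit in div depends on the remainder left by the previous step.

    #include <cassert>
    #include <cstdint>

    // Shift-and-add multiplication: each partial product depends on just one
    // bit of y and nothing else, so hardware can form them all at once and
    // sum them with a tree of adders.
    uint64_t mul(uint32_t x, uint32_t y) {
        uint64_t acc = 0;
        for (int i = 0; i < 32; ++i)
            if ((y >> i) & 1)
                acc += static_cast<uint64_t>(x) << i;  // independent partial products
        return acc;
    }

    // Binary long division (restoring): each compare-and-subtract needs the
    // remainder left by the previous step, so the n steps are strictly serial.
    uint32_t div(uint32_t x, uint32_t y) {
        assert(y != 0);
        uint64_t r = 0;                       // running remainder
        uint32_t q = 0;
        for (int i = 31; i >= 0; --i) {
            r = (r << 1) | ((x >> i) & 1);    // bring down the next bit of x
            if (r >= y) {                     // serial dependency on r
                r -= y;
                q |= 1u << i;
            }
        }
        return q;
    }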

+6
Mar 16 '11 at 7:15

Be very careful with division, and avoid it whenever possible. For example, hoist float inverse = 1.0f / divisor; out of a loop and multiply by inverse inside the loop. (If the rounding error in inverse is acceptable.)

Normally 1.0/x will not be exactly representable as a float or double. It will be exact when x is a power of 2. This lets compilers optimize x / 2.0f to x * 0.5f without any change in the result.
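A small self-check of that power-of-2 case (my addition; assumes IEEE 754 floats, where scaling by a power of 2 only adjusts the exponent):

    #include <cassert>

    int main() {
        // 0.5f is exactly representable, so x / 2.0f and x * 0.5f round the
        // same exact real value and must produce bit-identical results
        // (barring overflow/underflow).
        for (float x = -100.0f; x <= 100.0f; x += 0.37f)
            assert(x / 2.0f == x * 0.5f);
    }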

To let the compiler do this optimization for you even when the result is not exact (or with a runtime-variable divisor), you need options like gcc -O3 -ffast-math . Specifically, -freciprocal-math (enabled by -funsafe-math-optimizations , which is enabled by -ffast-math ) lets the compiler replace x/y with x * (1/y) when that is useful. Other compilers have similar options, and ICC may enable some "unsafe" optimizations by default (I think so, but I forget).

-ffast-math is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. Why doesn't GCC optimize a * a * a * a * a * a to (a * a * a) * (a * a * a)?
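As a concrete illustration of what -freciprocal-math licenses (a hypothetical example of mine, not from the answer):

    // With gcc -O3 -ffast-math (specifically -freciprocal-math), the compiler
    // may compile this loop as: float inv = 1.0f / y; a[i] *= inv; -- hoisting
    // the reciprocal out of the loop even though the result can differ in the
    // last bit when y is not a power of 2.
    void scale(float* a, int n, float y) {
        for (int i = 0; i < n; ++i)
            a[i] /= y;
    }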

Also note that C++ compilers can contract + and * into an FMA in some cases (when compiling for a target that supports it, e.g. -march=haswell ), but they cannot do that with / .




Division has worse latency than multiplication or addition (or FMA ) by a factor of 2 to 4 on modern x86 processors, and worse throughput by a factor of 6 to 40 (for a tight loop doing only division instead of only multiplication).

The divide / sqrt unit is not fully pipelined, for the reasons stated in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel processors have a hardware performance counter, arith.divider_active , which helps you find code that bottlenecks on divider throughput instead of the usual front-end or execution-port bottlenecks. (Or more often, memory bottlenecks or long latency chains limiting instruction-level parallelism, causing instruction throughput to be less than ~4 per clock.)

However, FP division and sqrt on Intel and AMD processors (other than KNL) are implemented as a single uop, so they do not necessarily have a big impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the division.

(Integer division is micro-coded as multiple uops on Intel, so it always has a bigger impact on surrounding code than integer multiply. There is less demand for high-performance integer division, so the hardware support is not as fancy. Related: micro-coded instructions like idiv can cause alignment-sensitive front-end bottlenecks.)

So, for example, this will be really bad:

    for () a[i] = b[i] / scale;  // division throughput bottleneck

    // Instead, use this:
    float inv = 1.0 / scale;
    for () a[i] = b[i] * inv;    // multiply (or store) throughput bottleneck

All you do in the loop is load / divide / store, and the iterations are independent, so it is throughput that matters, not latency.

A reduction like accumulator /= b[i] would bottleneck on divide latency rather than throughput. But with multiple accumulators that you divide or multiply together at the end, you can hide the latency and still saturate the throughput; see the sketch below. Note that sum += a[i]/b[i] bottlenecks on add latency or on div throughput, but not on div latency, because the division is not on the critical path (the loop-carried dependency chain).
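A sketch of that multiple-accumulator idea (my example, assuming the compiler does not reassociate FP math on its own):

    #include <cstddef>

    // Four independent add chains hide FP-add latency, so the loop can keep
    // the divider busy and bottleneck on divide throughput instead.
    float ratio_sum(const float* a, const float* b, std::size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     / b[i];
            s1 += a[i + 1] / b[i + 1];
            s2 += a[i + 2] / b[i + 2];
            s3 += a[i + 3] / b[i + 3];
        }
        for (; i < n; ++i) s0 += a[i] / b[i];  // leftover elements
        return (s0 + s1) + (s2 + s3);
    }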




But in something like this ( approximating a function such as log(x) with the ratio of two polynomials ), division can be pretty cheap :

    for () {
        // (not shown: extracting the exponent / mantissa)
        float p = polynomial(b[i], 1.23, -4.56, ...);  // FMA chain for a polynomial
        float q = polynomial(b[i], 3.21, -6.54, ...);
        a[i] = p / q;
    }
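The polynomial() helper above is not defined in the answer; a plausible shape for a three-coefficient version (hypothetical name and coefficients, my addition) is Horner's rule, which compilers targeting FMA hardware (e.g. -march=haswell ) can contract into an FMA chain:

    #include <cmath>

    // Horner's rule: (c2*x + c1)*x + c0, written with std::fma so each step
    // maps to one fused multiply-add.
    inline float polynomial(float x, float c2, float c1, float c0) {
        return std::fma(std::fma(c2, x, c1), x, c0);
    }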

For log() over the mantissa range, the ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating 2 in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dependency chain, which makes things much easier for out-of-order execution.

In this case, we do not bottleneck on divide latency, because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.

We do not bottleneck on divide throughput either, as long as our polynomials are big enough that we only have one divide for every 10 or so FMA instructions. (And in a real log() use case, there is a bunch of work extracting the exponent / mantissa and recombining things, so there is even more work to do between divisions.)




When you do need to divide, it is usually best to just divide instead of using rcpps

x86 has an approximate-reciprocal instruction ( rcpps ) which gives you only 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)

You can use this to do x/y = x * approx_recip(y) without using an actual division instruction. ( rcpps itself is pretty fast, usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table as a starting point.)

For most purposes, x * rcpps(y) is too inaccurate, and a Newton-Raphson iteration is needed to double the precision. But that costs you 2 multiplies and 2 FMAs , and has latency about as high as an actual divide instruction. If all you are doing is the division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, perhaps by doing the division as part of another loop that does other work.)
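A minimal sketch of that reciprocal-plus-refinement pattern (my example; assumes an SSE target, and uses a mul/sub pair instead of FMA for portability):

    #include <immintrin.h>

    // x / y via rcpps plus one Newton-Raphson step r' = r * (2 - y*r),
    // refining the ~12-bit approximation to nearly full float precision.
    __m128 fast_div(__m128 x, __m128 y) {
        __m128 r = _mm_rcp_ps(y);                         // ~12-bit 1/y estimate
        r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f),
                                     _mm_mul_ps(y, r))); // one N-R iteration
        return _mm_mul_ps(x, r);
    }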

But if you use division as part of a more complex function, then the rcpps itself + the extra mul + FMA usually make it faster to just divide with a divps instruction, except on processors with very low divps throughput.

(For example Knight's Landing, see below. KNL supports AVX512ER , so for float vectors the VRCP28PS result is already accurate enough to just multiply without a Newton-Raphson iteration. The float mantissa is only 24 bits.)
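On such a target the whole thing collapses to two instructions; a sketch assuming an AVX512ER machine like KNL:

    #include <immintrin.h>

    // vrcp28ps gives a ~28-bit reciprocal, already past float's 24-bit
    // mantissa, so one multiply replaces the divide with no refinement step.
    __m512 knl_div(__m512 x, __m512 y) {
        return _mm512_mul_ps(x, _mm512_rcp28_ps(y));
    }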




Specific numbers from Agner Fog's instruction tables:

Unlike every other ALU operation, divide latency / throughput is data-dependent on some CPUs. Again, this is because division is so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids writeback conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3-cycle instruction and then two 1-cycle operations).

Generally, the fastest cases are when the divisor is a "round" number like 2.0 or 0.5 (i.e. the base2 float representation has lots of trailing zeros in the mantissa).

float latency (cycles) / throughput (cycles per instruction, doing just that back to back with independent inputs):

                         scalar & 128b vector            256b AVX vector
                         divss | mulss
                         divps xmm | mulps               vdivps ymm | vmulps ymm

    Nehalem               7-14 / 7-14   | 5 / 1          (No AVX)
    Sandybridge          10-14 / 10-14  | 5 / 1          21-29 / 20-28 (3 uops) | 5 / 1
    Haswell              10-13 / 7      | 5 / 0.5        18-21 / 14    (3 uops) | 5 / 0.5
    Skylake                 11 / 3      | 4 / 0.5           11 / 5     (1 uop)  | 4 / 0.5

    Piledriver            9-24 / 5-10   | 5-6 / 0.5       9-24 / 9-20  (2 uops) | 5-6 / 1 (2 uops)
    Ryzen                   10 / 3      | 3 / 0.5           10 / 6     (2 uops) | 3 / 1   (2 uops)

    Low-power CPUs:
    Jaguar (scalar)         14 / 14     | 2 / 1
    Jaguar                  19 / 19     | 2 / 1             38 / 38    (2 uops) | 2 / 2   (2 uops)
    Silvermont (scalar)     19 / 17     | 4 / 1
    Silvermont              39 / 39 (6 uops)  | 5 / 2     (No AVX)
    KNL (scalar)            27 / 17 (3 uops)  | 6 / 0.5
    KNL                     32 / 20 (18 uops) | 6 / 0.5      32 / 32 (18 uops)  | 6 / 0.5 (AVX and AVX512)

double latency (cycles) / throughput (cycles per instruction):

                         scalar & 128b vector            256b AVX vector
                         divsd | mulsd
                         divpd xmm | mulpd               vdivpd ymm | vmulpd ymm

    Nehalem               7-22 / 7-22   | 5 / 1          (No AVX)
    Sandybridge          10-22 / 10-22  | 5 / 1          21-45 / 20-44 (3 uops) | 5 / 1
    Haswell              10-20 / 8-14   | 5 / 0.5        19-35 / 16-28 (3 uops) | 5 / 0.5
    Skylake              13-14 / 4      | 4 / 0.5        13-14 / 8     (1 uop)  | 4 / 0.5

    Piledriver            9-27 / 5-10   | 5-6 / 1         9-27 / 9-18  (2 uops) | 5-6 / 1 (2 uops)
    Ryzen                 8-13 / 4-5    | 4 / 0.5         8-13 / 8-9   (2 uops) | 4 / 1   (2 uops)

    Low-power CPUs:
    Jaguar                  19 / 19     | 4 / 2             38 / 38    (2 uops) | 4 / 2   (2 uops)
    Silvermont (scalar)     34 / 32     | 5 / 2
    Silvermont              69 / 69 (6 uops)  | 5 / 2     (No AVX)
    KNL (scalar)            42 / 42 (3 uops)  | 6 / 0.5   (Yes, Agner really lists scalar as slower than packed, but fewer uops)
    KNL                     32 / 20 (18 uops) | 6 / 0.5      32 / 32 (18 uops)  | 6 / 0.5 (AVX and AVX512)

Ivybridge and Broadwell are a little different too, but I wanted to keep the table small. (Core2 (before Nehalem) had better divider performance, but its max clock speeds were lower.)

Atom, Silvermont, and even Knight's Landing (Xeon Phi, based on Silvermont) have exceptionally low division performance , and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has a low power budget per core, and packing lots of cores onto a die gives it tighter die-area constraints than Skylake-AVX512. It seems the AVX512ER rcp28ps / pd instructions are what you are "intended" to use on KNL.

(See this InstLatx64 result for Skylake-AVX512, aka Skylake-X. The numbers for vdivps zmm are 18c / 10c, so half the throughput of ymm .)




Long latency chains become a problem when they are loop-carried, or when they are so long that they stop out-of-order execution from finding parallelism with other independent work.




Footnote 1: where those div vs. mul performance ratios come from:

FP divide vs. multiply performance ratios are even worse in low-power processors like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).

Actual divide vs. multiply throughput ratios for scalar (non-vectorized) double : 8 on Ryzen and Skylake with their beefed-up dividers, but 16 to 28 on Haswell (data-dependent, and more likely toward the 28-cycle end unless your divisors are round numbers). These modern processors have very powerful dividers, but their 2-per-clock multiply throughput blows that away. (Even more so when your code can auto-vectorize with 256b AVX vectors.) Also note that with the right compiler options, those multiply throughputs apply to FMA as well.

Numbers from the http://agner.org/optimize/ instruction tables for Intel Haswell / Skylake and AMD Ryzen, for SSE scalar (not including x87 fmul / fdiv ) and for 256-bit AVX SIMD vectors of float or double . See also the x86 tag wiki.

+6
Aug 26 '17 at 20:00

Newton-Raphson solves integer division in O(M(n)) complexity (where M(n) is the cost of an n-bit multiplication) via a linear approximation, which is faster than the otherwise O(n*n) complexity of long division.

In code, the method takes 10 multiplies, 9 adds, and 2 bitwise shifts.

This explains why division takes roughly 12 times as many CPU cycles as multiplication.
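A scalar sketch of the Newton-Raphson reciprocal iteration this answer refers to (my illustration; the seed constant 48/17 - 32/17 * d is the classic choice, and the normalized input range is an assumption not stated in the answer):

    #include <cassert>
    #include <cstdio>

    // Each step r' = r * (2 - d*r) roughly doubles the number of correct
    // bits, which is why a handful of multiply/add steps can replace a
    // bit-at-a-time subtractive divider.
    double recip_nr(double d) {
        assert(d >= 0.5 && d < 1.0);               // normalized mantissa range
        double r = 48.0 / 17.0 - 32.0 / 17.0 * d;  // linear seed, ~4 correct bits
        for (int i = 0; i < 4; ++i)                // ~4 -> 8 -> 16 -> 32 -> 53+ bits
            r = r * (2.0 - d * r);
        return r;
    }

    int main() {
        double d = 0.7;
        std::printf("%.17g\n%.17g\n", recip_nr(d), 1.0 / d);  // should agree
    }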

+2
Apr 02 '16 at 16:30

The answer depends on the platform for which you are programming.

For example, on x86, performing a large number of multiplications over an array should be much faster than performing divisions, because the compiler should generate assembly code that uses SIMD instructions. Since there is no division among the SIMD instructions, you would see big improvements from using multiplication instead of division.

+1
Nov 08 '10


