Why is an FP division op slower than a reciprocal op plus a multiply op?

According to Agner Fog's instruction tables, one FP division op is slower than one reciprocal op plus one multiply op. (This seems to be common across x86 microarchitectures.)

This is an excerpt from the table for the Piledriver architecture.

Instruction    Operands  Ops  Latency  Recip. throughput  Pipes  Unit
MULSS MULSD    x,x/m     1    5-6      0.5                P01    fma
MULPS MULPD    x,x/m     1    5-6      0.5                P01    fma
VMULPS VMULPD  y,y,y/m   2    5-6      1                  P01    fma
DIVSS DIVPS    x,x/m     1    9-24     5-10               P01    fp
VDIVPS         y,y,y/m   2    9-24     9-20               P01    fp
DIVSD DIVPD    x,x/m     1    9-27     5-10               P01    fp
VDIVPD         y,y,y/m   2    9-27     9-18               P01    fp
RCPSS/PS       x,x/m     1    5        1                  P01    fp

The fourth column is latency (in clock cycles). So a multiply op takes 5-6 cycles, a divide op takes 9-24 cycles, and the reciprocal op takes 5 cycles. Since 24 > 6 + 5, I wonder why two separate ops are faster than a single op that produces almost the same result.

I suspect that the answer is related to accuracy. Perhaps division is much more accurate than reciprocal plus multiplication. If so, how do the errors compare? Is there a linear relationship, for example: since division is almost twice as slow as reciprocal + multiply, is it twice as accurate?
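
To make the accuracy part of the question concrete, here is a minimal sketch (the test values are arbitrary) comparing an exact DIVSS against RCPSS followed by a multiply. DIVSS is correctly rounded (error at most 0.5 ulp, roughly 6e-8 relative for single precision), while RCPSS is only specified to a relative error of at most about 1.5 * 2^-12 (roughly 3.7e-4), so the gap is far larger than the factor of two the timings might suggest:

    #include <stdio.h>
    #include <math.h>
    #include <xmmintrin.h>   /* SSE: _mm_div_ss, _mm_rcp_ss */

    int main(void)
    {
        float x = 355.0f, y = 113.0f;          /* arbitrary test values */

        /* Exact, correctly rounded division (DIVSS). */
        float q_div = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(x), _mm_set_ss(y)));

        /* Approximate reciprocal (RCPSS, ~12 bits) followed by a multiply. */
        float q_rcp = x * _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(y)));

        double exact = (double)x / (double)y;   /* higher-precision reference */
        printf("divss:       %.9g  rel.err %.2e\n", q_div, fabs(q_div - exact) / exact);
        printf("rcpss + mul: %.9g  rel.err %.2e\n", q_rcp, fabs(q_rcp - exact) / exact);
        return 0;
    }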

1 answer

IIRC, the fast approximate reciprocal and reciprocal-sqrt instructions (RCPPS / RSQRTPS) are basically a lookup in an internal table, without the iterative refinement that makes exact division / sqrt slow and hard to pipeline. That is why / how they can be implemented with a throughput of 1 per clock.
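
For context, the usual software trick is to take the ~12-bit RCPPS estimate and add one Newton-Raphson step, which roughly doubles the number of correct bits. A sketch, with helper names of my own choosing rather than any standard API:

    #include <xmmintrin.h>

    /* One Newton-Raphson step: r1 = r0 * (2 - y*r0).
       Starting from the ~12-bit RCPPS estimate this gives ~23 bits,
       close to (but not exactly) a correctly rounded 1/y. */
    static inline __m128 rcp_nr_ps(__m128 y)
    {
        __m128 r   = _mm_rcp_ps(y);                  /* hardware estimate */
        __m128 two = _mm_set1_ps(2.0f);
        return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(y, r)));
    }

    /* Approximate x / y as x * refined_reciprocal(y). */
    static inline __m128 div_approx_ps(__m128 x, __m128 y)
    {
        return _mm_mul_ps(x, rcp_nr_ps(y));
    }

Even with the extra multiply and subtract, this can beat DIVPS on throughput when many independent divisions are in flight, because the divider unit is not fully pipelined.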

Exact division (divss etc.) cannot take that shortcut, because it has to produce a correctly rounded result; even on much newer cores such as Skylake, the FP div/sqrt unit is still the slowest and least-pipelined of the FP execution units.


For more about rsqrt and refining the approximation, see: Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?
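
A similarly hedged sketch of that trick, computing sqrt(x) as x * rsqrt(x):

    #include <xmmintrin.h>

    /* sqrt(x) approximated as x * rsqrt(x): only ~12 bits accurate without a
       Newton-Raphson step, and note that x = 0 gives 0 * +inf = NaN, so this
       shortcut is only safe when the input is known to be strictly positive. */
    static inline __m128 sqrt_via_rsqrt_ps(__m128 x)
    {
        return _mm_mul_ps(x, _mm_rsqrt_ps(x));
    }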

