According to Agner Fog's instruction tables, one floating-point division is slower than one reciprocal op plus one multiply op. (This seems to be common across x86 architectures.)
Here is an excerpt from the table for the Piledriver architecture:
Instruction      Operands  Ops  Latency  Recip. throughput  Pipes  Domain
MULSS MULSD      x,x/m     1    5-6      0.5                P01    fma
MULPS MULPD      x,x/m     1    5-6      0.5                P01    fma
VMULPS VMULPD    y,y,y/m   2    5-6      1                  P01    fma
DIVSS DIVPS      x,x/m     1    9-24     5-10               P01    fp
VDIVPS           y,y,y/m   2    9-24     9-20               P01    fp
DIVSD DIVPD      x,x/m     1    9-27     5-10               P01    fp
VDIVPD           y,y,y/m   2    9-27     9-18               P01    fp
RCPSS/PS         x,x/m     1    5        1                  P01    fp
The fourth column is latency. So multiplication takes 5-6 cycles, division takes 9-24 cycles, and the reciprocal takes 5 cycles. Since 24 > 6 + 5, I wonder why two separate ops are faster than one single op that produces almost the same result.
I suspect the answer has to do with rounding error. Perhaps division is far more accurate than reciprocal plus multiplication. If so, how do the rounding errors compare? Is the relationship linear? For example, since division is almost twice as slow as reciprocal + multiplication, is it twice as precise?
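To make the error question concrete, here is a rough sketch of how the two approaches could be compared numerically. It does not run the real instructions: `approx_reciprocal` is a hypothetical stand-in that rounds the exact reciprocal to 12 significant bits, which is only an assumption about the order of magnitude of the RCPSS approximation error, and `to_f32` emulates single-precision rounding with `struct`. All function names are made up for this illustration.

```python
import math
import random
import struct


def to_f32(value: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def approx_reciprocal(y: float, bits: int = 12) -> float:
    """Hypothetical model of a low-precision reciprocal.

    Rounds the exact 1/y to `bits` significant bits. This is an assumption
    for illustration, not the algorithm the hardware actually uses.
    """
    mantissa, exponent = math.frexp(1.0 / y)  # 0.5 <= |mantissa| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def max_relative_errors(samples: int = 10_000, seed: int = 0):
    """Return (worst error of division, worst error of reciprocal * multiply)."""
    rng = random.Random(seed)
    worst_div = 0.0
    worst_rcp_mul = 0.0
    for _ in range(samples):
        x = to_f32(rng.uniform(1.0, 1000.0))
        y = to_f32(rng.uniform(1.0, 1000.0))
        reference = x / y  # double-precision quotient used as the reference
        div32 = to_f32(x / y)  # emulated single-precision division
        rcp_mul32 = to_f32(x * approx_reciprocal(y))  # emulated rcp + mul
        worst_div = max(worst_div, abs(div32 - reference) / reference)
        worst_rcp_mul = max(worst_rcp_mul, abs(rcp_mul32 - reference) / reference)
    return worst_div, worst_rcp_mul


if __name__ == "__main__":
    div_err, rcp_err = max_relative_errors()
    print(f"division:        max relative error {div_err:.3e}")
    print(f"reciprocal+mul:  max relative error {rcp_err:.3e}")
```

Under this model the division error is bounded by single-precision rounding (about 2^-24), while the reciprocal path is limited by the 12-bit approximation (about 2^-12). If the real instruction behaves similarly, the gap in accuracy would be several orders of magnitude rather than a factor of two.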