Gcc
[Edit observed behavior of gcc 4.8.4, where the default behavior is the opposite of documentation]
You need to use 80 bit registers in x87 FPU. With -mfpmath=387 you can override the default SSE registers XMM0-XMM7. This, by default, actually gives you IEEE behavior, where 64-bit registers are used at each step.
See: https://gcc.gnu.org/wiki/x87note
Thus, by default, x87 arithmetic does not match the 64/32 bit IEEE, but receives extended precision from the x87 block. However, at any time, the value has moved from the registers to the IEEE storage of 64 or 32 bits; this 80-bit value should be rounded to the corresponding number of bits.
If your operation is extremely complex, however, a registry spill may occur; the FP register stack is only a depth of 8. Therefore, when a leak copies to the RAM cell in size, you get rounding. You need to either declare a long double yourself in this case, or round manually at the end, or check the assembler output for an explicit leak.
More about registers here: https://software.intel.com/en-us/articles/introduction-to-x64-assembly
In particular, the XMM0 ... 7 registers, while 128 bits wide, are intended only for two simultaneous 64-bit FP operations. So you want to see stack-driven FPR registers with the instructions FLD (load), FMUL (multiply) and FSTP (save and pop).
So, I compiled this code:
double mult(double x, double y) { return x * y; }
with:
gcc -mfpmath=387 -Ofast -o precision.s -S precision.c
And received:
mult: .LFB24: .cfi_startproc movsd %xmm1, -8(%rsp) fldl -8(%rsp) movsd %xmm0, -8(%rsp) fldl -8(%rsp) fmulp %st, %st(1) fstpl -8(%rsp) movsd -8(%rsp), %xmm0 ret .cfi_endproc
Now everything makes sense. Floating point values ββare passed through the XMM0 and XMM1 registers (although they have to do weird round trips before they can be pushed onto the FPR stack), and the result is returned to XMM0 according to the Intel link above. Not sure why there is no simple FLD instruction directly from XMM0 / 1, but apparently the command set does not.
If you compare with -mfpmath=sse , in the latter case much less needs to be done, because the operands are ready and waiting in the XMM0 / 1 registers, and this is as simple as a single MULSD command.