Is calling the fsincos instruction in LLVM slower than calling libc sin / cos?

Question

I am working on a language that is compiled with LLVM. Just for fun, I wanted to do some microbenchmarks. In one, I run 100 million sin / cos computations in a loop. In pseudocode, it looks like this:

    var x: Double = 0.0
    for (i <- 0 to 100 000 000)
      x = sin(x)^2 + cos(x)^2
    return x.toInteger

If I compute sin / cos using inline assembly in LLVM IR, in the form:

 %sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind

this is faster than using fsin and fcos separately instead of fsincos. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile to calls to the C math library functions, at least with the target settings that I use (x86_64 with SSE enabled).

It seems that LLVM inserts some conversions between single and double precision FP, which might be the culprit. Why is that? Sorry, I'm a relative newbie at assembly:

        .globl main
        .align 16, 0x90
        .type main,@function
    main:                                   # @main
        .cfi_startproc
    # BB#0:                                 # %loopEntry1
        xorps %xmm0, %xmm0
        movl $-1, %eax
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2:                               # %then4
                                            # in Loop: Header=BB44_1 Depth=1
        movss %xmm0, -4(%rsp)
        flds -4(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl -16(%rsp)
        fstpl -24(%rsp)
        movsd -16(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm1
        movsd -24(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm0
        addss %xmm1, %xmm0
    .LBB44_1:                               # %loop2
                                            # =>This Inner Loop Header: Depth=1
        incl %eax
        cmpl $99999999, %eax                # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3:                                 # %break3
        cvttss2si %xmm0, %eax
        ret
    .Ltmp160:
        .size main, .Ltmp160-main
        .cfi_endproc

The same test with calls to the LLVM sin / cos intrinsics:

        .globl main
        .align 16, 0x90
        .type main,@function
    main:                                   # @main
        .cfi_startproc
    # BB#0:                                 # %loopEntry1
        pushq %rbx
    .Ltmp162:
        .cfi_def_cfa_offset 16
        subq $16, %rsp
    .Ltmp163:
        .cfi_def_cfa_offset 32
    .Ltmp164:
        .cfi_offset %rbx, -16
        xorps %xmm0, %xmm0
        movl $-1, %ebx
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2:                               # %then4
                                            # in Loop: Header=BB44_1 Depth=1
        movsd %xmm0, (%rsp)                 # 8-byte Spill
        callq cos
        mulsd %xmm0, %xmm0
        movsd %xmm0, 8(%rsp)                # 8-byte Spill
        movsd (%rsp), %xmm0                 # 8-byte Reload
        callq sin
        mulsd %xmm0, %xmm0
        addsd 8(%rsp), %xmm0                # 8-byte Folded Reload
    .LBB44_1:                               # %loop2
                                            # =>This Inner Loop Header: Depth=1
        incl %ebx
        cmpl $99999999, %ebx                # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3:                                 # %break3
        cvttsd2si %xmm0, %eax
        addq $16, %rsp
        popq %rbx
        ret
    .Ltmp165:
        .size main, .Ltmp165-main
        .cfi_endproc

Can you suggest what the ideal assembly would look like with fsincos? PS. Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl, etc.), but the speed remains the same.

        .globl main
        .align 16, 0x90
        .type main,@function
    main:                                   # @main
        .cfi_startproc
    # BB#0:                                 # %loopEntry1
        xorps %xmm0, %xmm0
        movl $-1, %eax
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2:                               # %then4
                                            # in Loop: Header=BB44_1 Depth=1
        movsd %xmm0, -8(%rsp)
        fldl -8(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl -24(%rsp)
        fstpl -16(%rsp)
        movsd -24(%rsp), %xmm1
        mulsd %xmm1, %xmm1
        movsd -16(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        addsd %xmm1, %xmm0
    .LBB44_1:                               # %loop2
                                            # =>This Inner Loop Header: Depth=1
        incl %eax
        cmpl $99999999, %eax                # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3:                                 # %break3
        cvttsd2si %xmm0, %eax
        ret
    .Ltmp160:
        .size main, .Ltmp160-main
        .cfi_endproc

assembly inline-assembly llvm x87

Erkki Lindpere, Sep 18 '12 at 21:18

1 answer

George Koehler · Answer 1 · 2014-06-28T20:44:31+0000

Hardware trig is slow.

Too many documents claim that x87 instructions such as fsin or fsincos are the fastest way to compute trigonometric functions. Those claims are often wrong.

The fastest way depends on your processor. As processors have become faster, old hardware trig instructions like fsin have not kept pace. On some processors, a software function that uses a polynomial approximation for sine or another trig function is now faster than the hardware instruction.

In short, fsincos is too slow.

Hardware trig is obsolete.

There is enough evidence that the x86-64 platform has moved away from hardware trig.

amd64 prefers SSE over x87 for floating point. However, SSE has no equivalent to x87 instructions such as fsin.
For amd64, FreeBSD's libm and glibc implement sin() and similar functions in software, not with x87 trig. glibc has optimized x86-64 assembly for sinf() (single-precision sine) that uses a polynomial approximation, not x87 fsin. NetBSD and OpenBSD made the opposite choice: their libm for amd64 does use x87 instructions.
Steel Bank Common Lisp uses fsin in its x86 backend, but not in its x86-64 backend. For x86-64, SBCL compiles code that calls sin() in libm.

Hardware trig loses the race.

I timed hardware and software sine on an AMD Phenom II X2 560 (3.3 GHz) from 2010. I wrote a C program with this loop:

    volatile double a, s;
    /* ... */
    for (i = 0; i < 100000000; i++)
        s = sin(a);

I compiled this program twice, with two different implementations of sin(). Hard sin() uses x87 fsin. Soft sin() uses a polynomial approximation. My C compiler, gcc -O2, did not replace my sin() call with an inline fsin.
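The answer does not show its two implementations. As a rough illustration only, a "hard" sine could be written with GCC-style extended inline assembly as below. The name `hard_sin_sketch` is hypothetical, and the libm fallback for non-x86 targets is an assumption added so that the sketch builds everywhere.

```c
#include <math.h>

/* Hypothetical "hard" sine: x87 fsin where available, libm otherwise. */
static double hard_sin_sketch(double x)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    double r;
    /* The "t" constraint names st(0), the top of the x87 stack.
       fsin replaces st(0) with sin(st(0)); "0" ties the input to the
       same register as the output. */
    __asm__ ("fsin" : "=t" (r) : "0" (x));
    return r;
#else
    return sin(x);  /* assumption: fall back to libm on other targets */
#endif
}
```

On x86-64 the compiler has to move the argument from an SSE register through memory onto the x87 stack and back, which is the same store/load traffic visible around fsincos in the question's assembly.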

Here are the results for sin(0.5):

    $ time race-hard 0.5
        0m3.40s real     0m3.40s user     0m0.00s system
    $ time race-soft 0.5
        0m1.13s real     0m1.15s user     0m0.00s system

Here soft sin(0.5) is so fast that this processor would compute soft sin(0.5) and soft cos(0.5) faster than a single x87 fsin.

And for sin(123):

    $ time race-hard 123
        0m3.61s real     0m3.62s user     0m0.00s system
    $ time race-soft 123
        0m3.08s real     0m3.07s user     0m0.01s system

Soft sin(123) is slower than soft sin(0.5) because 123 is too large for the polynomial, so the function must first subtract a multiple of 2π. If I also need cos(123), there is a chance that x87 fsincos would be faster than soft sin(123) plus soft cos(123) on this CPU from 2010.
