I am working on a language that compiles with LLVM. Just for fun, I wanted to write some microbenchmarks. In one, I run 100 million sin / cos calculations in a loop. In pseudocode, it looks like this:
    var x: Double = 0.0
    for (i <- 0 to 100 000 000)
      x = sin(x)^2 + cos(x)^2
    return x.toInteger
If I compute sin / cos using inline assembly in LLVM IR, in the form:
%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind
This is faster than using fsin and fcos separately. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile to calls to the C math library functions, at least with the target settings that I use (x86_64 with SSE enabled).
It seems that LLVM inserts some conversions between single and double precision floating point, which could be the culprit. Why does it do that? Sorry, I'm a relative newbie to assembly:
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        xorps   %xmm0, %xmm0
        movl    $-1, %eax
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movss   %xmm0, -4(%rsp)
        flds    -4(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl   -16(%rsp)
        fstpl   -24(%rsp)
        movsd   -16(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        cvtsd2ss        %xmm0, %xmm1
        movsd   -24(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        addss   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %eax
        cmpl    $99999999, %eax         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttss2si       %xmm0, %eax
        ret
.Ltmp160:
        .size   main, .Ltmp160-main
        .cfi_endproc
The same test with calls to the llvm sin / cos intrinsics:
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        pushq   %rbx
.Ltmp162:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp163:
        .cfi_def_cfa_offset 32
.Ltmp164:
        .cfi_offset %rbx, -16
        xorps   %xmm0, %xmm0
        movl    $-1, %ebx
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movsd   %xmm0, (%rsp)           # 8-byte Spill
        callq   cos
        mulsd   %xmm0, %xmm0
        movsd   %xmm0, 8(%rsp)          # 8-byte Spill
        movsd   (%rsp), %xmm0           # 8-byte Reload
        callq   sin
        mulsd   %xmm0, %xmm0
        addsd   8(%rsp), %xmm0          # 8-byte Folded Reload
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %ebx
        cmpl    $99999999, %ebx         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttsd2si       %xmm0, %eax
        addq    $16, %rsp
        popq    %rbx
        ret
.Ltmp165:
        .size   main, .Ltmp165-main
        .cfi_endproc
Can you suggest how the ideal assembly would look with fsincos? PS. Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl, etc.), but the speed remains the same.
        .globl  main
        .align  16, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# BB#0:                                 # %loopEntry1
        xorps   %xmm0, %xmm0
        movl    $-1, %eax
        jmp     .LBB44_1
        .align  16, 0x90
.LBB44_2:                               # %then4
                                        # in Loop: Header=BB44_1 Depth=1
        movsd   %xmm0, -8(%rsp)
        fldl    -8(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl   -24(%rsp)
        fstpl   -16(%rsp)
        movsd   -24(%rsp), %xmm1
        mulsd   %xmm1, %xmm1
        movsd   -16(%rsp), %xmm0
        mulsd   %xmm0, %xmm0
        addsd   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                        # =>This Inner Loop Header: Depth=1
        incl    %eax
        cmpl    $99999999, %eax         # imm = 0x5F5E0FF
        jle     .LBB44_2
# BB#3:                                 # %break3
        cvttsd2si       %xmm0, %eax
        ret
.Ltmp160:
        .size   main, .Ltmp160-main
        .cfi_endproc