How can I get an intrinsic for the exp() function in x64 code?

I have the following code, for which I expect the intrinsic version of the exp() function to be used. Unfortunately, it is not used in the x64 build, which makes that build slower than the equivalent Win32 (i.e. 32-bit) build:

    #include "stdafx.h"
    #include <cmath>
    #include <intrin.h>
    #include <iostream>

    int main()
    {
        const int NUM_ITERATIONS=10000000;
        double expNum=0.00001;
        double result=0.0;

        for (double i=0;i<NUM_ITERATIONS;++i)
        {
            result+=exp(expNum);    // <-- The code of interest is here
            expNum+=0.00001;
        }

        // To prevent the above from getting optimized out...
        std::cout << result << '\n';
    }

I use the following switches for my build:

 /Zi /nologo /W3 /WX- /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope /Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\" /Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue 

As you can see, I have /Oi, /O2 and /fp:fast, as required by the MSDN article on intrinsics. Yet despite my efforts, a call to the standard library is made, which makes exp() slower in x64 builds.

Here is the generated assembly:

        for (double i=0;i<NUM_ITERATIONS;++i)
    000000013F911030  movsd       xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]
    000000013F911039  movapd      xmm8,xmm6
    000000013F91103E  movapd      xmm7,xmm9
    000000013F911043  movaps      xmmword ptr [rsp+20h],xmm11
    000000013F911049  movsd       xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]
        {
            result+=exp(expNum);
    000000013F911052  movapd      xmm0,xmm7
    000000013F911056  call        exp (13F911A98h)    // ***** exp lib call is here *****
    000000013F91105B  addsd       xmm8,xmm10
            expNum+=0.00001;
    000000013F911060  addsd       xmm7,xmm9
    000000013F911065  comisd      xmm8,xmm11
    000000013F91106A  addsd       xmm6,xmm0
    000000013F91106E  jb          main+52h (13F911052h)
        }

As you can see in the assembly above, there is a call to the exp() function. Now consider the code generated for the same for loop in the 32-bit build:

        for (double i=0;i<NUM_ITERATIONS;++i)
    00101031  xorps       xmm1,xmm1
    00101034  rdtsc
    00101036  push        ebx
    00101037  push        esi
    00101038  movsd       mmword ptr [esp+1Ch],xmm0
    0010103E  movsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
    00101046  push        edi
    00101047  mov         ebx,eax
    00101049  mov         dword ptr [esp+3Ch],edx
    0010104D  movsd       mmword ptr [esp+28h],xmm0
    00101053  movsd       mmword ptr [esp+30h],xmm1
    00101059  lea         esp,[esp]
        {
            result+=exp(expNum);
    00101060  call        __libm_sse2_exp (101EC0h)    // <--- Quite different from 64-bit
    00101065  addsd       xmm0,mmword ptr [esp+20h]
    0010106B  movsd       xmm1,mmword ptr [esp+30h]
    00101071  addsd       xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]
    00101079  movsd       xmm2,mmword ptr [__real@416312d000000000 (102178h)]
    00101081  comisd      xmm2,xmm1
    00101085  movsd       mmword ptr [esp+20h],xmm0
            expNum+=0.00001;
    0010108B  movsd       xmm0,mmword ptr [esp+28h]
    00101091  addsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
    00101099  movsd       mmword ptr [esp+28h],xmm0
    0010109F  movsd       mmword ptr [esp+30h],xmm1
    001010A5  ja          wmain+40h (101060h)
        }

Much more code, yet it is faster. A timing test I ran on a 3.3 GHz Nehalem-EP host gave the following results:

32-bit:

Average execution time of the for loop body: 34.849229 cycles / 10.560373 ns

64-bit:

Average execution time of the for loop body: 45.845323 cycles / 13.892522 ns

Very strange behavior. Why is this happening?

Update:

I have filed a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating-point intrinsics, especially in x64 code.

+9
c++ visual-c++ visual-studio-2010 intrinsics visual-c++-2010
3 answers

On x64, floating-point arithmetic is performed using SSE. SSE has no built-in operation for exp(), so a call into the standard library is inevitable unless you write your own manually inlined, vectorized __m128d exp(__m128d) (see "Fastest Implementation of Exponential Function Using SSE").
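As a rough illustration of what such a hand-written routine might look like, here is a minimal sketch using SSE2 intrinsics. The names exp_pd_sketch and exp_sketch are hypothetical, the polynomial is a plain Taylor series rather than a tuned minimax fit, and the code assumes finite inputs with |x| < 700 and the default rounding mode. It is not the algorithm used by the Microsoft CRT or by the linked implementation.

```cpp
#include <cassert>
#include <cmath>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not a library function): exp() for two doubles at once.
// Sketch only: assumes finite inputs with |x| < 700 and round-to-nearest;
// no NaN, infinity, overflow or underflow handling.
static inline __m128d exp_pd_sketch(__m128d x)
{
    const __m128d log2e  = _mm_set1_pd(1.4426950408889634);         // 1 / ln 2
    const __m128d ln2_hi = _mm_set1_pd(6.93147180369123816490e-01); // high part of ln 2
    const __m128d ln2_lo = _mm_set1_pd(1.90821492927058770002e-10); // low part of ln 2

    // Range reduction: x = k*ln2 + r with integer k and |r| <= ln2/2.
    const __m128i k  = _mm_cvtpd_epi32(_mm_mul_pd(x, log2e));
    const __m128d kd = _mm_cvtepi32_pd(k);
    __m128d r = _mm_sub_pd(x, _mm_mul_pd(kd, ln2_hi));
    r = _mm_sub_pd(r, _mm_mul_pd(kd, ln2_lo));

    // exp(r) from its Taylor series up to r^12 / 12!, evaluated by Horner's rule.
    static const double coeff[13] = {
        1.0 / 479001600.0, 1.0 / 39916800.0, 1.0 / 3628800.0, 1.0 / 362880.0,
        1.0 / 40320.0,     1.0 / 5040.0,     1.0 / 720.0,     1.0 / 120.0,
        1.0 / 24.0,        1.0 / 6.0,        0.5,             1.0,
        1.0
    };
    __m128d p = _mm_set1_pd(coeff[0]);
    for (int i = 1; i < 13; ++i)
        p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(coeff[i]));

    // Build 2^k by writing k + 1023 into the exponent field of each double.
    __m128i e = _mm_add_epi32(k, _mm_set1_epi32(1023));
    e = _mm_unpacklo_epi32(e, _mm_setzero_si128());  // widen the two 32-bit values to 64-bit lanes
    e = _mm_slli_epi64(e, 52);
    return _mm_mul_pd(p, _mm_castsi128_pd(e));
}

// Scalar convenience wrapper (also hypothetical): broadcasts x and reads back lane 0.
static inline double exp_sketch(double x)
{
    assert(std::fabs(x) < 700.0);  // outside this range the sketch is not valid
    return _mm_cvtsd_f64(exp_pd_sketch(_mm_set1_pd(x)));
}
```

In the loop from the question, two consecutive values of expNum could be packed into one __m128d so that each call produces two results. Whether this actually beats the CRT call would have to be measured.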

I believe the MSDN article you link to was written with 32-bit code using x87 (8087) floating point in mind.

+5

I think the only reason Microsoft provides an intrinsic version of the 32-bit SSE2 exp() is the standard calling conventions. The 32-bit calling conventions require the operand to be pushed onto the main stack and the result to be returned in the top register of the FPU stack. If SSE2 code generation is enabled, the return value will most likely be popped from the FPU stack into memory and then loaded from there into an SSE2 register for whatever math you want to do on the result. Clearly it is faster to pass the operand in an SSE2 register and return the result in an SSE2 register. This is what __libm_sse2_exp() does. In 64-bit code, the standard calling convention already passes the operand and returns the result in SSE2 registers, so there is no advantage to having an intrinsic version.

The reason for the performance difference between the 32-bit SSE2 and 64-bit exp() implementations is that Microsoft uses different algorithms in the two. I don't know why they do this, but the two implementations produce results that differ by 1 ulp for some operands.
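To check a difference like this yourself, you could count the ULPs between the results of the two builds. A minimal sketch follows, assuming IEEE 754 doubles and finite inputs. The names ordered_bits and ulp_distance are hypothetical helpers, not CRT functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: maps a double to an integer that grows monotonically
// with the value, so adjacent doubles map to adjacent integers.
static std::int64_t ordered_bits(double d)
{
    std::int64_t i;
    std::memcpy(&i, &d, sizeof i);  // reinterpret the bit pattern safely
    // Negative doubles are stored as sign + magnitude; remap them to -magnitude.
    return (i < 0) ? (std::numeric_limits<std::int64_t>::min() - i) : i;
}

// Hypothetical helper: number of representable doubles between a and b.
static std::uint64_t ulp_distance(double a, double b)
{
    assert(std::isfinite(a) && std::isfinite(b));  // sketch: finite inputs only
    const std::int64_t ia = ordered_bits(a);
    const std::int64_t ib = ordered_bits(b);
    // Subtract in unsigned arithmetic to avoid signed overflow.
    return (ia > ib)
        ? static_cast<std::uint64_t>(ia) - static_cast<std::uint64_t>(ib)
        : static_cast<std::uint64_t>(ib) - static_cast<std::uint64_t>(ia);
}
```

Printing the exp() results from both builds with full precision (for example as hexadecimal floats) and passing the pairs to ulp_distance would show for which operands the implementations disagree.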

+1

EDIT: I would like to add to this discussion links to the AMD x64 instruction set manuals and to the Intel reference.

On initial inspection, there should be a way to use F2XM1 to compute the exponential. However, that instruction lives in the x87 instruction set, which is hidden in x64 mode.

There is hope in using MMX/x87 explicitly, as described in a post on the VirtualDub forum. And this is how to actually write asm in VC++.

0