std::pow produces different results in 32-bit and 64-bit applications

I found a mismatch in the results of some complex calculations. When I examined the intermediate results carefully, it is the std::pow call that creates the mismatch. Below are the inputs/outputs.

long double dvalue = 2.7182818284589998;
long double dexp = -0.21074699576017999;
long double result = std::powl(dvalue, dexp);

64-bit → result = 0.80997896907296496; 32-bit → result = 0.80997896907296507

I am using VS2008. I also tried the variation of the pow function that takes a long double and returns a long double, but I still see the same difference.

double pow( double base, double exponent );

long double powl( long double base, long double exponent );

I read some information about this:

Intel x86 processors use 80-bit extended precision internally, whereas double is usually 64 bits wide. Different optimization levels affect how often floating-point values are stored from the CPU to memory and are thus rounded from 80-bit precision to 64-bit precision. Alternatively, use the long double type, which gcc usually implements as 80 bits wide, to avoid the rounding from 80-bit to 64-bit precision.

Can someone explain the difference clearly and how to overcome it?

+6
4 answers

What is likely happening is that the 32-bit build uses the 80-bit x87 FPU registers for the computation, while the 64-bit build uses SSE2 instructions on 64-bit values, which causes a slight discrepancy. Note that the two answers agree to 14 decimal places, and that is about the best you can hope for with 64-bit floating-point values.

Visual C++ offers compiler options that let you say whether you prefer speed, consistency, or precision for floating-point operations. Using these options (e.g. /fp:strict), you can get consistent values between the two builds if that matters to you.

Also note that VC++ 2008 is quite old. Newer versions have fixes for many bugs, including some related to floating point. (For example, bugs in popular strtod implementations have been discovered and fixed across open-source software since 2008.) Besides precision differences between 80-bit and 64-bit operations, you may also be seeing errors in parsing and formatting. Floating point is hard, though, and bugs persist.

+4

The most important thing to know about floating-point calculations is that they are (almost always) inexact. Most numbers cannot be represented exactly as floating-point values. And even when the result of a calculation is exactly representable, the computed result may still not be exact.

One way to handle this is to write code that does not depend on getting exact results. For example, you should almost never compare floating-point numbers for equality. If you need to check whether a number is positive, your program may have to reject very small positive numbers (which may really be negative values perturbed by rounding error) or accept very small negative numbers (which may really be positive).

Similarly, you should avoid numerically unstable algorithms, since they make these small errors blow up quickly; instead, try to use numerically robust algorithms, which are resistant to error growth.

Doing numerical calculations well is an entire field of research!

+2

You used literals of type double, not long double (you forgot the L suffix). This means that when you wrote 2.7182818284589998 (a value no double can represent exactly), the compiler had to choose between 2.718281828458999793696193592040799558162689208984375 and 2.71828182845899934960698374197818338871002197265625, and when you wrote -0.21074699576017999 (also not exactly representable as a double), the compiler had to choose between -0.2107469957601799948054832611887832172214984893798828125 and -0.210746995760179967049907645559869706630706787109375.

Rounding to nearest, the values stored in dvalue and dexp were 2.718281828458999793696193592040799558162689208984375 and -0.2107469957601799948054832611887832172214984893798828125 (converting these double values to long double does not change them).

The true result of the pow is then close to 0.8099789690729650165287354526069381795064774873497553965297999359066924950079080502973738475702702999114990234375, which in your case should be computed as a long double (although, as I recall, VS2008 treats long double the same as double). Storing the result in a 64-bit double, we must choose between 0.80997896907296496049610823320108465850353240966796875 and 0.80997896907296507151841069571673870086669921875.

The correct answer (rounding to nearest) is 0.80997896907296507151841069571673870086669921875, and that is exactly what you got as the 32-bit result, displayed truncated as 0.80997896907296507.

Your 64-bit result looks like exactly the other 64-bit double value, i.e. the correct result rounded the wrong way (and displayed truncated as 0.80997896907296496). I would call this a quality-of-implementation issue: gcc, clang, icc, and Oracle's compiler all give the single correctly rounded result (even though they do not have to: the IEEE accuracy requirements for pow allow errors above 0.5 ulp).

By the way, if your pow had returned an Intel 80-bit long double, it would have had to choose between 0.8099789690729650164951504420773886749884695746004581451416015625 and 0.809978969072965016549360550701663896688842214643955230712890625, the latter being correct.

+2

From the Wikipedia page on long double:

On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware [...]. An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.

So with Visual C++, long double is the same as double in both 32-bit and x64 builds; the results differ because the 32-bit build evaluates intermediates in 80-bit x87 registers, while the x64 build uses 64-bit SSE2 arithmetic.

0
