Unusual floating point behavior and no extra variables, why?

Question


When I run the following code in Visual C++ 2013 (32-bit, no optimizations):

    #include <cmath>
    #include <iostream>
    #include <limits>

    double mulpow10(double const value, int const pow10)
    {
        static double const table[] =
        {
            1E+000, 1E+001, 1E+002, 1E+003, 1E+004, 1E+005, 1E+006, 1E+007,
            1E+008, 1E+009, 1E+010, 1E+011, 1E+012, 1E+013, 1E+014, 1E+015,
            1E+016, 1E+017, 1E+018, 1E+019,
        };
        return pow10 < 0 ? value / table[-pow10] : value * table[+pow10];
    }

    int main(void)
    {
        double d = 9710908999.008999;
        int j_max = std::numeric_limits<double>::max_digits10;
        while (j_max > 0 && (
            static_cast<double>(
                static_cast<unsigned long long>(
                    mulpow10(d, j_max))) != mulpow10(d, j_max)))
        {
            --j_max;
        }
        double x = std::floor(d * 1.0E9);
        unsigned long long y1 = x;
        unsigned long long y2 = std::floor(d * 1.0E9);
        std::cout
            << "x == " << x << std::endl
            << "y1 == " << y1 << std::endl
            << "y2 == " << y2 << std::endl;
    }

I get

    x  == 9.7109089990089994e+018
    y1 == 9710908999008999424
    y2 == 9223372036854775808

in the debugger.

I am baffled. Can someone explain how on earth y1 and y2 end up with different values?

Update:

This only happens with /arch:SSE2 or /arch:AVX, not with /arch:IA32 or /arch:SSE.

+8

c++ double floating-point visual-c++ unsigned-long-long-int

Mehrdad Jan 31 '14 at 11:25


4 answers

9223372036854775808 is 0x8000000000000000; that is, it is equal to INT64_MIN cast to uint64_t.

It looks like your compiler is casting the return value of floor to long long, and then casting that result to unsigned long long.

Note that it is quite usual for an overflowing floating-point-to-integer conversion to yield the smallest representable value (for example, cvttsd2siq on x86-64):

When a conversion is inexact, a truncated result is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

(This is from the doubleword documentation, but the quadword behavior is the same.)
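The second half of that claim can be checked in isolation. The sketch below only shows the signed-to-unsigned step, which is well defined for integers (the value wraps modulo 2^64); it does not reproduce the out-of-range double-to-long-long step. The function name is made up for illustration.

```cpp
#include <cstdint>
#include <limits>

// Illustrative helper (hypothetical name): reinterprets the most negative
// 64-bit signed integer as an unsigned 64-bit integer. Integer-to-unsigned
// conversion is well defined and wraps modulo 2^64.
std::uint64_t int64_min_as_unsigned() {
    const std::int64_t lowest = std::numeric_limits<std::int64_t>::min();
    return static_cast<std::uint64_t>(lowest);
}
```

The result is 0x8000000000000000, which is the 9223372036854775808 printed for y2.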

+4

ecatmur Jan 31 '14 at 11:53


Hypothesis: this is a compiler bug. The compiler correctly converts double to unsigned long long, but incorrectly converts an extended-precision floating-point value (possibly long double) to unsigned long long. Details:

 double x = std::floor(9710908999.0089989 * 1.0E9);

This computes the value on the right-hand side and stores it in x. The right-hand side may be computed with extended precision, but, as the C++ rules require, it is converted to double when stored in x. The exact mathematical value would be about 9710908999008998870, but rounding it to double gives 9710908999008999424.

 unsigned long long y1 = x;

This converts the double value in x to unsigned long long, producing the expected 9710908999008999424.

 unsigned long long y2 = std::floor(9710908999.0089989 * 1.0E9);

This computes the value on the right-hand side using extended precision, producing 9710908999008998870. When the extended-precision value is converted to unsigned long long, a bug produces 2⁶³ (9223372036854775808). This value is probably the out-of-range error value produced by an instruction that converts the extended-precision format to a 64-bit signed integer. The compiler used the wrong instruction sequence to convert the extended-precision format to unsigned long long.
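The rounding step in this explanation can be checked without any extended precision. The sketch below assumes IEEE 754 doubles and round-to-nearest integer-to-double conversion (the C++ standard leaves the choice between the two neighboring doubles implementation-defined); the function names are made up, and 9710908999008998870 is simply the value quoted in this answer.

```cpp
#include <cmath>
#include <limits>

// Illustrative helper (hypothetical name): distance from v to the next
// larger double, i.e. the spacing of doubles at that magnitude.
double spacing_above(double v) {
    return std::nextafter(v, std::numeric_limits<double>::infinity()) - v;
}

// Illustrative helper (hypothetical name): converts an integer to double and
// back. The input must be small enough that the rounded double is still
// below 2^64, otherwise the conversion back is undefined.
unsigned long long round_trip_through_double(unsigned long long n) {
    return static_cast<unsigned long long>(static_cast<double>(n));
}
```

Between 2⁶³ and 2⁶⁴ adjacent doubles are 2048 apart, so 9710908999008998870 cannot be stored exactly; its nearest double is 9710908999008999424, which is the value seen in x and y1.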

+3

Eric Postpischil Jan 31 '14 at 12:08


You converted y1 to a double before converting it again to a long. The value of x is not the exact "floor" value, but a rounded value of the floor.

The same logic applies to integers and floats: float x = (float) ((int) 1.5) gives a different value than float x = 1.5.

0

James malone Jan 31 '14 at 11:35


hvd · Accepted Answer · 2014-01-31T20:00:06+0000

You are converting out-of-range double values to unsigned long long. This is not allowed in standard C++, and Visual C++ appears to handle it very badly in SSE2 mode: it leaves a number on the FPU stack, eventually overflowing the stack and making later code that uses the FPU fail in really interesting ways.

A reduced sample:

    double d = 1E20;
    unsigned long long ull[] = { d, d, d, d, d, d, d, d };
    if (floor(d) != floor(d)) abort();

This aborts if ull has eight or more elements, but passes if it has up to seven.

The solution is not to convert floating-point values to an integer type unless you know that the value is in range.

4.9 Floating-integral conversions [conv.fpint]
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [Note: If the destination type is bool, see 4.12. - end note]

The rule that out-of-range values wrap around when converted to an unsigned type applies only if the value already has some integer type.

For what it is worth, though, this does not look intentional, so even though the standard permits this behavior, it may still be worth reporting as a bug.
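The advice in this answer, to convert only when the value is known to be in range, could be sketched as follows. This is a minimal illustration, not code from the original post: to_ull_checked is a hypothetical helper, and it assumes that 2^N (N being the bit width of unsigned long long) is exactly representable as a double, which holds for the usual 64-bit type.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper, not part of the original post or of any library.
// Returns true and writes the truncated value to `out` only when that value
// is representable; otherwise leaves `out` untouched and returns false.
bool to_ull_checked(double value, unsigned long long& out) {
    // 2^N as a double, where N is the bit width of unsigned long long.
    // This is the smallest value that is too large to convert.
    const double limit =
        std::ldexp(1.0, std::numeric_limits<unsigned long long>::digits);

    // Written so that NaN is rejected: every comparison with NaN is false.
    // Values in (-1, 0) are accepted because they truncate to 0.
    if (!(value > -1.0 && value < limit)) {
        return false;
    }
    out = static_cast<unsigned long long>(value);
    return true;
}
```

With a check like this, the 1E20 from the reduced sample is rejected instead of reaching the cast, while 9710908999008999424.0 converts normally.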
