How can I convert a float to a double (both in their IEEE-754 bit representations) without loss of precision?

I mean, for example, I have the following number encoded in IEEE-754 single precision:

"0100 0001 1011 1110 1100 1100 1100 1100" (approximately 23.85 in decimal) 

The binary number above is stored in a literal string.

The question is, how can I convert this string to an IEEE-754 double-precision representation (somewhat similar to the following, but the value is not the same) WITHOUT loss of precision?

 "0100 0000 0011 0111 1101 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010" 

which is the same number encoded in double precision by IEEE-754.

I tried using the following formula to convert the first string to a decimal number first, but it loses accuracy.

 value = (-1)^sign * (1 + frac * 2^(-23)) * 2^(exp - 127) 
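For reference, plugging the example's fields (sign = 0, biased exponent = 10000011₂ = 131, frac = 01111101100110011001100₂ = 4115660) into this formula gives the true value of the encoded float, slightly below 23.85. The sketch below evaluates it in double precision, using ldexp for the powers of two so every step happens to be exact:

```cpp
#include <cmath>

// Fields of "0100 0001 1011 1110 1100 1100 1100 1100":
// sign = 0 (so the (-1)^sign factor is 1), biased exponent = 131,
// fraction = 4115660 (23 bits). ldexp(x, n) computes x * 2^n exactly here.
double decoded = (1.0 + std::ldexp(4115660.0, -23))  // 1 + frac * 2^(-23)
               * std::ldexp(1.0, 131 - 127);         // * 2^(exp - 127)
// decoded is slightly below 23.85; the accuracy loss appears only once this
// value is rounded to a short decimal string and converted back.
```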

I am using the Qt C++ framework on Windows.

EDIT: I apologize, perhaps my question was not clear. I mean that I do not know the true value 23.85; I only have the first bit string, and I want to convert it to a double-precision representation without losing accuracy.

+7
5 answers

Easy: keep the sign bit, re-bias the exponent (subtract the old bias, add the new bias), and pad the mantissa with zeros on the right...

(As @Mark says, you have to handle the special cases separately, namely when the biased exponent is zero or at its maximum.)

+3

First of all, +1 for specifying the input in binary.

Secondly, that number is not equal to 23.85, but slightly less. If you flip its last binary digit from 0 to 1, the number still will not be exactly 23.85, but slightly more. Neither difference can be represented adequately in a float, but both can be captured more closely in a double.

Thirdly, what you think you are losing is called accuracy, not precision. The precision of a number always grows when you convert from single to double precision, while accuracy can never be improved by a conversion (your inaccurate number stays inaccurate; the extra precision just makes the inaccuracy more obvious).

I recommend converting back to float, or rounding, or adding a very small value just before the number is displayed (or logged), because the visual appearance is what you really lost by increasing the precision.

Resist the temptation to round right after the cast and then use the rounded value in subsequent calculations; this is especially dangerous in loops. While it may look like it fixes the problem in the debugger, the extra inaccuracy can accumulate and distort the final result.
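A minimal illustration of the display-time fix (the helper name and the choice of 6 significant digits, float's decimal precision, are mine):

```cpp
#include <sstream>
#include <iomanip>
#include <string>

// Format d the way it would have looked as a float: about 6 significant
// decimal digits, rounding only at display time, never in the stored value.
std::string displayAsFloatPrecision(double d) {
    std::ostringstream os;
    os << std::setprecision(6) << d;   // default format drops trailing zeros
    return os.str();
}
```

For example, the widened value `static_cast<double>(23.85f)` prints its trailing error digits at full 17-digit precision, but `displayAsFloatPrecision` renders it as "23.85" while the stored double stays untouched for further computation.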

+2

IEEE-754 (and binary floating point in general) cannot represent periodic binary fractions with full precision, even though they are in fact rational numbers with relatively small integer numerators and denominators. Some languages provide a rational type that can represent them exactly (typically languages that also support arbitrary-precision integers).

As a result, the two numbers you posted are NOT equal.

They are actually:

10111.110110011001100110000000000000000000000000000000000000000000...
10111.11011001100110011001100110011001100110011001101000000000...

where ... represents an infinite sequence of 0s.

Stephen Canon, in a comment above, gives you the corresponding decimal values (I did not check them, but I have no reason to doubt that he got them right).

Therefore, the conversion you want cannot be done, because the single-precision number does not contain the necessary information (you have NO way to know whether the number is truly periodic or merely looks periodic where the repetition was cut off).
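This can be checked directly: widening a float never changes its value, but the zero-padding does not land on the double nearest to the original decimal (a small sketch, not from the answer; the function name is mine):

```cpp
// Returns true iff widening the float nearest to 23.85 reproduces the
// double nearest to 23.85. It does not: zero-padding cannot restore the
// repeating ...1100 pattern that 23.85 has in binary.
bool wideningRecoversDecimal() {
    float  f = 23.85f;       // nearest float to 23.85
    double widened = f;      // exact widening: mantissa zero-padded
    double direct  = 23.85;  // nearest double to 23.85
    return widened == direct;
}
```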

+2

It may be easiest to convert the bit string to an actual float, let the hardware convert that to a double, and then read the double's bits back out as a string.
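A sketch of that round trip (the helper name is mine), using std::bitset for both string conversions; the input is the 32-bit string without spaces:

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Turn a 32-bit IEEE-754 bit string into the corresponding 64-bit double
// bit string by letting the FPU do the (exact) widening.
std::string widenBits(const std::string& bits32) {
    uint32_t u = static_cast<uint32_t>(std::bitset<32>(bits32).to_ulong());
    float f;
    std::memcpy(&f, &u, sizeof f);      // reinterpret the bits as a float
    double d = f;                       // exact: every float is a double
    uint64_t v;
    std::memcpy(&v, &d, sizeof v);      // read the double's bits back out
    return std::bitset<64>(v).to_string();
}
```

Applied to the question's example, the result is the sign bit, the re-biased 11-bit exponent, and the 23-bit fraction followed by 29 zeros.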

+1

Binary floating point cannot, in general, represent decimal fractions exactly. Converting from a decimal fraction to a binary floating-point value (see "Bellerophon" in "How to Read Floating Point Numbers Accurately" by William D. Clinger) and from a binary floating-point value back to a decimal fraction (see "Dragon4" in "How to Print Floating-Point Numbers Accurately" by Guy L. Steele Jr. and Jon L. White) nevertheless gives the expected results, because one converts the decimal number to the nearest representable binary floating-point value, and the other tracks the conversion error to determine which decimal it started from (both algorithms are refined and made more practical in David Gay's dtoa.c). Roughly speaking, the algorithms restore about std::numeric_limits<T>::digits10 decimal digits (except possibly trailing zeros) from a floating-point value stored in type T.

Unfortunately, widening a float to a double plays havoc with this: in many cases an attempt to format the widened number will not reproduce the original decimal, because a float padded with zeros differs from the nearest double that Bellerophon would create and, hence, from what Dragon4 expects. There are, however, basically two approaches that work quite well:

  • As someone already suggested: convert the float to a string, and that string to a double. This is not particularly efficient, but it provably gives correct results (provided the not-entirely-trivial algorithms are implemented correctly, of course).
  • Assuming your value lies in a reasonable range, you can multiply it by a power of 10 so that the least significant decimal digit becomes non-zero, convert this number to an integer, convert that integer to a double, and finally divide the resulting double by the original power of 10. I have no proof that this always gives the correct number, but for the range of values I am interested in storing exactly in a float, it works.
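Sketches of both approaches (the function names, the 6-significant-digit assumption in the first, and the decimal-places parameter in the second are mine, not from the answer):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cmath>

// Approach 1: float -> decimal string -> double. Assumes the original
// decimal had at most numeric_limits<float>::digits10 == 6 significant
// digits, so printing 6 digits recovers it exactly.
double viaDecimalString(float f) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.6g", f);
    return std::strtod(buf, nullptr);
}

// Approach 2: scale by a power of ten, round to an exact integer, then
// perform a single correctly rounded division in double precision.
double viaScaledInteger(float f, int decimalDigits) {
    double p = std::pow(10.0, decimalDigits);            // e.g. 100.0, exact
    int64_t n = std::llround(static_cast<double>(f) * p);
    return static_cast<double>(n) / p;
}
```

Under those assumptions, both map the question's example back to the double nearest 23.85 rather than to the zero-padded float.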

One sensible way to avoid the whole problem is to use decimal floating-point values, as described for C++ in the Decimal TR. Unfortunately, it is not yet part of the standard, but I have submitted a proposal to the C++ standards committee to change this.

-1
