Some questions about floating points

I am wondering whether, if a number is represented one way in a floating-point representation, it will be represented the same way in a larger representation. That is, if a number has a specific representation as a float, will it have the same representation if that float is converted to double, and again if that double is converted to long double?

I am interested because I am writing a BigInteger implementation, and any floating-point number that is passed in gets sent through a single function taking a long double to convert it. This leads to my next question. Obviously, floating-point numbers do not always have exact representations, so in my BigInteger class I need to decide what value to store when it is assigned a float. Is it reasonable to aim for the same number printed by std::cout << std::fixed << someFloat;, even if that is not exactly the number that was passed in? Is that the most accurate value I can get? If yes, then...

What is the best way to extract that value (into a representation with some power-of-10 base)? For now I just grab it as a string and pass it to my string constructor. That works, but I can't help feeling it is not the best way; and of course, taking the remainder when dividing by my base is inexact with floats.
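For reference, here is roughly what I do now, reduced to a sketch (the helper name is made up; my real code hands the string straight to the BigInteger string constructor):

    #include <iomanip>
    #include <sstream>
    #include <string>

    // Current approach: print the value with std::fixed and no fractional
    // digits, then hand the digit string to the BigInteger string constructor.
    std::string floatToDigitString(long double value)
    {
        std::ostringstream oss;
        oss << std::fixed << std::setprecision(0) << value;
        return oss.str();   // e.g. "13" -- then BigInteger(oss.str())
    }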

Finally, I am wondering whether there is a floating-point equivalent of uintmax_t, that is, a type name that will always be the largest floating-point type on a system, or whether there is no point because long double will always be the largest (even if it is the same as double).

Thanks T.

+1
c++ floating-point floating-accuracy
06 Oct '10 at 16:10
3 answers

If by "the same representation" you mean "exactly the same binary representation in memory, except filling", then no. Double precision has more bits of both the exponent and the mantissa, and also has a different bias of the exponent. But I believe that any value of one precision is accurately represented in double precision (except, possibly, denormalized values).

I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal values have exact binary floating-point representations (and vice versa), but I'm not sure why that is a problem here. As long as your floating-point input has no fractional part, a suitable BigInteger format should be able to represent it exactly.

Converting through a base-10 representation is not the way to go. In theory, all you need is a bit array of around 1024 bits, initialized to zero, into which you shift the mantissa bits according to the exponent value. But without knowing more about your implementation, I can't offer much more!
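One way to get at the mantissa bits and the exponent without raw bit twiddling is frexp/ldexp. A minimal sketch, assuming an IEEE double holding a non-negative whole number (the function name and the way the result would be fed into the bit array are my own assumptions):

    #include <cmath>
    #include <cstdint>
    #include <limits>

    // Decompose 'value' into an integer mantissa and a base-2 shift, so that
    // value == mant * 2^shift.  A binary BigInteger would store 'mant' in its
    // bit array and then shift left by 'shift'.
    void decompose(double value, std::uint64_t& mant, int& shift)
    {
        int exp2 = 0;
        double frac = std::frexp(value, &exp2);               // value == frac * 2^exp2, 0.5 <= frac < 1
        const int bits = std::numeric_limits<double>::digits; // 53 for IEEE double
        mant  = static_cast<std::uint64_t>(std::ldexp(frac, bits)); // exact: full mantissa as an integer
        shift = exp2 - bits;
        // For whole-number inputs, drop trailing zero bits so the shift is non-negative.
        while (shift < 0 && (mant & 1u) == 0u) { mant >>= 1; ++shift; }
    }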

+9
Oct 06 '10 at 16:18

The set of double values includes all float values, and long double includes all double values, so you do not lose any value information by converting to long double. You do, however, lose information about the original type, which matters (see below).

To follow the general semantics of C++, converting a floating-point value to an integer should truncate the value, not round it.

The main problem is that large values are not exact. You can use the frexp function to find the base-2 exponent of a floating-point value, and std::numeric_limits<T>::digits to check whether it is within the range of integers that can be represented exactly.

My personal design choice would be to assert that the fp value is within the range that can be represented exactly, i.e. to place limits on the range of any actual argument.

To do that, you will need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument type. A sketch of such a check follows.
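For instance, the check might look like this. This is only a sketch of the idea; the function names are invented, and a small template helper sits behind the type-specific overloads:

    #include <cmath>
    #include <limits>

    namespace detail {
        // Accept the value only if its magnitude is small enough that every
        // integer up to it is exactly representable in T, i.e. the base-2
        // exponent does not exceed the mantissa width.
        template <typename T>
        bool in_exact_integer_range(T value)
        {
            int exp2 = 0;
            std::frexp(value, &exp2);     // |value| < 2^exp2 (frexp of 0 yields exp2 == 0)
            return exp2 <= std::numeric_limits<T>::digits;
        }
    }

    // Separate overloads, since the exactly representable range depends on the
    // actual argument type (24 mantissa bits for IEEE float, 53 for double, ...).
    bool in_exact_integer_range(float v)       { return detail::in_exact_integer_range(v); }
    bool in_exact_integer_range(double v)      { return detail::in_exact_integer_range(v); }
    bool in_exact_integer_range(long double v) { return detail::in_exact_integer_range(v); }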

Once you have an fp value that is within the acceptable range, you can use floor and fmod to extract the digits in whatever base you need.
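A minimal sketch of that digit extraction, assuming the value has already passed the range check above and is a non-negative whole number (base 10 here; the digits come out least significant first):

    #include <cmath>
    #include <vector>

    // Extract base-'base' digits using fmod for the remainder; the subtraction
    // and division are exact for values within the checked range.
    std::vector<int> extract_digits(double value, double base = 10.0)
    {
        std::vector<int> digits;
        while (value >= 1.0) {
            double digit = std::fmod(value, base);          // exact remainder
            digits.push_back(static_cast<int>(digit));
            value = std::floor((value - digit) / base);     // exact here; floor is just a safety net
        }
        if (digits.empty()) digits.push_back(0);            // the value 0 still has one digit
        return digits;
    }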

+2
06 Oct '10 at 18:27

Yes, when widening from IEEE float to double, the bits of the smaller format map onto the larger format, for example:

 single
 S EEEEEEEE MMMMMMM .....
 double 
 S EEEEEEEEEEE MMMMM ....

 6.5 single
 0 10000001 101000 ...
 6.5 double
 0 10000000001 101000 ...
 13 single
 0 10000010 101000 ...
 13 double
 0 10000000010 101000 ...

The mantissa is left-justified and then padded with zeros.

For the exponent, the low bits stay right-justified, the msbit is copied to the new msbit position, and the extra bit positions in between are filled with the inverse of the msbit.

Take an exponent of, say, -2. Subtract 1 to get -3. -3 in two's complement is 0xFD, or 0b11111101, but the exponent bits in the format are 0b01111101: the msbit is inverted. For the double version of the exponent -2: -2 - 1 = -3, or 0b111...1101, which becomes 0b011...1101, again with the msbit inverted. (Exponent bits = twos_complement(exponent - 1) with the msbit inverted.)

As we saw above for an exponent of 3: 3 - 1 = 2 = 0b000...010; invert the top bit to get 0b100...010.

So yes, you can take the bits of a single-precision number and copy them into the appropriate places in a double-precision number. I don't have a reference for the extended (long double) format handy, but I'm sure it works the same way.
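A quick way to see this on a real machine, assuming IEEE-754 float and double and using memcpy to view the raw bits (purely illustrative):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Print sign, biased exponent and mantissa of a float and of the double it
    // converts to, so the re-biased exponent and zero-padded mantissa are visible.
    int main(void)
    {
        float  f = 6.5f;
        double d = f;                        // the widening conversion in question

        std::uint32_t fb;
        std::uint64_t db;
        std::memcpy(&fb, &f, sizeof fb);     // inspect the bits without aliasing trouble
        std::memcpy(&db, &d, sizeof db);

        std::printf("float  6.5: sign %u exp %4u mant 0x%06x\n",
                    (unsigned)(fb >> 31), (unsigned)((fb >> 23) & 0xFFu),
                    (unsigned)(fb & 0x7FFFFFu));
        std::printf("double 6.5: sign %u exp %4u mant 0x%013llx\n",
                    (unsigned)(db >> 63), (unsigned)((db >> 52) & 0x7FFu),
                    (unsigned long long)(db & 0xFFFFFFFFFFFFFull));
        return 0;
    }

This prints a biased exponent of 129 for the float and 1025 for the double, matching the 10000001 and 10000000001 patterns above.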

0
Oct 08 '10 at 1:10


