What happens in the background when converting int to float

Does anyone have an idea how to convert an int to a float, step by step? Suppose I have a signed integer in binary format, and I want to do the conversion by hand, but I can't figure out how. Can someone show me how to do this conversion step by step?

I do this conversion in C all the time, like:

 int a = foo(); float f = (float)a;

But I did not understand what was happening in the background, and in order to understand it properly I want to do the conversion manually.

EDIT: If you know the conversion well, information on how the conversion to double differs from float would also be welcome, as well as float to int.

+4
c floating-point int
Nov 02 '11 at 8:00
2 answers

Floating-point values (IEEE754 ones, at any rate) have three components:

  • a sign bit s;
  • a sequence of exponent bits e; and
  • a sequence of mantissa bits m.

The precision determines how many bits are available for the exponent and the mantissa. Consider the value 0.1 as a single-precision float:

 s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
 0 01111011 10011001100110011001101
            ||||||||||||||||||||||+- 8388608
            |||||||||||||||||||||+-- 4194304
            ||||||||||||||||||||+--- 2097152
            |||||||||||||||||||+---- 1048576
            ||||||||||||||||||+----- 524288
            |||||||||||||||||+------ 262144
            ||||||||||||||||+------- 131072
            |||||||||||||||+-------- 65536
            ||||||||||||||+--------- 32768
            |||||||||||||+---------- 16384
            ||||||||||||+----------- 8192
            |||||||||||+------------ 4096
            ||||||||||+------------- 2048
            |||||||||+-------------- 1024
            ||||||||+--------------- 512
            |||||||+---------------- 256
            ||||||+----------------- 128
            |||||+------------------ 64
            ||||+------------------- 32
            |||+-------------------- 16
            ||+--------------------- 8
            |+---------------------- 4
            +----------------------- 2

The sign is positive; that part is easy.

The exponent bits give 64+32+16+8+2+1 = 123; subtracting the bias of 127 leaves -4, so the factor is 2^-4, or 1/16. The bias is there so you can represent really small numbers (like 10^-30) as well as big ones.

The mantissa is the chunky part. It consists of 1 (the implicit base) plus, for each set bit, 1/(2^n), where n starts at 1 for the leftmost mantissa bit and increases to the right: {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.

When you add all those up, you get 1.60000002384185791015625.

When you multiply that by the 2^-4 factor, you get 0.100000001490116119384765625, which is why it is said that you cannot represent 0.1 exactly as an IEEE754 float.

In terms of converting integers to floats: if you have at least as many mantissa bits (including the implicit 1) as the integer has significant bits, you can simply transfer the integer bit pattern and select the correct exponent. There is no loss of accuracy. For example, an IEEE754 double (64 bits, 52/53 of them mantissa) has no problem with a 32-bit integer.

If your integer has more significant bits (for example, a 32-bit integer going into a single-precision float, which has only 23/24 mantissa bits), you need to scale the integer.

That means discarding the least significant bits (with rounding) so that the value fits into the mantissa bits. This loses accuracy, but it is unavoidable.




Let's look at the specific value 123456789. The following program dumps the bits of each data type:

 #include <stdio.h>

 static void dumpBits (const char *desc, unsigned char *addr, size_t sz) {
     unsigned char mask;

     printf ("%s:\n ", desc);
     while (sz-- != 0) {
         putchar (' ');
         for (mask = 0x80; mask > 0; mask >>= 1)
             if ((addr[sz] & mask) == 0)
                 putchar ('0');
             else
                 putchar ('1');
     }
     putchar ('\n');
 }

 int main (void) {
     int intNum = 123456789;
     float fltNum = intNum;
     double dblNum = intNum;

     printf ("%d %f %f\n", intNum, fltNum, dblNum);
     dumpBits ("Integer", (unsigned char *)(&intNum), sizeof (int));
     dumpBits ("Float",   (unsigned char *)(&fltNum), sizeof (float));
     dumpBits ("Double",  (unsigned char *)(&dblNum), sizeof (double));
     return 0;
 }

The output on my system is as follows:

 123456789 123456792.000000 123456789.000000
 Integer:
   00000111 01011011 11001101 00010101
 Float:
   01001100 11101011 01111001 10100011
 Double:
   01000001 10011101 01101111 00110100 01010100 00000000 00000000 00000000

Let's look at them one at a time. First, the integer, using simple powers of two:

  00000111 01011011 11001101 00010101
       |||  | || || ||  || |    | | +-> 1
       |||  | || || ||  || |    | +---> 4
       |||  | || || ||  || |    +-----> 16
       |||  | || || ||  || +----------> 256
       |||  | || || ||  |+------------> 1024
       |||  | || || ||  +-------------> 2048
       |||  | || || |+----------------> 16384
       |||  | || || +-----------------> 32768
       |||  | || |+-------------------> 65536
       |||  | || +--------------------> 131072
       |||  | |+----------------------> 524288
       |||  | +-----------------------> 1048576
       |||  +-------------------------> 4194304
       ||+----------------------------> 16777216
       |+-----------------------------> 33554432
       +------------------------------> 67108864
                                        =========
                                        123456789

Now look at the single-precision float. Notice that the mantissa bit pattern is an almost perfect match for the integer:

 mantissa:        11 01011011 11001101 00011    (spaced out)
 integer:   00000111 01011011 11001101 00010101 (untouched)

There is an implicit 1 bit to the left of the mantissa, and the value is rounded at the other end, which is where the loss of accuracy comes from (the value changes from 123456789 to 123456792, as shown in the program output above).

Working out the value:

 s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
 0 10011001 11010110111100110100011
            || | || |||| || |   |+- 8388608
            || | || |||| || |   +-- 4194304
            || | || |||| || +------ 262144
            || | || |||| |+-------- 65536
            || | || |||| +--------- 32768
            || | || |||+------------ 4096
            || | || ||+------------- 2048
            || | || |+-------------- 1024
            || | || +--------------- 512
            || | |+----------------- 128
            || | +------------------ 64
            || +-------------------- 16
            |+---------------------- 4
            +----------------------- 2

The sign is positive. The exponent is 128+16+8+1 = 153; minus the 127 bias gives 26, so the factor is 2^26, or 67108864.

The mantissa is 1 (the implicit base) plus (as explained above) {1/2, 1/4, 1/16, 1/64, 1/128, 1/512, 1/1024, 1/2048, 1/4096, 1/32768, 1/65536, 1/262144, 1/4194304, 1/8388608}. When you add all those up, you get 1.83964955806732177734375.

When you multiply that by the 2^26 factor, you get 123456792, the same as the program output.

The bit dump of the double:

 s eeeeeeeeeee mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
 0 10000011001 1101011011110011010001010100000000000000000000000000

I'm not going to work through the value of this beast :-) but I will show its mantissa next to the integer format, to show the bit representation they share:

 mantissa:        11 01011011 11001101 00010101 000...000 (spaced out)
 integer:   00000111 01011011 11001101 00010101           (untouched)

You can again see the common bits, with the implicit 1 on the left and considerably more bits available on the right, so in this case there is no loss of accuracy.




As for conversion between floats and doubles, that is also fairly easy to understand.

First you need to check for special values such as NaN and infinity. These are indicated by special exponent/mantissa combinations, and it is probably easiest to detect them up front and generate the equivalent in the new format.

Then, when going from double to float, you obviously have less range available, since there are fewer bits in the exponent. If your double is outside the range of a float, you need to handle that.

Assuming it fits, you need to:

  • rebias the exponent (the bias differs between the two types);
  • copy as many mantissa bits as will fit (rounding if necessary); and
  • pad out the rest of the mantissa (if any) with zero bits.
+11
Nov 02 '11 at 8:15

Basically, it's pretty simple. A float (in IEEE 754-1985) has the following representation:

  • a 1-bit sign;
  • an 8-bit biased exponent (0 means a denormalized number, 1 means an exponent of -126, 127 means an exponent of 0, 255 means infinity or NaN); and
  • a 23-bit mantissa (the part that follows the "1.").

So basically the conversion is about:

  • determining the sign and magnitude of the number;
  • finding the 24 most significant bits, correctly rounded;
  • adjusting the exponent; and
  • encoding those three parts into the 32-bit form.

When implementing your own conversion, it's easy to test, because you can simply compare your results against the built-in cast.

+1
Nov 02 '11 at 8:20


