Integer and float accuracy

This is more of a numerical-analysis question than a programming question, but I believe some of you will be able to answer it.

In the sum of two floats, is there any loss of accuracy? Why?

In the sum of a float and an integer, is any precision lost? Why?

Thanks.

+4
9 answers

In the sum of two floats, is there any loss of accuracy?

If the two floats have different magnitudes, and both use their full precision (about 7 decimal digits), then yes, you will see a loss in the last places.

Why?

This is because floats are stored in the form (sign) × (mantissa) × 2^(exponent). If two values have different exponents and you add them, the smaller value's digits are shifted down in the mantissa (because it must be adjusted to the larger exponent):

 PS> [float]([float]0.0000001 + [float]1)
 1

In the sum of a float and an integer, is there any loss of accuracy?

Yes. A normal 32-bit integer can exactly represent values that do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. Of course, this applies only to sufficiently large numbers, i.e. those needing more than 24 bits.

Why?

Because a float has 24 bits of precision while a (32-bit) integer has 32. A float will still be able to store the most significant digits of the value, but the last places will probably differ:

 PS> [float]2100000050 + [float]100
 2100000100
+7

When adding two floating-point numbers there is, as a rule, some error. D. Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" describes the effect and its causes in detail, as well as how to compute an upper bound on the error and how to reason about the precision of more complex calculations.

When adding a float to an integer in C++, the integer is first converted to a float, so two floats are added and error is introduced for the same reasons as above.

+2

The accuracy depends on the magnitudes of the original numbers. In floating point, the computer stores the number 312 in a form of scientific notation:

 3.12000000000 * 10^2

The number of digits on the left side (the mantissa) is fixed. The exponent also has an upper and a lower bound. This is what allows it to represent very large or very small numbers.

If you add two numbers of similar magnitude, the result keeps its accuracy, because the decimal point does not have to move:

 312.0 + 643.0 <==>

   3.12000000000 * 10^2
 + 6.43000000000 * 10^2
 -----------------------
   9.55000000000 * 10^2

If you try to add a very large number and a very small one, you lose accuracy, because both must be squeezed into the format above. Consider 312 + 12300000000000. First the smaller number must be scaled to line up with the larger one, and then the two are added:

   1.23000000000 * 10^13
 + 0.00000000003 * 10^13
 ------------------------
   1.23000000003 * 10^13   <-- precision lost here!

A floating-point number can handle very large or very small numbers. But it cannot represent both at the same time.

As for adding an int and a double: the int is immediately converted to a double, and then the above applies.

+2

The precision available to a float is limited, so of course there is always a risk that any given operation will reduce accuracy.

The answer to both questions is yes.

If you try to add a very large float to a very small one, for example, you will have problems.

Or if you try to add an integer to a float, where the integer needs more bits than the float's mantissa provides.

+1

The short answer: a computer stores a float in a limited number of bits, usually split into a mantissa and an exponent, so only some of the bits hold significant digits while the rest encode the position of the radix point.

If you tried to add (say) 10^23 and 7, it would not be able to represent the result exactly. A similar argument applies when adding a float and an integer: the integer is promoted to float first.

+1

In the sum of two floats, is there any loss of accuracy? In the sum of a float and an integer, is there any loss of accuracy? Why?

Not always. If the sum is representable at the precision you are working with, you get no loss of accuracy.

Examples:

 0.5 + 0.75 => no loss of accuracy
 x * 0.5    => no loss of accuracy (except when x is too small)

In the general case, you add floats of somewhat different magnitudes, so there is a precision loss that depends on the rounding mode: if you add numbers with completely different magnitudes, expect precision problems.

Denormals exist to give extra accuracy in extreme cases, at the expense of CPU time.

Depending on how your compiler handles floating point calculations, the results may vary.

With strict IEEE semantics, adding two 32-bit floats should not give better accuracy than 32 bits. In practice it may take additional instructions to guarantee that, so you should not rely on exact, repeatable floating-point results.

+1

In both cases, yes:

 assert( 1E+36f + 1.0f == 1E+36f );
 assert( 1E+36f + 1 == 1E+36f );
0

The float + int case is the same as float + float, because a standard conversion applies to the int. The float + float case is implementation-dependent, because an implementation may choose to perform the addition at double precision. Of course, there may still be some loss when the result is stored back.

0

In both cases, the answer is yes. When you add an int to a float, the integer is converted to floating-point representation before the addition happens anyway.

To understand why, I suggest you read this gem: What Every Computer Scientist Should Know About Floating-Point Arithmetic.

0
