What is the risk of numerical instability when dividing many values by the same denominator?

Suppose I want to divide many numbers by the same number x:

a /= x; b /= x; c /= x; ... 

Because multiplication is faster than division, the temptation is to write

 tmp = 1.0f / x; a *= tmp; b *= tmp; c *= tmp; ... 

1) Is it guaranteed to give the same answers? I suspect not, but confirmation would be welcome.

2) If x is extremely large or extremely small, I expect this to cause a significant loss of accuracy. Is there a formula that tells me how much accuracy I will sacrifice?

3) Perhaps there is no convenient formula, but can we at least formulate a rule of thumb for when these numerical instabilities become a problem? Does it depend on the values of the operands, or perhaps on the difference between the values of the operands?

Tags: c, ieee-754, numeric
4 answers

1) No, identical answers are not guaranteed. Even with IEEE 754 arithmetic, subtle rounding effects can make a/x and a*(1/x) differ by 1 or 2 ULP.

2) If x is extremely small (smaller than DBL_MIN, the minimum normalized positive value, i.e. a subnormal), 1/x is INF with a total loss of precision. A potentially significant loss of accuracy also occurs when x is very large, so that 1/x becomes subnormal, especially when the FP model does not support subnormals.
By testing |x| against the largest finite value <= 1/DBL_MIN and the smallest nonzero value >= 1/DBL_MAX, code can determine when significant loss of accuracy begins. A formula would likely depend on the FP model used and on the exponent of x, as well as on the limits of the model. Within the binary64 range, the difference between the binary exponent of x and Emin (or Emax) would be a first-order estimate of the bits lost.

3) Significant numerical instabilities occur in the ranges discussed above.


When you work with IEEE 754 numbers, I would say your approach is quite acceptable. In IEEE 754, x is represented approximately as

mantissa * 2^exponent

where the mantissa is a number between 1/2 and 1.

So, as long as you only multiply and divide, you will certainly lose some accuracy, but the loss does not depend on the value of x (*) and is bounded only by the precision of the floating-point type used (single, double or quadruple precision, which means float, double or long double depending on the compiler and architecture).

(*) This is only true as long as you do not overflow or underflow, i.e. roughly beyond 10^38 for single precision or 10^308 for double precision.

References: the IEEE 754 and floating-point arithmetic pages on Wikipedia.


Here are some considerations with supporting links:

1) Will you get the same results? No guarantees. There are too many contributors to variability, everything from the microprocessor design (remember the 486DX math co-processor design error?) to the compiler implementation, to the way floats are stored in hardware memory. (Good discussion here.)

2) A formula? I do not know of one. And what do you mean by a significant error? In any case, you can set expectations about the accuracy you will see based on:

  • The particular floating-point implementation in use (this reference compares two)
  • The variable type used ( float , double , long double ) ( differences )
  • The architecture you are building for: 32-bit, 64-bit, something else?

There are many discussions of floating-point error; here is one.

3) There is no real rule of thumb (if the rule of thumb you have in mind is easy to remember, easy to apply and easy to understand); however, here is a good attempt to answer this question about floating-point error.


My two cents on this:

1) No, explained further in 3.

2) I do not know of any formula, so I will just skip this one.

3) The rule of thumb I know is to try to perform operations between operands of a similar order of magnitude.

A practical example:

You want to divide the number 63,000,000 by 1,000,000.

Using the first approach, you end up dividing 63 * 10^6 by 1 * 10^6, which have very close magnitudes.

However, if you use the second approach, then

 temp = 1.0f / x; 

will produce 10^(-6).

Now the multiplication 63 * 10^6 * 10^(-6) will lead to a significant loss of accuracy, because the difference between the magnitudes is large; the CPU has to reconcile the exponent-plus-fraction representation of 63 * 10^6 with that of 10^(-6)...

A viable alternative would be to compute temp in steps:

 temp = 1.0f / 1000; 

and then

 a = a * temp * temp; 

Since the magnitudes are closer, there is less likelihood of precision loss.

