How to normalize floating point values in C++?

Perhaps I am not well versed in the IEEE 754 standard, but given a set of floating point values, either float or double, for example:

 56.543f 3238.124124f 121.3f ... 

you can map them to values from 0 to 1, i.e. normalize them with the appropriate scaling coefficient derived from the maximum and minimum values in the set.

Now, the point is that in this transformation I need much higher accuracy in the destination set, which ranges from 0 to 1, than in the first set, especially if the values in the first set cover a wide range of magnitudes (very large and very small values).

How can a float or double type (or an IEEE 754 type, if you prefer) handle this situation while providing higher accuracy for the second set of values, given that I essentially don't need the integer part?

Or is this not handled at all, and do I need fixed-point math with a completely different type?

+7
c++ double floating-point ieee-754
5 answers

Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation at the top of the significand, so each value carries the same number of binary digits of precision relative to its own magnitude.

When you squeeze your set of floating point values into the range 0..1, the only loss of accuracy you will get comes from the rounding that occurs at the various stages of the process.

If you squeeze by scaling alone, you will lose only a little accuracy near the least significant bits of the mantissa (about 1 or 2 ulp, where ulp means "units in the last place").

If you also need to shift your data, things get more complicated. If your data are all positive, then subtracting the smallest value will not hurt anything. But if your data are a mixture of positive and negative values, then some of your values near zero may lose accuracy.

If you perform all the arithmetic in double precision, you will carry 53 bits of precision through the calculation. If your accuracy requirements fit within that (and they probably do), you will be fine. Otherwise, the exact numerical behavior will depend on the distribution of your data.
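
A minimal sketch of the scale-and-shift normalization discussed here (the function name normalize and the sample data are illustrative, not from the question):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Map each value from [min, max] of the input onto [0, 1].
    // All arithmetic is done in double, so roughly 53 bits of
    // precision are carried through each operation.
    std::vector<double> normalize(const std::vector<double>& in) {
        auto [mn, mx] = std::minmax_element(in.begin(), in.end());
        double range = *mx - *mn;              // assumes max > min
        std::vector<double> out;
        out.reserve(in.size());
        for (double v : in)
            out.push_back((v - *mn) / range);  // shift, then scale
        return out;
    }

    int main() {
        std::vector<double> data{56.543, 3238.124124, 121.3};
        for (double v : normalize(data))
            std::cout << v << '\n';  // 0, 1, and a value in between
    }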

+5

IEEE single and double precision floats have a format in which the exponent and fraction parts have fixed bit widths. So what you describe is not possible (i.e. you will always have unused bit patterns if you only store values from 0 to 1). (See: http://en.wikipedia.org/wiki/Single-precision_floating-point_format )

Are you sure that the 52-bit fraction part of a double is not accurate enough?

Edit: If you use the full range of the floating-point format, you will lose accuracy when normalizing the values. Rounding errors can occur, and sufficiently small values will underflow to 0. If you do not know whether this is a problem for you, do not worry. Otherwise, you need to find another solution, as indicated in the other answers.
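
A quick way to check how much precision a double actually gives you (a small illustration, not part of the original answer):

    #include <iostream>
    #include <limits>

    int main() {
        // 53 significand bits (52 stored + 1 implicit), ~15 decimal digits
        std::cout << std::numeric_limits<double>::digits   << '\n';  // 53
        std::cout << std::numeric_limits<double>::digits10 << '\n';  // 15
        // Spacing between 1.0 and the next representable double
        std::cout << std::numeric_limits<double>::epsilon() << '\n'; // ~2.22e-16
    }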

+3

For higher accuracy, you can try Boost.Multiprecision: http://www.boost.org/doc/libs/1_55_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html .

Note also that for numerically critical operations there are special algorithms that minimize the numerical error introduced by the algorithm itself:

http://en.wikipedia.org/wiki/Kahan_summation_algorithm
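
A compact sketch of Kahan (compensated) summation, following the Wikipedia article linked above:

    #include <vector>

    // Kahan summation: the running compensation c captures the
    // low-order bits that would otherwise be lost when adding a
    // small term to a large running sum.
    double kahan_sum(const std::vector<double>& xs) {
        double sum = 0.0;
        double c   = 0.0;          // running compensation
        for (double x : xs) {
            double y = x - c;      // re-inject previously lost bits
            double t = sum + y;    // low-order bits of y may be lost here
            c = (t - sum) - y;     // recover what was lost
            sum = t;
        }
        return sum;
    }

Note that aggressive compiler options such as -ffast-math may reassociate the arithmetic and optimize the compensation away.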

+2

If you have a selection of double values and you normalize them to the range 0.0 to 1.0, there are a number of sources of accuracy loss, but they are all much smaller than you might suspect.

First, you will lose some precision in the arithmetic operations needed to normalize them, as rounding occurs. This loss is relatively small - a bit or so per operation - and usually fairly random.

Second, the exponent component will no longer make use of the possibility of a positive exponent.

Third, since all the values are positive, the sign bit will also be wasted.

Fourth, if the input space does not contain +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.

But, for the most part, you will waste about 3 bits of information of a 64-bit double in your normalization, one of which is almost inevitable when you are dealing with finite bit widths.

Any 64-bit fixed-point representation of values from 0 to 1 will have a much smaller dynamic range than a double. A double can represent values on the order of 10^-300, while a 64-bit fixed-point representation that includes 1.0 can go no smaller than about 10^-19. (A 64-bit fixed-point representation can represent 1 - 10^-19 as distinct from 1, whereas a double cannot; but a 64-bit fixed-point value cannot represent anything smaller than 2^-64, whereas a double can.)

Some of the numbers above are approximate and may depend on rounding / exact format.
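
The dynamic-range claims above can be checked directly (a small illustration; the fixed-point figure is simply 2^-64, the smallest step of a format whose 64 bits all sit to the right of the binary point):

    #include <cmath>
    #include <iostream>
    #include <limits>

    int main() {
        // Smallest positive normalized and subnormal doubles
        std::cout << std::numeric_limits<double>::min()        << '\n'; // ~2.2e-308
        std::cout << std::numeric_limits<double>::denorm_min() << '\n'; // ~4.9e-324
        // Smallest step of a 64-bit [0, 1] fixed-point format: 2^-64
        std::cout << std::ldexp(1.0, -64) << '\n';                      // ~5.4e-20
        // A double cannot distinguish 1 - 2^-64 from 1
        std::cout << (1.0 - std::ldexp(1.0, -64) == 1.0) << '\n';       // 1 (true)
    }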

+2

Binary floating point values (with an implicit leading 1) are expressed as

 (1+fraction) * 2^exponent where fraction < 1 

Division a / b:

 a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b)) 

Therefore, division and multiplication lose practically no accuracy.

Subtraction a - b:

 a - b = (1+fraction(a)) * 2^exponent(a) - (1+fraction(b)) * 2^exponent(b) 

Therefore, subtraction and addition may lose precision (large - small = large)!

Squeezing a value x from the range [min, max] into [0, 1] via

 (x - min) / (max - min) 

will have problems with accuracy if any subtraction loses accuracy.
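
A small demonstration of that subtraction problem (the values here are chosen purely to illustrate the effect):

    #include <cstdio>

    int main() {
        // Near 1e16 the spacing between adjacent doubles is 2.0, so
        // adding 1.0 is lost entirely: large - small = large.
        double mn = 1e16;
        double mx = 1e16 + 4.0;    // exactly representable
        double x  = 1e16 + 1.0;    // rounds back to 1e16
        double t  = (x - mn) / (mx - mn);
        std::printf("%.17g\n", t); // prints 0, not the exact 0.25
    }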

Answering your question: nothing handles this for you; select the appropriate representation (floating point, fixed point, fraction, multiprecision, ...) for your algorithms and expected data.

+2
