Convert double to float without using FPU rounding mode

Does anyone have a handy code snippet for converting an IEEE 754 double to the nearest float immediately below (or above) it, without changing the FPU's current rounding mode or assuming anything about it?

Note: this constraint probably implies not using the FPU at all. I expect the easiest way to do this under these conditions is to read the bits of the double into a 64-bit integer and work with that.

For simplicity, you can assume the endianness of your choice, and that the double in question is available through the d field of the union below:

union double_bits
{
  long i;
  double d;
};

I would try to do this myself, but I am sure I would introduce hard-to-notice bugs for denormalized or negative numbers.

+5
3 answers

I think the following works, but first I will state my assumptions:

  • Floating-point numbers are stored in IEEE 754 format on your implementation,
  • No overflow occurs,
  • nextafterf() is available (it is specified in C99).

Also, this method is most likely not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

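Here, I convert the double to a float, which gives one of the two floats adjacent to the double, whichever rounding mode is active. I then convert that float back to a double, which is exact, and compare it with the original. If it lies on the wrong side of the original, I use nextafterf() to step to the adjacent float in the required direction.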

+3

See: fooobar.com/questions/14515/... .

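    #include <stdint.h>
    #include <string.h>

    // d points to an IEEE binary64 value; double is not natively supported.
    // The mantissa is truncated (rounded toward zero). Only values in the
    // normal float range are rebased correctly: float subnormals, overflow,
    // infinities and NaNs are not handled.
    static float ConvertDoubleToFloat(const void* d)
    {
        uint64_t x;
        uint32_t bits;
        float f; // assumed to be IEEE binary32
        uint64_t sign;
        uint64_t exponent;
        uint64_t mantissa;

        memcpy(&x, d, 8);

        // IEEE binary64 format (unsupported)
        sign     = (x >> 63) & 1;                      // 1 bit
        exponent = (x >> 52) & 0x7FF;                  // 11 bits
        mantissa = x & 0x000FFFFFFFFFFFFFULL;          // 52 bits

        if (exponent == 0) {
            // zero or binary64 subnormal: truncates to a signed zero
            bits = (uint32_t)(sign << 31);
        } else {
            // IEEE binary32 format (supported)
            exponent = (exponent - 1023 + 127) & 0xFF; // rebase the bias
            mantissa >>= (52 - 23);                    // drop the 29 low bits
            bits = (uint32_t)(mantissa | (exponent << 23) | (sign << 31));
        }

        // copy from a 32-bit object so the result does not depend on byte order
        memcpy(&f, &bits, 4);

        return f;
    }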
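For the directed rounding asked for in the question, truncation alone is not enough: it yields the lower float for positive inputs but the upper one for negative inputs. Below is a rough integer-only sketch that also tries to cover float subnormals, overflow, infinities and NaNs. The function names and the upper parameter are hypothetical, and it assumes IEEE 754 binary64/binary32 layouts with integers and floating-point values sharing the same byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: maps a binary64 bit pattern to the binary32 bit
   pattern of the nearest float below (upper == 0) or above (upper != 0),
   using integer operations only. */
static uint32_t double_bits_to_float_bits(uint64_t x, int upper)
{
    uint32_t sign = (uint32_t)(x >> 63);
    int      exp  = (int)((x >> 52) & 0x7FF);
    uint64_t man  = x & 0x000FFFFFFFFFFFFFULL;
    uint32_t mag;      /* float bits without the sign, truncated toward zero */
    int      inexact;  /* nonzero if truncation discarded any set bits */

    if (exp == 0x7FF) {                        /* infinity or NaN */
        if (man == 0)
            return (sign << 31) | 0x7F800000u;
        return (sign << 31) | 0x7FC00000u | (uint32_t)(man >> 29); /* quiet NaN */
    }

    if (exp == 0) {                            /* zero or binary64 subnormal */
        mag = 0;
        inexact = (man != 0);
    } else {
        int fe = exp - 1023 + 127;             /* exponent rebased for float */
        if (fe >= 255) {                       /* too large: clamp to FLT_MAX */
            mag = 0x7F7FFFFFu;
            inexact = 1;
        } else if (fe >= 1) {                  /* normal float */
            mag = ((uint32_t)fe << 23) | (uint32_t)(man >> 29);
            inexact = (man & 0x1FFFFFFFULL) != 0;
        } else {                               /* subnormal float or underflow */
            uint64_t sig = man | (1ULL << 52); /* restore the implicit 1 */
            int shift = 30 - fe;               /* at least 30 */
            if (shift > 52) {                  /* below the smallest subnormal */
                mag = 0;
                inexact = 1;
            } else {
                mag = (uint32_t)(sig >> shift);
                inexact = (sig & ((1ULL << shift) - 1)) != 0;
            }
        }
    }

    /* Rounding up moves positive values away from zero; rounding down moves
       negative values away from zero. Adding 1 to the magnitude bits steps to
       the next float, carrying into the exponent (or into infinity). */
    if (inexact && (upper ? !sign : sign))
        mag += 1;

    return (sign << 31) | mag;
}

/* Hypothetical convenience wrapper; memcpy only copies bytes. */
static float double_to_float_directed_bits(double d, int upper)
{
    uint64_t x;
    uint32_t r;
    float f;

    memcpy(&x, &d, sizeof x);
    r = double_bits_to_float_bits(x, upper);
    memcpy(&f, &r, sizeof f);
    return f;
}
```

For example, the bit pattern of 0.1 (0x3FB999999999999A) maps to 0x3DCCCCCC when rounding down and to 0x3DCCCCCD when rounding up. Working on the sign-free magnitude and adding 1 only when bits were discarded keeps the negative and subnormal cases on the same code path as the normal ones.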
+2
