How to implement floating point division in binary format without hardware without floating point equipment

Question

How to implement floating point division in binary format without hardware without floating point equipment

I am wondering how to implement IEEE-754 single-precision 32-bit floating point division in binary format without separation equipment and without floating point equipment?

I have hardware, addition, subtraction and multiplication.

I have already implemented multiplication, addition and subtraction with floating point using 16-bit words.

I follow these instructions on a proprietary multi-core processor and write my code to the assembly. In advance, I use matlab to test my algorithm.

I know that I need to subtract the exponents, but how can I perform unsigned division into mantissas?

+2

assembly floating-point binary matlab division

Veridian Jan 25 '12 at 22:12

source share

2 answers

If your equipment has a sufficiently fast integer factor (say 4-5 cycles), using an iterative approach to calculating recip = 1 / divisor is likely to be the fastest approach. Then you would calculate the factor as the recipient of dividends *. This is very useful for necessary fixed-point calculations if the hardware offers either integer multiplication with a double-width result (i.e., Full product), or the "mulhi" instruction, which delivers the top 32 bits of the product from two 32-bit integers.

You will need to re-scale the fixed-point calculations on the fly in order to keep the intermediate results at 32 bits in order to get a result that matches the 24 bits that are needed for the final result after rounding.

The fastest approach, most likely, creates a 9-bit initial approximation "approximate" to 1 / divisor, followed by cubically convergent iteration for the opposite:

e = 1 - divisor * approx e = e * e + e recip = e * approx + approx

The easiest way is to compile the initial approximation and store it in an array of 256 bytes, indexed by bits 23:16 of the divider (i.e. the 8 most significant fractional bits of the mantissa). Since all approximation values are numbers in the range 0x100 ... 0x1FF (corresponding to 0.5-0.998046875), there is no need to store the most significant bit of each value, since it will be "1". Just add 0x100 to the table element obtained to get the final value of the initial approximation.

If you cannot afford 256 bytes of table storage, an alternative way to generate an initial approximation with an accuracy of 9 bits will be a polynomial of degree 3, which approximates 1 / (1 + f), where f is the fractional part of the mantissa divisor, m. Since with IEEE- 754 it is known that m is in [1.0.2.0], f is in [0.0.1.0].

Correct rounding to the nearest or even (if necessary) can be realized by multiplying the preliminary ratio by the divisor to determine the remainder and choosing the final quotient so that the remainder is minimized.

The following code demonstrates the approximation principles discussed above using the simpler case of the opposite, with proper rounding in accordance with the nearest or even IEEE-754 mode and with the corresponding handling of special cases (zero, denormal, infinity, NaNs). It is contemplated that the 32-bit IEEE-754 single-point float can be transmitted in bits from and to a 32-bit unsigned int. Then all operations are performed with 32-bit integers.

 unsigned int frcp_rn_core (unsigned int z) { unsigned int x, y; int sign; int expo; sign = z & 0x80000000; expo = (z >> 23) & 0xff; x = expo - 1; if (x > 0xfd) { /* handle special cases */ x = z << 1; /* zero or small denormal returns infinity of like sign */ if (x <= 0x00400000) { return sign | 0x7f800000; } /* infinity returns zero of like sign */ else if (x == 0xff000000) { return sign; } /* convert SNaNs to QNaNs */ else if (x > 0xff000000) { return z | 0x00400000; } /* large denormal, normalize it */ else { y = x < 0x00800000; z = x << y; expo = expo - y; } } y = z & 0x007fffff; #if USE_TABLE z = 256 + rcp_tab[y >> 15]; /* approx */ #else x = y << 3; z = 0xe39ad7a0; z = mul_32_hi (x, z) + 0x0154bde4; z = mul_32_hi (x, z) + 0xfff87521; z = mul_32_hi (x, z) + 0x00001ffa; z = z >> 4; #endif /* USE_TABLE */ y = y | 0x00800000; /* cubically convergent approximation to reciprocal */ x = (unsigned int)-(int)(y * z); /* x = 1 - arg * approx */ x = mul_32_hi (x, x) + x; /* x = x * x + x */ z = z << 15; z = mul_32_hi (x, z) + z; /* approx = x * approx + approx */ /* compute result exponent */ expo = 252 - expo; if (expo >= 0) { /* result is a normal */ x = y * z + y; z = (expo << 23) + z; z = z | sign; /* round result */ if ((int)x <= (int)(y >> 1)) { z++; } return z; } /* result is a denormal */ expo = -expo; z = z >> expo; x = y * z + y; z = z | sign; /* round result */ if ((int)x <= (int)(y >> 1)) { z++; } return z; }

The mul_32_high () function is a placeholder for a specific machine that returns the top 32 bits of a signed multiplication of two 32-bit bit integers. Instead of a portable implementation, instead of a specific machine version

 /* 32-bit int, 64-bit long long int */ int mul_32_hi (int a, int b) { return (int)(unsigned int)(((unsigned long long)(((long long)a)*b)) >> 32); }

The reciprocal table of 256 records used by the table variant can be constructed as follows:

 static unsigned char rcp_tab[256]; for (int i = 0; i < 256; i++) { rcp_tab[i] = (unsigned char)(((1./(1.+((i+.5)/256.)))*512.+.5)-256.); }

+2

njuffa Jan 28 '12 at 23:01

source share

Trevor tippins · Accepted Answer · 2012-01-25T22:29:19+0000

Depends on how difficult it is to do. Keeping this simple enough, you can try division by mutual approximation.

Instead of calculating: (n / d) you would develop: n * (1 / d).

To do this, you will need to develop a response method, using, for example, Newton-Raphson , which uses the Newton method to compute successively more accurate estimates of the reciprocity of the divisor until it is “adequately” accurate for your purpose before you make the final multiplication step .

EDIT

Just looked at your update. It may or may not be useful to you in the end!

How to implement floating point division in binary format without hardware without floating point equipment

More articles: