If your equipment has a sufficiently fast integer factor (say 4-5 cycles), using an iterative approach to calculating recip = 1 / divisor is likely to be the fastest approach. Then you would calculate the factor as the recipient of dividends *. This is very useful for necessary fixed-point calculations if the hardware offers either integer multiplication with a double-width result (i.e., Full product), or the "mulhi" instruction, which delivers the top 32 bits of the product from two 32-bit integers.
You will need to re-scale the fixed-point calculations on the fly in order to keep the intermediate results at 32 bits in order to get a result that matches the 24 bits that are needed for the final result after rounding.
The fastest approach, most likely, creates a 9-bit initial approximation "approximate" to 1 / divisor, followed by cubically convergent iteration for the opposite:
e = 1 - divisor * approx e = e * e + e recip = e * approx + approx
The easiest way is to compile the initial approximation and store it in an array of 256 bytes, indexed by bits 23:16 of the divider (i.e. the 8 most significant fractional bits of the mantissa). Since all approximation values โโare numbers in the range 0x100 ... 0x1FF (corresponding to 0.5-0.998046875), there is no need to store the most significant bit of each value, since it will be "1". Just add 0x100 to the table element obtained to get the final value of the initial approximation.
If you cannot afford 256 bytes of table storage, an alternative way to generate an initial approximation with an accuracy of 9 bits will be a polynomial of degree 3, which approximates 1 / (1 + f), where f is the fractional part of the mantissa divisor, m. Since with IEEE- 754 it is known that m is in [1.0.2.0], f is in [0.0.1.0].
Correct rounding to the nearest or even (if necessary) can be realized by multiplying the preliminary ratio by the divisor to determine the remainder and choosing the final quotient so that the remainder is minimized.
The following code demonstrates the approximation principles discussed above using the simpler case of the opposite, with proper rounding in accordance with the nearest or even IEEE-754 mode and with the corresponding handling of special cases (zero, denormal, infinity, NaNs). It is contemplated that the 32-bit IEEE-754 single-point float can be transmitted in bits from and to a 32-bit unsigned int. Then all operations are performed with 32-bit integers.
unsigned int frcp_rn_core (unsigned int z) { unsigned int x, y; int sign; int expo; sign = z & 0x80000000; expo = (z >> 23) & 0xff; x = expo - 1; if (x > 0xfd) { x = z << 1; if (x <= 0x00400000) { return sign | 0x7f800000; } else if (x == 0xff000000) { return sign; } else if (x > 0xff000000) { return z | 0x00400000; } else { y = x < 0x00800000; z = x << y; expo = expo - y; } } y = z & 0x007fffff; #if USE_TABLE z = 256 + rcp_tab[y >> 15]; #else x = y << 3; z = 0xe39ad7a0; z = mul_32_hi (x, z) + 0x0154bde4; z = mul_32_hi (x, z) + 0xfff87521; z = mul_32_hi (x, z) + 0x00001ffa; z = z >> 4; #endif /* USE_TABLE */ y = y | 0x00800000; /* cubically convergent approximation to reciprocal */ x = (unsigned int)-(int)(y * z); /* x = 1 - arg * approx */ x = mul_32_hi (x, x) + x; /* x = x * x + x */ z = z << 15; z = mul_32_hi (x, z) + z; /* approx = x * approx + approx */ /* compute result exponent */ expo = 252 - expo; if (expo >= 0) { /* result is a normal */ x = y * z + y; z = (expo << 23) + z; z = z | sign; /* round result */ if ((int)x <= (int)(y >> 1)) { z++; } return z; } /* result is a denormal */ expo = -expo; z = z >> expo; x = y * z + y; z = z | sign; /* round result */ if ((int)x <= (int)(y >> 1)) { z++; } return z; }
The mul_32_high () function is a placeholder for a specific machine that returns the top 32 bits of a signed multiplication of two 32-bit bit integers. Instead of a portable implementation, instead of a specific machine version
int mul_32_hi (int a, int b) { return (int)(unsigned int)(((unsigned long long)(((long long)a)*b)) >> 32); }
The reciprocal table of 256 records used by the table variant can be constructed as follows:
static unsigned char rcp_tab[256]; for (int i = 0; i < 256; i++) { rcp_tab[i] = (unsigned char)(((1./(1.+((i+.5)/256.)))*512.+.5)-256.); }