In the hacked vein, IEEE 754 is another solution that is faster and less magical. It reaches the error threshold of 0.8% in about a dozen clock cycles (for the case p = 2.4, on the Intel Merom processor).
Floating-point numbers were originally invented as an approximation to logarithms, so you can use an integer value as an approximation to log2 . This can be made somewhat possible by applying the convert-from-integer command to a floating point value to get a different floating point value.
To complete the calculation of pow , you can multiply by a constant coefficient and convert the logarithm back using the convert-to-integer command. In SSE, the corresponding instructions are cvtdq2ps and cvtps2dq .
It is not so simple. The exponent field in IEEE 754 is signed with an offset value of 127 representing a measure of zero. This bias must be removed before multiplying the logarithm and adding it again before you increase the risk. In addition, subtraction bias correction will not work to zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor in advance.
x^p = exp2( p * log2( x ) ) = exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 ) = cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) ) = cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) = cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) ) = cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is a constant factor. This function is quite specialized: it will not work with small fractional indicators, because the constant coefficient grows exponentially with the inverse of the exponent and will overflow. It will not work with negative indicators. Large exponents result in a high error because the mantissa bits are mixed with the exponent bits by multiplication.
But these are just 4 quick instructions. Pre-multiply, convert from the "whole" (to the logarithm), multiply the power, convert to the "whole" (from the logarithm). Conversions are very fast in this SSE implementation. We can also compress an additional constant coefficient into the first multiplication.
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; }
Several studies with a score of = 2.4 show that this consistently overestimates by about 5%. (A subroutine always guarantees reevaluation.) You can simply multiply by 0.95, but a few more instructions will lead us to 4 decimal digits of accuracy, which should be enough for graphics.
The key should correspond to revaluation with underestimation and take an average value.
- Calculate x ^ 0.8: four commands, error ~ + 3%.
- Calculate x ^ -0.4: one
rsqrtps . (This is accurate enough, but sacrifices the ability to work with zero.) - Compute x ^ 0.4: one
mulps . - Calculate x ^ -0.2: one
rsqrtps . - Compute x ^ 2: one
mulps . - Compute x ^ 3: one
mulps . - x ^ 2.4 = x ^ 2 * x ^ 0.4: one
mulps . This is an overvaluation. - x ^ 2.4 = x ^ 3 * x ^ -0.4 * x ^ -0.2: two
mulps . This is underestimated. - Average above: one
addps , one mulps .
Team score: fourteen, including two conversions with latency = 5 and two mutual estimates of the square root with throughput = 4.
In order to correctly accept the average value, we want to weigh the estimates based on their expected errors. Underestimation increases the margin of error to 0.6 versus 0.4, so we expect it to be 1.5x as erroneous. Weighing does not add any instructions; this can be done in the prefactor. Call coefficient a: a ^ 0.5 = 1.5 a ^ -0.75 and a = 1.38316186.
The final error is about 0.015%, or 2 orders of magnitude better than the original fastpow result. The runtime is about a dozen cycles for a busy cycle with volatile source and destination variables ... although it overlaps iterations, the level of parallelism at the instruction level will also be displayed in the real world. Given SIMD, this is the throughput of one scalar result in 3 cycles!
int main() { __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 ); std::printf( "Input: %,vg\n", x0 ); // Approx 5% accuracy from one call. Always an overestimate. __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 ); std::printf( "Direct x^2.4: %,vg\n", x1 ); // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 ); std::printf( "1.38 x^0.8: %,vg\n", xf ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( x0, x0 ); __m128 x3 = _mm_mul_ps( x2, x0 ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); std::printf( "x^2 * x^0.4: %,vg\n", x2 ); std::printf( "x^3 / x^0.6: %,vg\n", x3 ); x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) ); // Final accuracy about 0.015%, 200x better than x^0.8 calculation. std::printf( "average = %,vg\n", x2 ); }
Ok ... sorry I couldn't post this before. And expanding it to x ^ 1 / 2.4 is left as an exercise; v).
Statistics update
I implemented a small test wiring and two cases x ( 5/12 ) which correspond above.
#include <cstdio> #include <xmmintrin.h> #include <cmath> #include <cfloat> #include <algorithm> using namespace std; template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; } __m128 pow125_4( __m128 arg ) { // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( arg, arg ); __m128 x3 = _mm_mul_ps( x2, arg ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) ); } __m128 pow512_2( __m128 arg ) { // 5/12 is too small, so compute the sqrt of 10/12 instead. __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg ); return _mm_mul_ps( _mm_rsqrt_ps( x ), x ); } __m128 pow512_4( __m128 arg ) { // 5/12 is too small, so compute the 4th root of 20/12 instead. // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow. // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3 __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg ); __m128 xover = _mm_mul_ps( arg, xf ); __m128 xfm1 = _mm_rsqrt_ps( xf ); __m128 x2 = _mm_mul_ps( arg, arg ); __m128 xunder = _mm_mul_ps( x2, xfm1 ); // sqrt2 * over + 2 * sqrt2 * under __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ), _mm_add_ps( xover, xunder ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); return xavg; } __m128 mm_succ_ps( __m128 arg ) { return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) ); } void test_pow( double p, __m128 (*f)( __m128 ) ) { __m128 arg; for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON ); ! isfinite( _mm_cvtss_f32( f( arg ) ) ); arg = mm_succ_ps( arg ) ) ; for ( ; _mm_cvtss_f32( f( arg ) ) == 0; arg = mm_succ_ps( arg ) ) ; std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) ); int n; int const bucket_size = 1 << 25; do { float max_error = 0; double total_error = 0, cum_error = 0; for ( n = 0; n != bucket_size; ++ n ) { float result = _mm_cvtss_f32( f( arg ) ); if ( ! isfinite( result ) ) break; float actual = ::powf( _mm_cvtss_f32( arg ), p ); float error = ( result - actual ) / actual; cum_error += error; error = std::abs( error ); max_error = std::max( max_error, error ); total_error += error; arg = mm_succ_ps( arg ); } std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n", max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) ); } while ( n == bucket_size ); } int main() { std::printf( "4 insn x^12/5:\n" ); test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > ); std::printf( "14 insn x^12/5:\n" ); test_pow( 12./5, & pow125_4 ); std::printf( "6 insn x^5/12:\n" ); test_pow( 5./12, & pow512_2 ); std::printf( "14 insn x^5/12:\n" ); test_pow( 5./12, & pow512_4 ); }
Output:
4 insn x^12/5: Domain from 1.36909e-23 error max = inf avg = inf |avg| = inf to 8.97249e-19 error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14 error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09 error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553 error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513 error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06 error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10 error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15 error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16 14 insn x^12/5: Domain from 1.42795e-19 error max = inf avg = nan |avg| = nan to 9.35823e-15 error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10 error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05 error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411 error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629 error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10 error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12 6 insn x^5/12: Domain from 9.86076e-32 error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27 error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22 error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17 error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12 error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07 error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125 error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512 error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07 error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12 error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17 error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21 error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26 error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31 error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36 error max = 0.999329 avg = nan |avg| = nan to -0 14 insn x^5/12: Domain from 2.64698e-23 error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18 error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13 error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09 error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281 error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32 error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06 error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11 error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15 error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect that the accuracy of the more accurate 5/12 is limited to the rsqrt operation.