Optimizations for pow () with constant non-integer exponent?

Question

Optimizations for pow () with constant non-integer exponent?

I have hot spots in my code where I do pow() , taking about 10-20% of the execution time.

My contribution to pow(x,y) very specific, so I wonder if there is a way to flip two pow() approximations (one for each metric) with higher performance:

I have two constant indicators: 2.4 and 1 / 2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1 / 2.4, x will be in the range (0.0031308, 1.0).
I use SSE / AVX float vectors. If platform features can be used, right on!

The maximum error rate of about 0.01% is ideal, although I am also interested in full accuracy algorithms (for float ).

I already use pow() fast approximation but it does not take these restrictions into account. Is it possible to do better?

+57

c ++ optimization c math exponent

Cory Nelson Jun 25 '11 at 2:17

source share

10 answers

Another answer, because it is very different from my previous answer, and it flashes quickly. The relative error is 3e-8. Want more accuracy? Add a few more Chebychev terms. It is better to keep the order odd, as this leads to a small gap between the 2 ^ n-epsilon and 2 ^ n + epsilon.

 #include <stdlib.h> #include <math.h> // Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error). // Want more precision? Add more Chebychev polynomial coefs. double pow512norm ( double x) { static const int N = 8; // Chebychev polynomial terms. // Non-zero terms calculated via // integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12) // from -1 to 1 // Zeroth term is similar except it uses 1/pi rather than 2/pi. static const double Cn[N] = { 1.1758200232996901923, 0.16665763094889061230, -0.0083154894939042125035, 0.00075187976780420279038, // Wolfram alpha doesn't want to compute the remaining terms // to more precision (it times out). -0.0000832402, 0.0000102292, -1.3401e-6, 1.83334e-7}; double Tn[N]; double u = 2.0*x - 3.0; Tn[0] = 1.0; Tn[1] = u; for (int ii = 2; ii < N; ++ii) { Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2]; } double y = 0.0; for (int ii = N-1; ii >= 0; --ii) { y += Cn[ii]*Tn[ii]; } return y; } // Returns x^(5/12) to within 3e-8 (relative error). double pow512 ( double x) { static const double pow2_512[12] = { 1.0, pow(2.0, 5.0/12.0), pow(4.0, 5.0/12.0), pow(8.0, 5.0/12.0), pow(16.0, 5.0/12.0), pow(32.0, 5.0/12.0), pow(64.0, 5.0/12.0), pow(128.0, 5.0/12.0), pow(256.0, 5.0/12.0), pow(512.0, 5.0/12.0), pow(1024.0, 5.0/12.0), pow(2048.0, 5.0/12.0) }; double s; int iexp; s = frexp (x, &iexp); s *= 2.0; iexp -= 1; div_t qr = div (iexp, 12); if (qr.rem < 0) { qr.quot -= 1; qr.rem += 12; } return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot); }

Addendum: What is going on here?
The request below explains how the above code works.

Overview
The above code defines two functions: double pow512norm (double x) and double pow512 (double x) . The latter is the entry point to the set; this is the function that the user code must call to calculate x ^ (5/12). The pow512norm(x) function uses Chebyshev polynomials to approximate x ^ (5/12), but only for x in the range [1,2]. (Use pow512norm(x) for x values outside this range, and the result will be garbage.)

The function pow512(x) splits the pow512(x) x into a pair (double s, int n) such that x = s * 2^n and such that 1≤ s <2. Further decomposition of n into (int q, unsigned int r) is that n = 12*q + r and r less than 12, allows me to divide the problem of finding x ^ (5/12) into parts:

x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv) ^ a = (u ^ a) (v ^ a) for positive u, v and real a.
s^(5/12) calculated through pow512norm(s) .
(2^n)^(5/12)=(2^(12*q+r))^(5/12) using a replacement.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a, b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) with a few manipulations.
(2^r)^(5/12) calculated from the pow2_512 lookup pow2_512 .
Calculate pow512norm(s)*pow2_512[qr.rem] and we are almost there. Here qr.rem is the value of r calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to get the desired result.
This is exactly what the ldexp math library function does.

Feature approximation
The goal here is to come up with an easily computable approximation f (x) = x ^ (5/12), which is "good enough" for the problem at hand. Our approximation should be close to f (x) in a sense. The rhetorical question: what does "close to" mean? Two competing interpretations minimize the standard error and minimize the maximum absolute error.

I will use the stock market analogy to describe the difference between the two. Suppose you want to save for your possible retirement. If you are twenty-three years old, it is best to invest in stocks or stock markets. This is due to the fact that for a fairly long period of time, the stock market on average surpasses any other investment scheme. However, we have all seen times when investing in stocks is very bad. If you are in your fifties or sixties (or 40, if you want to resign), you need to invest a little more conservatively. These recessions can damage your retirement portfolio.

Return to function approximation: as a consumer, in some approximation, you are usually worried about the worst case error, not average performance. Use some approximation designed to maximize average performance (such as least squares), and Murphy’s law requires your program to spend a lot of time using an approximation where performance is much worse than average. What you want is a minimax approximation, which minimizes the maximum absolute error over some domain. A good math library will use the minimax approach rather than the least squares method, because it allows the math library authors to give some guaranteed performance for their library.

Mathematical libraries usually use a polynomial or rational polynomial to approximate some function f (x) over a certain area a≤x≤b. Suppose that the function f (x) is analytic in this domain and you want to approximate the function by some polynomial p (x) of degree N. For a given degree N, there exists some magic unique polynomial p (x) such that p (x) -f (x ) has N + 2 extrema over [a, b] and such that the absolute values of these extrema N + 2 are equal to each other. The search for this magic polynomial p (x) is the holy grail of function approximators.

I have not found the holy grail for you. Instead, I used the Chebyshev approximation. Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very good features when it comes to approximating a function. The Chebyshev approximation is often very close to this magic polynomial p (x). (In fact, the Remez exchange algorithm that this polynomial holy graph finds usually starts with the Chebyshev approximation.)

pow512norm (x)
This function uses the Chebyshev approximation to find some polynomial p * (x), which approximates x ^ (5/12). Here I use p * (x) to distinguish this Chebyshev approximation from the magic polynomial p (x) described above. The Chebyshev approximation p * (x) is easy to find; finding p (x) is a bear. The Chebyshev approximation p * (x) is equal to sum_i Cn [i] * Tn (i, x), where Cn [i] are the Chebyshev coefficients, and Tn (i, x) are the Chebyshev polynomials estimated at the point x.

I used Wolfram alpha to find the Chebyshev Cn coefficients for me. For example, computes Cn[1] . In the first field after the input field there is the desired answer, 0.166658 in this case. These are not as many numbers as we would like. Click "More digits" and voila, you will get a lot more numbers. Tungsten alpha is free; is there a limit to how many calculations he will do. It falls within this limit for members of a higher order. (If you buy or have access to math, you can calculate these high order coefficients with a high degree of accuracy.)

Chebyshev polynomials Tn (x) are calculated in the array Tn . Besides getting something very close to the magic polynomial p (x), another reason for using the Chebyshev approximation is that the values of these Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x , and then compute iteratively Tn[i]=2*x*Tn[i-1] - Tn[i-2] . (I used "ii" as the index variable, not "i" in my code. I never use "i" as the name of the variable. How many words in English does the word "i" have in the word? Consecutive 'i's?)

pow512 (x)
pow512 is the function that the user code should call. I have already described the basics of this function above. A few more details. The math library function frexp(x) returns the value of s and the exponent of iexp for input x . (Minor problem: I want s between 1 and 2 for use with pow512norm , but frexp returns a value between 0.5 and 1.) The math library function div returns the coefficient and remainder for integer division by one wave of foop. Finally, I use the ldexp math library function to assemble three parts for the final answer.

+32

David Hammen Jun 25 2018-11-15T00:

source share

, , , pow() . :

Pow log: pow(a,b)=x(logx(a)*b) . - , x 2. , :
 a=M*2E 
, :
 log2(a)=log2(M)+E 
:
 log2(a)~=E 
, , - . , , , , , .
, - , , .

+20

PengOne 25 . '11 2:29

source share

-, . , . , 1.0/2.4, 5/12 1/3 * (1 + 1/4). cbrt ( ) sqrt (!), - , pow(). (: -O3, : i686-apple-darwin10-g++ - 4.2.1).

 #include <math.h> // cmath does not provide cbrt; C99 does. double xpow512 (double x) { double cbrtx = cbrt(x); return cbrtx*sqrt(sqrt(cbrtx)); }

+16

David Hammen 25 . '11 2:57

source share

.

2.4f 1/2.4f , , sRGB RGB. , , , . , .

, . Something like:

 __attribute__((aligned(64)) static const unsigned short SRGB_TO_LINEAR[256] = { ... }; __attribute__((aligned(64)) static const unsigned short LINEAR_TO_SRGB[256] = { ... }; void apply_lut(const unsigned short lut[256], unsigned char *src, ...

16- , . 16 , 8- . , , , sRGB , 16-/8- sRGB.

(, sRGB , , HDR , sRGB , .)

+15

Dietrich Epp 25 . '11 3:22

source share

Binomial , , [1,2). (, (1 + x) ^ a). , , .

+3

zvrba 25 . '11 5:49

source share

2.4 2.4, lirp , , , in-betweem, ( , .)

* 2/5s, , 5- . 5- Newton - , , , , , , exp log .

+1

Michael Dorgan 25 . '11 2:51

source share

, . , , . , , x pow(x, n) , , pow(x + delta, n) delta , ( ). , , , . , . , , , delta , , .

+1

Permaquid 26 . '11 21:24

source share

, , RGB RGB RGB. , . 0,0144%.

 inline double poly7(double x, double a, double b, double c, double d, double e, double f, double g, double h) { double ab, cd, ef, gh, abcd, efgh, x2, x4; x2 = x*x; x4 = x2*x2; ab = a*x + b; cd = c*x + d; ef = e*x + f; gh = g*x + h; abcd = ab*x2 + cd; efgh = ef*x2 + gh; return abcd*x4 + efgh; } inline double srgb_to_linear(double x) { if (x <= 0.04045) return x / 12.92; // Polynomial approximation of ((x+0.055)/1.055)^2.4. return poly7(x, 0.15237971711927983387, -0.57235993072870072762, 0.92097986411523535821, -0.90208229831912012386, 0.88348956209696805075, 0.48110797889132134175, 0.03563925285274562038, 0.00084585397227064120); } inline double linear_to_srgb(double x) { if (x <= 0.0031308) return x * 12.92; // Piecewise polynomial approximation (divided by x^3) // of 1.055 * x^(1/2.4) - 0.055. if (x <= 0.0523) return poly7(x, -6681.49576364495442248881, 1224.97114922729451791383, -100.23413743425112443219, 6.60361150127077944916, 0.06114808961060447245, -0.00022244138470139442, 0.00000041231840827815, -0.00000000035133685895) / (x*x*x); return poly7(x, -0.18730034115395793881, 0.64677431008037400417, -0.99032868647877825286, 1.20939072663263713636, 0.33433459165487383613, -0.01345095746411287783, 0.00044351684288719036, -0.00000664263587520855) / (x*x*x); }

sollya, :

 suppressmessage(174); f = ((x+0.055)/1.055)^2.4; p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative); p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative); print("relative:", dirtyinfnorm((fp)/f, [s;1])); print("absolute:", dirtyinfnorm((fp), [s;1])); print(canonical(p)); s = 0.0523; z = 3; f = 1.055 * x^(1/2.4) - 0.055; p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z; print("relative:", dirtyinfnorm((fp)/f, [0.0031308;s])); print("absolute:", dirtyinfnorm((fp), [0.0031308;s])); print(canonical(p)); p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z; print("relative:", dirtyinfnorm((fp)/f, [s;1])); print("absolute:", dirtyinfnorm((fp), [s;1])); print(canonical(p));

+1

orlp 23 . '16 3:27

source share

, powf(x, p) = x^p x x=2^(log2(x)) , powf(x,p) = 2^(p*log2(x)) , exp2() log2() . p , , p 0 ≤ x ≤ 1 .

p > 1 0 ≤ x ≤ 1 , p = 12/5 = 2.4 , :

 float pow12_5(float x){ float mp; // Minimax horner polynomials for x^(5/12), Note: choose the accurarcy required then implement with fma() [Fused Multiply Accumulates] // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3 mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4 // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5 // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5 return(mp); }

, p < 1 0 ≤ x ≤ 1 . [ ] , y=x^p=x^(p+m)/x^m , m=1,2,3 , p > 1 , , .

, , x :

 x = mx* 2^(ex) where 1 ≤ mx < 2 y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12 = mx^(5/12) * 2^(k/12) * 2^(ey)

mx^(5/12) 1 ≤ mx < 2 , , , 12 LUT 2^(k/12) . Code below:

 float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0}; float pow5_12(float x){ union{float f; uint32_t u;} v, e2; float poff, m, e, ei; int xe; vf = x; xe = ((vu >> 23) - 127); if(xe < -127) return(0.0f); // Calculate remainder k in 2^(k/12) to find LUT e = xe * (5.0f/12.0f); ei = floorf(e); poff = powk_12LUT[(int)(12.0f * (e - ei))]; e2.u = ((int)ei + 127) << 23; // Calculate the exponent vu = (vu & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero // Approximate mx^(5/12) on [1,2), with appropriate degree minimax // m = 0x8.87592p-4 + vf * (0x8.8f056p-4 + vf * (-0x1.134044p-4)); // 7.6125e-4 // m = 0x7.582138p-4 + vf * (0xb.1666bp-4 + vf * (-0x2.d21954p-4 + vf * 0x6.3ea0cp-8)); // 8.4522726e-5 m = 0x6.9465cp-4 + vf * (0xd.43015p-4 + vf * (-0x5.17b2a8p-4 + vf * (0x1.6cb1f8p-4 + vf * (-0x2.c5b76p-8)))); // 1.04091259e-5 // m = 0x6.08242p-4 + vf * (0xf.352bdp-4 + vf * (-0x7.d0c1bp-4 + vf * (0x3.4d153p-4 + vf * (-0xc.f7a42p-8 + vf * 0x1.5d840cp-8)))); // 1.367401e-6 return(m * poff * e2.f); }

0

nimig18 15 . '17 17:15

source share

Potatoswatter · Accepted Answer · 2011-06-26 20:45

In the hacked vein, IEEE 754 is another solution that is faster and less magical. It reaches the error threshold of 0.8% in about a dozen clock cycles (for the case p = 2.4, on the Intel Merom processor).

Floating-point numbers were originally invented as an approximation to logarithms, so you can use an integer value as an approximation to log2 . This can be made somewhat possible by applying the convert-from-integer command to a floating point value to get a different floating point value.

To complete the calculation of pow , you can multiply by a constant coefficient and convert the logarithm back using the convert-to-integer command. In SSE, the corresponding instructions are cvtdq2ps and cvtps2dq .

It is not so simple. The exponent field in IEEE 754 is signed with an offset value of 127 representing a measure of zero. This bias must be removed before multiplying the logarithm and adding it again before you increase the risk. In addition, subtraction bias correction will not work to zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor in advance.

 x^p = exp2( p * log2( x ) ) = exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 ) = cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) ) = cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) = cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) ) = cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )

exp2( 127 / p - 127 ) is a constant factor. This function is quite specialized: it will not work with small fractional indicators, because the constant coefficient grows exponentially with the inverse of the exponent and will overflow. It will not work with negative indicators. Large exponents result in a high error because the mantissa bits are mixed with the exponent bits by multiplication.

But these are just 4 quick instructions. Pre-multiply, convert from the "whole" (to the logarithm), multiply the power, convert to the "whole" (from the logarithm). Conversions are very fast in this SSE implementation. We can also compress an additional constant coefficient into the first multiplication.

 template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; }

Several studies with a score of = 2.4 show that this consistently overestimates by about 5%. (A subroutine always guarantees reevaluation.) You can simply multiply by 0.95, but a few more instructions will lead us to 4 decimal digits of accuracy, which should be enough for graphics.

The key should correspond to revaluation with underestimation and take an average value.

Calculate x ^ 0.8: four commands, error ~ + 3%.
Calculate x ^ -0.4: one rsqrtps . (This is accurate enough, but sacrifices the ability to work with zero.)
Compute x ^ 0.4: one mulps .
Calculate x ^ -0.2: one rsqrtps .
Compute x ^ 2: one mulps .
Compute x ^ 3: one mulps .
x ^ 2.4 = x ^ 2 * x ^ 0.4: one mulps . This is an overvaluation.
x ^ 2.4 = x ^ 3 * x ^ -0.4 * x ^ -0.2: two mulps . This is underestimated.
Average above: one addps , one mulps .

Team score: fourteen, including two conversions with latency = 5 and two mutual estimates of the square root with throughput = 4.

In order to correctly accept the average value, we want to weigh the estimates based on their expected errors. Underestimation increases the margin of error to 0.6 versus 0.4, so we expect it to be 1.5x as erroneous. Weighing does not add any instructions; this can be done in the prefactor. Call coefficient a: a ^ 0.5 = 1.5 a ^ -0.75 and a = 1.38316186.

The final error is about 0.015%, or 2 orders of magnitude better than the original fastpow result. The runtime is about a dozen cycles for a busy cycle with volatile source and destination variables ... although it overlaps iterations, the level of parallelism at the instruction level will also be displayed in the real world. Given SIMD, this is the throughput of one scalar result in 3 cycles!

 int main() { __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 ); std::printf( "Input: %,vg\n", x0 ); // Approx 5% accuracy from one call. Always an overestimate. __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 ); std::printf( "Direct x^2.4: %,vg\n", x1 ); // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 ); std::printf( "1.38 x^0.8: %,vg\n", xf ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( x0, x0 ); __m128 x3 = _mm_mul_ps( x2, x0 ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); std::printf( "x^2 * x^0.4: %,vg\n", x2 ); std::printf( "x^3 / x^0.6: %,vg\n", x3 ); x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) ); // Final accuracy about 0.015%, 200x better than x^0.8 calculation. std::printf( "average = %,vg\n", x2 ); }

Ok ... sorry I couldn't post this before. And expanding it to x ^ 1 / 2.4 is left as an exercise; v).

Statistics update

I implemented a small test wiring and two cases x ^{( _5/12 )} which correspond above.

 #include <cstdio> #include <xmmintrin.h> #include <cmath> #include <cfloat> #include <algorithm> using namespace std; template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden > __m128 fastpow( __m128 arg ) { __m128 ret = arg; // std::printf( "arg = %,vg\n", ret ); // Apply a constant pre-correction factor. ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. ) * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) ); // std::printf( "scaled = %,vg\n", ret ); // Reinterpret arg as integer to obtain logarithm. asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "log = %,vg\n", ret ); // Multiply logarithm by power. ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) ); // std::printf( "powered = %,vg\n", ret ); // Convert back to "integer" to exponentiate. asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) ); // std::printf( "result = %,vg\n", ret ); return ret; } __m128 pow125_4( __m128 arg ) { // Lower exponents provide lower initial error, but too low causes overflow. __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg ); // Imprecise 4-cycle sqrt is still far better than fastpow, good enough. __m128 xfm4 = _mm_rsqrt_ps( xf ); __m128 xf4 = _mm_mul_ps( xf, xfm4 ); // Precisely calculate x^2 and x^3 __m128 x2 = _mm_mul_ps( arg, arg ); __m128 x3 = _mm_mul_ps( x2, arg ); // Overestimate of x^2 * x^0.4 x2 = _mm_mul_ps( x2, xf4 ); // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6. __m128 xfm2 = _mm_rsqrt_ps( xf4 ); x3 = _mm_mul_ps( x3, xfm4 ); x3 = _mm_mul_ps( x3, xfm2 ); return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) ); } __m128 pow512_2( __m128 arg ) { // 5/12 is too small, so compute the sqrt of 10/12 instead. __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg ); return _mm_mul_ps( _mm_rsqrt_ps( x ), x ); } __m128 pow512_4( __m128 arg ) { // 5/12 is too small, so compute the 4th root of 20/12 instead. // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow. // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3 __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg ); __m128 xover = _mm_mul_ps( arg, xf ); __m128 xfm1 = _mm_rsqrt_ps( xf ); __m128 x2 = _mm_mul_ps( arg, arg ); __m128 xunder = _mm_mul_ps( x2, xfm1 ); // sqrt2 * over + 2 * sqrt2 * under __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ), _mm_add_ps( xover, xunder ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) ); return xavg; } __m128 mm_succ_ps( __m128 arg ) { return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) ); } void test_pow( double p, __m128 (*f)( __m128 ) ) { __m128 arg; for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON ); ! isfinite( _mm_cvtss_f32( f( arg ) ) ); arg = mm_succ_ps( arg ) ) ; for ( ; _mm_cvtss_f32( f( arg ) ) == 0; arg = mm_succ_ps( arg ) ) ; std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) ); int n; int const bucket_size = 1 << 25; do { float max_error = 0; double total_error = 0, cum_error = 0; for ( n = 0; n != bucket_size; ++ n ) { float result = _mm_cvtss_f32( f( arg ) ); if ( ! isfinite( result ) ) break; float actual = ::powf( _mm_cvtss_f32( arg ), p ); float error = ( result - actual ) / actual; cum_error += error; error = std::abs( error ); max_error = std::max( max_error, error ); total_error += error; arg = mm_succ_ps( arg ); } std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n", max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) ); } while ( n == bucket_size ); } int main() { std::printf( "4 insn x^12/5:\n" ); test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > ); std::printf( "14 insn x^12/5:\n" ); test_pow( 12./5, & pow125_4 ); std::printf( "6 insn x^5/12:\n" ); test_pow( 5./12, & pow512_2 ); std::printf( "14 insn x^5/12:\n" ); test_pow( 5./12, & pow512_4 ); }

Output:

 4 insn x^12/5: Domain from 1.36909e-23 error max = inf avg = inf |avg| = inf to 8.97249e-19 error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14 error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09 error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553 error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513 error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06 error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10 error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15 error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16 14 insn x^12/5: Domain from 1.42795e-19 error max = inf avg = nan |avg| = nan to 9.35823e-15 error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10 error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05 error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411 error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629 error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10 error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12 6 insn x^5/12: Domain from 9.86076e-32 error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27 error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22 error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17 error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12 error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07 error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125 error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512 error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07 error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12 error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17 error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21 error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26 error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31 error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36 error max = 0.999329 avg = nan |avg| = nan to -0 14 insn x^5/12: Domain from 2.64698e-23 error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18 error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13 error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09 error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281 error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32 error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06 error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11 error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15 error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19

I suspect that the accuracy of the more accurate 5/12 is limited to the rsqrt operation.

Optimizations for pow () with constant non-integer exponent?

Statistics update

More articles: