After various additional experiments, I have convinced myself that a simple argument transformation that does not use higher precision than the argument and the result cannot achieve a tighter error bound than the one achieved by the first variant in the code I posted.
Since my question is about minimizing the error of the argument transformation that is incurred in addition to the error in log1pf(), the most straightforward approach for experimentation is to use a correctly rounded implementation of that logarithm function. Note that a correctly rounded implementation is quite unlikely to exist in the context of a high-performance environment. According to the works of J.-M. Muller et al., faithfully rounded single-precision results can be obtained by computing in x87 extended precision, for example:
float accurate_log1pf (float a)
{
    float res;
    __asm fldln2;               // push log(2) onto the x87 stack
    __asm fld dword ptr [a];    // push a
    __asm fyl2xp1;              // ST(1) * log2(ST(0) + 1) = log(1+a); pops once
    __asm fstp dword ptr [res]; // store result and pop, leaving the stack empty
    return res;
}
The implementation of asinhf() using the first option from my question is as follows:
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
        t = accurate_log1pf (t);
    }
    return copysignf (t, a); // restore sign
}
Exhaustive testing with all 2^32 IEEE-754 binary32 operands shows that the maximum error of 1.49486070 ulp occurs at ±0x1.ff5022p-9, and that there are 353,521,140 incorrectly rounded results. What happens if the entire argument transformation is performed in double-precision arithmetic? The code changes to:
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        double tt = fa;
        tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
        t = (float)tt;
        t = accurate_log1pf (t);
    }
    return copysignf (t, a); // restore sign
}
However, the error bound does not improve with this change! The maximum error of 1.49486070 ulp still occurs at ±0x1.ff5022p-9, and there are now slightly fewer incorrectly rounded results than before. The problem is that a float operand cannot convey enough information into log1pf() to produce more accurate results. A similar problem occurs when computing sinf() and cosf(): if the reduced argument, represented as a correctly rounded float operand, is passed to the core polynomials, the resulting error in sinf() and cosf() is just slightly below 1.5 ulp, exactly as we observe here with my_asinhf().
One solution is to compute the transformed argument to higher than single precision, for example as a double-float, i.e. a pair of float operands (a useful overview of double-float techniques can be found in this paper by Andrew Thall). In that case we can use the additional information in the tail to linearly interpolate the result, based on the knowledge that the derivative of the logarithm is the reciprocal: since d/dt log1p(t) = 1/(1+t), we have log1p(t+s) ≈ log1p(t) + s/(1+t) for a tiny tail s. This gives us:
float my_asinhf (float a)
{
    float fa, s, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        double tt = fa;
        tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
        t = (float)tt;               // "head" of double-float
        s = (float)(tt - (double)t); // "tail" of double-float
        t = fmaf (s, 1.0f / (1.0f + t), accurate_log1pf (t)); // interpolate
    }
    return copysignf (t, a); // restore sign
}
An exhaustive test of this version shows that the maximum error has been reduced to 0.99999948 ulp, occurring at ±0x1.deeea0p-22. There are 349,653,534 incorrectly rounded results. A faithfully rounded implementation of asinhf() has been achieved.
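For concreteness, here is what such an exhaustive test can look like. This is only a minimal sketch, not my actual test framework: it assumes my_asinhf() from above is in scope, uses double-precision asinh() from <math.h> as the reference (a reasonable proxy, though not itself correctly rounded), and ulp_error is a simplified helper of my own.

#include <float.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* simplified estimate of the error of res, in ulps of float, against ref */
static double ulp_error (float res, double ref)
{
    int expo;
    float rf = (float)ref;
    frexpf ((rf == 0.0f) ? FLT_MIN : rf, &expo);
    return fabs ((double)res - ref) / ldexp (1.0, expo - 24); /* one ulp near rf */
}

int main (void)
{
    uint32_t i = 0;
    uint64_t incorrect = 0;
    double err, max_err = 0.0;
    float x, res;
    do {
        memcpy (&x, &i, sizeof x); /* enumerate every binary32 bit pattern */
        if (!isnan (x)) {
            double ref = asinh ((double)x);
            res = my_asinhf (x);
            err = ulp_error (res, ref);
            if (err > max_err) max_err = err;
            if (res != (float)ref) incorrect++; /* count mismatched roundings */
        }
        i++;
    } while (i != 0); /* wraps around after 0xffffffff */
    printf ("max. ulp error = %.8f  incorrectly rounded results = %llu\n",
            max_err, (unsigned long long)incorrect);
    return 0;
}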
Unfortunately, the practical usefulness of this result is limited. Depending on the hardware platform, the throughput of double arithmetic operations may be anywhere from 1/2 to 1/32 of the throughput of float operations. The double-precision computation could be replaced by a double-float computation, but that would incur significant cost as well. Finally, my approach here was to use the single-precision implementation as a proxy for future double-precision work, and many hardware platforms (certainly all the ones of interest to me) do not offer hardware support for a numeric format with higher precision than IEEE-754 binary64 (double precision). Therefore, any solution should not require higher-precision arithmetic in intermediate computations.
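For illustration of why double-float arithmetic is costly, its basic building blocks are the well-known error-free transformations covered in the Thall paper referenced above. The following is a minimal sketch (type and function names are mine); every double-float addition, division, or square root is composed of several such primitives plus normalization steps:

#include <math.h>

typedef struct { float hi, lo; } float2; /* double-float: value = hi + lo */

/* Knuth's TwoSum: hi + lo == a + b exactly, without comparing magnitudes */
static float2 two_sum (float a, float b)
{
    float2 r;
    float t;
    r.hi = a + b;
    t    = r.hi - a;
    r.lo = (a - (r.hi - t)) + (b - t);
    return r;
}

/* FMA-based TwoProd: hi + lo == a * b exactly */
static float2 two_prod (float a, float b)
{
    float2 r;
    r.hi = a * b;
    r.lo = fmaf (a, b, -r.hi); /* recovers the rounding error of a*b */
    return r;
}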
Since all the troublesome arguments in the case of asinhf() are small in magnitude, one could [partially?] address the accuracy problem by using a polynomial minimax approximation for arguments near the origin. However, since this would create another code branch, it would most likely complicate vectorization.
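As a sketch of what such a small-argument branch might look like: the coefficients below are simply the truncated Taylor coefficients of asinh(x) = x - x^3/6 + 3x^5/40 - 5x^7/112 + ..., shown for illustration only; an actual implementation would substitute minimax coefficients (e.g. generated with a tool such as Sollya), and the function name is hypothetical.

#include <math.h>

/* hypothetical small-argument branch: asinh(x) for |x| << 1 as an odd
   polynomial in x, evaluated in Horner fashion with fmaf */
static float asinhf_poly_small (float x)
{
    float s = x * x;
    float p =       -0x1.6db6dbp-5f;  // ~ -5/112  (x^7 term)
    p = fmaf (p, s,  0x1.333334p-4f); // ~  3/40   (x^5 term)
    p = fmaf (p, s, -0x1.555556p-3f); // ~ -1/6    (x^3 term)
    return fmaf (p * s, x, x);        // x + x^3 * poly(x^2)
}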