Accurate calculation of modified Bessel functions - Using Netlib Fortran routines in CUDA?

I am dealing with the problem of accurately computing the zeroth-order modified Bessel function of the first kind, I0, in CUDA.

For a long time I used the rational Chebyshev approximation from

J.M. Blair, "Rational Chebyshev approximations for the modified Bessel functions I_0(x) and I_1(x)", Math. Comput., vol. 28, no. 126, pp. 581-583, April 1974

which, compared against the result produced by Matlab, gives an average error on the order of 1e-29. Unfortunately, this seemingly high precision is no longer enough for the new application I am working on.
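For context, the general structure of such an implementation (a small-argument polynomial plus a large-argument asymptotic form) looks like the sketch below. Note that the coefficients here are the classic Abramowitz & Stegun 9.8.1/9.8.2 ones, accurate only to about 1e-7, not Blair's high-precision coefficients; this illustrates the structure, not the precision discussed in this question.

    // Minimal CUDA sketch of I0(x): small-argument polynomial plus
    // large-argument asymptotic form. Coefficients are the classic
    // Abramowitz & Stegun 9.8.1/9.8.2 ones (~1e-7 accuracy), NOT the
    // high-precision Blair coefficients discussed in the question.
    __device__ double bessel_i0(double x)
    {
        double ax = fabs(x);
        if (ax < 3.75) {                      // small-argument polynomial in (x/3.75)^2
            double t = x / 3.75;
            t *= t;
            return 1.0 + t * (3.5156229 + t * (3.0899424 + t * (1.2067492
                     + t * (0.2659732 + t * (0.0360768 + t * 0.0045813)))));
        } else {                              // asymptotic form: exp(x)/sqrt(x) * poly(3.75/x)
            double t = 3.75 / ax;
            return (exp(ax) / sqrt(ax)) *
                   (0.39894228 + t * (0.01328592 + t * (0.00225319
                     + t * (-0.00157565 + t * (0.00916281 + t * (-0.02057706
                     + t * (0.02635537 + t * (-0.01647633 + t * 0.00392377))))))));
        }
    }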

Matlab uses the Fortran routines developed by D.E. Amos:

Amos, D.E., "A Package of Subprograms for Bessel Functions of a Complex Argument and Nonnegative Order", Sandia National Laboratories Report SAND85-1018, May 1985

Amos, D.E., "A Portable Package for Bessel Functions of a Complex Argument and Nonnegative Order", ACM Transactions on Mathematical Software, vol. 12, 1986

which are available for download from the Netlib/amos website.

There are ways to use these Fortran routines in C/C++ code by compiling them into a library file and then calling them through a C/C++ wrapper (see, for example, netlib_wrapping). I am wondering whether there is any way to turn these Fortran routines into device functions that can be called from a CUDA kernel.
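For reference, the host-side linkage such wrappers use typically looks like the sketch below (the wrapper name i0_host is mine, and the trailing-underscore name mangling is compiler-dependent). This only works on the host: compiled Fortran object code cannot be called from device code, which is exactly the limitation in question.

    /* Host-side C wrapper around the Amos routine ZBESI from netlib/amos,
       which computes I_nu(z) for complex z. Fortran passes all arguments
       by reference; the trailing underscore is the usual (but compiler-
       dependent) name mangling. Host only -- not callable from a kernel. */
    extern void zbesi_(double *zr, double *zi, double *fnu, int *kode, int *n,
                       double *cyr, double *cyi, int *nz, int *ierr);

    double i0_host(double x)
    {
        double zr = x, zi = 0.0, fnu = 0.0;   /* argument x + 0i, order 0 */
        double cyr, cyi;                      /* output: I_0(x) = cyr + i*cyi */
        int kode = 1, n = 1, nz, ierr;        /* unscaled result, one member */
        zbesi_(&zr, &zi, &fnu, &kode, &n, &cyr, &cyi, &nz, &ierr);
        return cyr;
    }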

MORE DETAILS ON THE PROBLEM

I have two codes: one written in Matlab and one in CUDA. Both work in three stages:

1) Scaling by the modified Bessel function I0 and zero-padding of the data;

2) FFT;

3) Interpolation.

I compare both against the "exact" result: after step 3), Matlab gives a relative root-mean-square error of 1e-10%, while CUDA gives 1-2%, so I started investigating the cause.

The root-mean-square difference between the outputs of the first step of the two codes, namely 100*sqrt(sum(abs(U_Matlab_step_1-U_CUDA_step_1).^2))/sqrt(sum(abs(U_Matlab_step_1).^2)), is 0% (mean(mean(abs(U_Matlab-U_CUDA)))=6e-29), so I would say this is good. Unfortunately, after step 2) the error rises to 2e-4%. Finally, if I feed step 2) of the CUDA code with the output of step 1) of the Matlab code, the error after step 2) becomes 1e-14%, which makes me think that the source of the inaccuracy is the first step, namely the computation of the modified Bessel function.
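For completeness, here is the same error metric written out in plain C (a minimal sketch assuming real-valued arrays; the Matlab expression above handles complex data through abs):

    #include <math.h>

    /* 100 * sqrt(sum |a-b|^2) / sqrt(sum |a|^2): relative RMS difference
       between two real arrays, in percent. */
    double rel_rms_percent(const double *a, const double *b, int n)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = a[i] - b[i];
            num += d * d;
            den += a[i] * a[i];
        }
        return 100.0 * sqrt(num) / sqrt(den);
    }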

FOR FURTHER DEVELOPMENTS OF THIS DISCUSSION

Check out the NVIDIA Developer Zone

2 answers

I wonder whether this can be explained by differences in floating-point accuracy between the two implementations.

There are a few things to check.

  • CUDA 5 adds several new features that may better suit the format of your calculations. I also think the CUDA math library has included some Bessel functions since version 4, although I am not sure whether that is true or how relevant they are to your problem (see the sketch after this list).
  • Can you write a serial CPU version for testing? That will tell you whether your precision problems are related to optimization, e.g. to the use of 64-bit versus 80-bit number representations. With optimization turned off, your CPU will mostly work with the 80-bit x87 representation (MATLAB probably does this), while with math optimizations enabled the compiler may use the less precise 64-bit SSE representation. This is the difference between the x87 and SSE code paths.
  • Different hardware has slightly different accuracy. For example, GPUs of compute capability 2.0 and later use fused multiply-add (FMA), which is more accurate than a separate multiply followed by an add and can therefore differ from optimized x86 results.
  • Is there a physical reason to take Matlab as the ground truth? Perhaps your algorithm undershoots the results while Matlab overshoots them. This can happen if CUDA fuses operations where Matlab does not.
  • If you need to reproduce the Matlab results, you can try tuning each step of your code by comparing the result under different rounding modes. See the table below.
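Regarding the first point above: newer CUDA toolkits do expose the modified Bessel functions directly in the math API as cyl_bessel_i0 / cyl_bessel_i1; whether their accuracy is sufficient here would need to be measured. A minimal sketch (the kernel and array names are mine, not from the question's code):

    // Evaluate I0 elementwise using the CUDA math API built-in.
    __global__ void scale_by_i0(const double *x, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = cyl_bessel_i0(x[i]);   // zeroth-order modified Bessel I0
    }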

Rounding table

 operation          | plain C expression | rounding-mode intrinsic
 addition           | x + y              | __dadd_[rn|rz|ru|rd](x, y)
 multiplication     | x * y              | __dmul_[rn|rz|ru|rd](x, y)
 fused multiply-add | fma(x, y, z)       | __fma_[rn|rz|ru|rd](x, y, z)
 reciprocal         | 1.0 / x            | __drcp_[rn|rz|ru|rd](x)
 division           | x / y              | __ddiv_[rn|rz|ru|rd](x, y)
 square root        | sqrt(x)            | __dsqrt_[rn|rz|ru|rd](x)

 mode | interpretation
 rn   | round to nearest, ties to even
 rz   | round towards zero
 ru   | round towards +∞
 rd   | round towards -∞

From http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
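As a concrete illustration of the last point, the directed-rounding intrinsics can bracket an individual operation to see how sensitive a given step is to rounding (a minimal sketch; the names are mine):

    // Evaluate the same product rounded towards -inf and +inf; if the
    // bracket [lo, hi] is wide relative to the value, that operation is a
    // candidate source of the observed discrepancy.
    __global__ void rounding_probe(const double *x, const double *y,
                                   double *lo, double *hi, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            lo[i] = __dmul_rd(x[i], y[i]);  // round towards -infinity
            hi[i] = __dmul_ru(x[i], y[i]);  // round towards +infinity
        }
    }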


I found an introductory technical talk that addresses your question. Here is the link to the PDF. So yes, it may be possible, but I could not get the script described there to convert the old Fortran code to CUDA C; perhaps contact the developers directly.

