Before pointing out a possible simple implementation of a quasi-Newton optimization routine in CUDA, let me spend a few words on how a quasi-Newton optimizer works.
Consider a function $f$ of $N$ real variables $x$ and perform a second-order expansion around some point $x_i$:

$$f(x) \simeq f(x_i) + \nabla f(x_i)\cdot(x - x_i) + \frac{1}{2}\,(x - x_i)^T A\,(x - x_i),$$

where $A$ is the Hessian matrix.
To find the minimum starting from the point $x_i$, Newton's method consists of enforcing

$$\nabla f(x) = \nabla f(x_i) + A\,(x - x_i) = 0,$$
which entails

$$x = x_i - A^{-1}\,\nabla f(x_i)$$
and, in turn, requires knowing the inverse of the Hessian. Moreover, to reduce the function value, the update direction

$$x - x_i = -A^{-1}\,\nabla f(x_i)$$
should be such that

$$\nabla f(x_i)\cdot(x - x_i) < 0,$$
whence it follows that

$$\nabla f(x_i)^T\, A^{-1}\, \nabla f(x_i) > 0.$$
According to the above inequality, the Hessian matrix must be positive definite. Unfortunately, the Hessian is not guaranteed to be positive definite, especially far from the minimum of $f$; therefore, using the inverse of the Hessian, besides being computationally expensive, can also be harmful, pushing the procedure even further from the minimum, towards regions of increasing values of $f$. Generally speaking, it is more convenient to use a quasi-Newton method, i.e., an approximation of the inverse Hessian which is kept positive definite and is updated iteration after iteration, converging to the inverse of the Hessian itself. A rough justification of the quasi-Newton method is the following. Consider

$$\nabla f(x) = \nabla f(x_i) + A\,(x - x_i)$$
and

$$\nabla f(x) = \nabla f(x_{i+1}) + A\,(x - x_{i+1}).$$
Subtracting the two equations, we obtain the update rule of the Newton procedure

$$x_{i+1} - x_i = A^{-1}\left[\nabla f(x_{i+1}) - \nabla f(x_i)\right].$$
The update rule of the quasi-Newton procedure is instead

$$x_{i+1} - x_i = H_{i+1}\left[\nabla f(x_{i+1}) - \nabla f(x_i)\right],$$
where $H_{i+1}$ is the above mentioned matrix approximating the inverse of the Hessian, updated step after step.
There are several rules for updating $H_{i+1}$, and I will not go into the details of this point. A very common one is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) rule, but in many cases the Polak-Ribière scheme is quite effective as well.
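For concreteness, a standard form of the BFGS update of the inverse-Hessian approximation (not written out in the text above) is

$$H_{i+1} = \left(I - \frac{s_i\, y_i^T}{y_i^T s_i}\right) H_i \left(I - \frac{y_i\, s_i^T}{y_i^T s_i}\right) + \frac{s_i\, s_i^T}{y_i^T s_i},
\qquad s_i = x_{i+1} - x_i,\quad y_i = \nabla f(x_{i+1}) - \nabla f(x_i),$$

which preserves symmetry and, as long as $y_i^T s_i > 0$, the positive definiteness of $H_{i+1}$.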
The CUDA implementation can follow the same steps of the classic Numerical Recipes approach, taking into account that:
1) vector and matrix operations can be effectively implemented with CUDA Thrust or cuBLAS (see the sketch after this list);
2) the control logic can be performed by the CPU;
3) the line minimization, including bracketing and root finding, can be performed on the CPU, accelerating only the cost function and gradient evaluations on the GPU.
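As a minimal sketch of that division of labor, the search direction $p = -H\,\nabla f$ and the descent check can be carried out entirely on the device with cuBLAS, while only a scalar is returned to the host-side control logic. The names (`search_direction`, `d_H`, `d_g`, `d_p`) are illustrative and not taken from the linked code.

```cpp
// Sketch, not the linked implementation: it assumes the inverse-Hessian approximation
// H (N x N, column-major), the gradient g and the search direction p already live in
// device memory, and that cost/gradient evaluation is provided elsewhere on the GPU.
#include <cublas_v2.h>

void search_direction(cublasHandle_t handle, int N,
                      const double* d_H, const double* d_g, double* d_p)
{
    // p = -H * g: a single cuBLAS matrix-vector product, entirely on the device
    const double minus_one = -1.0, zero = 0.0;
    cublasDgemv(handle, CUBLAS_OP_N, N, N, &minus_one, d_H, N, d_g, 1, &zero, d_p, 1);

    // Descent check g' * p < 0: only this scalar goes back to the host,
    // where the control logic (line search, convergence tests) runs.
    double gTp = 0.0;
    cublasDdot(handle, N, d_g, 1, d_p, 1, &gTp);

    if (gTp >= 0.0) {
        // If H has lost positive definiteness, fall back to steepest descent
        cublasDcopy(handle, N, d_g, 1, d_p, 1);
        cublasDscal(handle, N, &minus_one, d_p, 1);
    }
}
```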
According to the above scheme, the unknowns, the gradients and the Hessian approximation can be stored on the device, without any need to move them back and forth between host and device.
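For instance, the quantities entering the $H_{i+1}$ update, $s_i = x_{i+1} - x_i$ and $y_i = \nabla f(x_{i+1}) - \nabla f(x_i)$, can be formed directly on the GPU with Thrust, with only the curvature scalar $y_i^T s_i$ reaching the host. This is a sketch under the assumption that the optimizer state is held in thrust::device_vector containers; all names are illustrative.

```cpp
// Sketch: the optimizer state (iterates and gradients) lives on the device for the
// whole run; only a scalar crosses the host/device boundary. Names are illustrative.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

double quasi_newton_differences(const thrust::device_vector<double>& x_old,
                                const thrust::device_vector<double>& x_new,
                                const thrust::device_vector<double>& g_old,
                                const thrust::device_vector<double>& g_new,
                                thrust::device_vector<double>& s,  // s_i = x_{i+1} - x_i
                                thrust::device_vector<double>& y)  // y_i = grad f(x_{i+1}) - grad f(x_i)
{
    // Element-wise differences, computed entirely on the device
    thrust::transform(x_new.begin(), x_new.end(), x_old.begin(), s.begin(), thrust::minus<double>());
    thrust::transform(g_new.begin(), g_new.end(), g_old.begin(), y.begin(), thrust::minus<double>());

    // Curvature scalar y_i' * s_i, needed by the host-side update logic
    return thrust::inner_product(y.begin(), y.end(), s.begin(), 0.0);
}
```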
Please also note that there are approaches in the literature which attempt to parallelize the line minimization as well; see
Y. Fei, G. Rong, B. Wang, W. Wang, "Parallel L-BFGS-B algorithm on GPU", Computers & Graphics, vol. 40, 2014, pp. 1-9.
The full CUDA implementation is available on this GitHub page, which ports the Numerical Recipes approach based on linmin, mnbrak and dbrent to the GPU, parallelizing the cost function and gradient evaluations. This implementation uses the Polak-Ribière scheme, but it can be easily generalized to other quasi-Newton optimization schemes.
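As a final illustration, the Polak-Ribière direction update itself maps naturally onto a few device-side BLAS calls. The sketch below is not the repository's code; the max(0, ...) restart is a common variant of the scheme, and all names are illustrative.

```cpp
// Sketch of a Polak-Ribiere direction update kept on the device; g_old, g_new and the
// previous direction d are assumed to be device arrays of length N.
#include <cublas_v2.h>
#include <algorithm>

void polak_ribiere_update(cublasHandle_t handle, int N,
                          const double* d_g_old, const double* d_g_new, double* d_dir)
{
    // beta = max(0, g_new' * (g_new - g_old) / (g_old' * g_old)); only scalars reach the host
    double gnew_gnew = 0.0, gnew_gold = 0.0, gold_gold = 0.0;
    cublasDdot(handle, N, d_g_new, 1, d_g_new, 1, &gnew_gnew);
    cublasDdot(handle, N, d_g_new, 1, d_g_old, 1, &gnew_gold);
    cublasDdot(handle, N, d_g_old, 1, d_g_old, 1, &gold_gold);
    const double beta = std::max(0.0, (gnew_gnew - gnew_gold) / gold_gold);

    // d = -g_new + beta * d, performed in place on the GPU
    const double minus_one = -1.0;
    cublasDscal(handle, N, &beta, d_dir, 1);
    cublasDaxpy(handle, N, &minus_one, d_g_new, 1, d_dir, 1);
}
```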