Before pointing out a possible simple implementation of a quasi-Newton optimization routine in CUDA, let me spend a few words on how a quasi-Newton optimizer works.
Consider a function $f$ of $N$ real variables $x$ and perform a second-order expansion around some point $x_i$:

$$f(x) \simeq f(x_i) + \nabla f(x_i)\cdot(x - x_i) + \frac{1}{2}\,(x - x_i)^T A\,(x - x_i),$$

where $A$ is the Hessian matrix.
To find the minimum starting from the point $x_i$, Newton's method consists of enforcing

$$\nabla f(x) = \nabla f(x_i) + A\,(x - x_i) = 0,$$
which entails

$$x = x_i - A^{-1}\,\nabla f(x_i)$$
and, in turn, requires knowing the inverse of the Hessian. Moreover, to reduce the function value, the update direction

$$x - x_i = -A^{-1}\,\nabla f(x_i)$$
should be such that

$$\nabla f(x_i)\cdot(x - x_i) < 0,$$
whence it follows that

$$\nabla f(x_i)^T\, A^{-1}\, \nabla f(x_i) > 0.$$
According to the above inequality, the Hessian matrix must be positive definite. Unfortunately, the Hessian is not guaranteed to be positive definite, especially far from the minimum of $f$; therefore, using the inverse of the Hessian, besides being computationally expensive, can also be harmful, pushing the procedure even further from the minimum, towards regions of increasing values of $f$. Generally speaking, it is more convenient to use a quasi-Newton method, i.e., an approximation of the inverse Hessian which is kept positive definite and is updated iteration after iteration, converging to the inverse of the Hessian itself. A rough justification of the quasi-Newton method is the following. Consider

$$\nabla f(x) = \nabla f(x_i) + A\,(x - x_i)$$
and

$$\nabla f(x) = \nabla f(x_{i+1}) + A\,(x - x_{i+1}).$$
Subtracting the two equations, we obtain the update rule of the Newton procedure

$$x_{i+1} - x_i = A^{-1}\left[\nabla f(x_{i+1}) - \nabla f(x_i)\right].$$
The update rule of the quasi-Newton procedure is instead

$$x_{i+1} - x_i = H_{i+1}\left[\nabla f(x_{i+1}) - \nabla f(x_i)\right],$$
where $H_{i+1}$ is the above mentioned matrix approximating the inverse of the Hessian, updated step after step.
There are several rules for updating $H_{i+1}$, and I will not go into the details of this point. A very common one is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) rule, but in many cases the Polak-Ribière scheme is quite effective as well.
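For concreteness, a standard form of the BFGS update of the inverse-Hessian approximation (not written out in the text above) is

$$H_{i+1} = \left(I - \frac{s_i\, y_i^T}{y_i^T s_i}\right) H_i \left(I - \frac{y_i\, s_i^T}{y_i^T s_i}\right) + \frac{s_i\, s_i^T}{y_i^T s_i},
\qquad s_i = x_{i+1} - x_i,\quad y_i = \nabla f(x_{i+1}) - \nabla f(x_i),$$

which preserves symmetry and, as long as $y_i^T s_i > 0$, the positive definiteness of $H_{i+1}$.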
The CUDA implementation can follow the same steps of the classic Numerical Recipes approach, taking into account that:
1) vector and matrix operations can be effectively implemented with CUDA Thrust or cuBLAS (see the sketch after this list);
2) the control logic can be performed by the CPU;
3) the line minimization, including bracketing and root finding, can be performed on the CPU, accelerating only the cost function and gradient evaluations on the GPU.
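As a minimal sketch of that division of labor, the search direction $p = -H\,\nabla f$ and the descent check can be carried out entirely on the device with cuBLAS, while only a scalar is returned to the host-side control logic. The names (`search_direction`, `d_H`, `d_g`, `d_p`) are illustrative and not taken from the linked code.

```cpp
// Sketch, not the linked implementation: it assumes the inverse-Hessian approximation
// H (N x N, column-major), the gradient g and the search direction p already live in
// device memory, and that cost/gradient evaluation is provided elsewhere on the GPU.
#include <cublas_v2.h>

void search_direction(cublasHandle_t handle, int N,
                      const double* d_H, const double* d_g, double* d_p)
{
    // p = -H * g: a single cuBLAS matrix-vector product, entirely on the device
    const double minus_one = -1.0, zero = 0.0;
    cublasDgemv(handle, CUBLAS_OP_N, N, N, &minus_one, d_H, N, d_g, 1, &zero, d_p, 1);

    // Descent check g' * p < 0: only this scalar goes back to the host,
    // where the control logic (line search, convergence tests) runs.
    double gTp = 0.0;
    cublasDdot(handle, N, d_g, 1, d_p, 1, &gTp);

    if (gTp >= 0.0) {
        // If H has lost positive definiteness, fall back to steepest descent
        cublasDcopy(handle, N, d_g, 1, d_p, 1);
        cublasDscal(handle, N, &minus_one, d_p, 1);
    }
}
```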
According to the above scheme, the unknowns, the gradients and the Hessian approximation can be stored on the device, without any need to move them back and forth between host and device.
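For instance, the quantities entering the $H_{i+1}$ update, $s_i = x_{i+1} - x_i$ and $y_i = \nabla f(x_{i+1}) - \nabla f(x_i)$, can be formed directly on the GPU with Thrust, with only the curvature scalar $y_i^T s_i$ reaching the host. This is a sketch under the assumption that the optimizer state is held in thrust::device_vector containers; all names are illustrative.

```cpp
// Sketch: the optimizer state (iterates and gradients) lives on the device for the
// whole run; only a scalar crosses the host/device boundary. Names are illustrative.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

double quasi_newton_differences(const thrust::device_vector<double>& x_old,
                                const thrust::device_vector<double>& x_new,
                                const thrust::device_vector<double>& g_old,
                                const thrust::device_vector<double>& g_new,
                                thrust::device_vector<double>& s,  // s_i = x_{i+1} - x_i
                                thrust::device_vector<double>& y)  // y_i = grad f(x_{i+1}) - grad f(x_i)
{
    // Element-wise differences, computed entirely on the device
    thrust::transform(x_new.begin(), x_new.end(), x_old.begin(), s.begin(), thrust::minus<double>());
    thrust::transform(g_new.begin(), g_new.end(), g_old.begin(), y.begin(), thrust::minus<double>());

    // Curvature scalar y_i' * s_i, needed by the host-side update logic
    return thrust::inner_product(y.begin(), y.end(), s.begin(), 0.0);
}
```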
Please also note that there are approaches in the literature which attempt to parallelize the line minimization as well; see
Y. Fei, G. Rong, B. Wang, W. Wang, "Parallel L-BFGS-B algorithm on GPU", Computers & Graphics, vol. 40, 2014, pp. 1-9.
The full CUDA implementation is available on this GitHub page, which ports the Numerical Recipes approach based on linmin, mnbrak and dbrent to the GPU, parallelizing the cost function and gradient evaluations. This implementation uses the Polak-Ribière scheme, but it can be easily generalized to other quasi-Newton optimization schemes.
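As a final illustration, the Polak-Ribière direction update itself maps naturally onto a few device-side BLAS calls. The sketch below is not the repository's code; the max(0, ...) restart is a common variant of the scheme, and all names are illustrative.

```cpp
// Sketch of a Polak-Ribiere direction update kept on the device; g_old, g_new and the
// previous direction d are assumed to be device arrays of length N.
#include <cublas_v2.h>
#include <algorithm>

void polak_ribiere_update(cublasHandle_t handle, int N,
                          const double* d_g_old, const double* d_g_new, double* d_dir)
{
    // beta = max(0, g_new' * (g_new - g_old) / (g_old' * g_old)); only scalars reach the host
    double gnew_gnew = 0.0, gnew_gold = 0.0, gold_gold = 0.0;
    cublasDdot(handle, N, d_g_new, 1, d_g_new, 1, &gnew_gnew);
    cublasDdot(handle, N, d_g_new, 1, d_g_old, 1, &gnew_gold);
    cublasDdot(handle, N, d_g_old, 1, d_g_old, 1, &gold_gold);
    const double beta = std::max(0.0, (gnew_gnew - gnew_gold) / gold_gold);

    // d = -g_new + beta * d, performed in place on the GPU
    const double minus_one = -1.0;
    cublasDscal(handle, N, &beta, d_dir, 1);
    cublasDaxpy(handle, N, &minus_one, d_g_new, 1, d_dir, 1);
}
```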