How do I run gradient descent when the parameter space is constrained?

I would like to maximize a function of one parameter.

So I run gradient descent (or, in fact, ascent): I start from an initial parameter and keep adding the gradient (multiplied by a learning-rate coefficient that gets smaller and smaller), re-evaluate the gradient at the new parameter, and so on until convergence.
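Roughly, my loop looks like this (a Python sketch with illustrative names; `grad_f` stands for whatever computes my gradient):

```python
def gradient_ascent(grad_f, x0, lr=0.1, decay=0.99, tol=1e-8, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        step = lr * grad_f(x)   # move along the gradient (ascent)
        if abs(step) < tol:     # treat a tiny step as convergence
            break
        x += step
        lr *= decay             # learning rate gets smaller and smaller
    return x
```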

But there is one problem: my parameter must stay positive, so it must not become <= 0, because otherwise my function is undefined. Still, my gradient search sometimes wanders into such regions (the parameter was positive, the gradient told it to go a little lower, and it overshoots).

And to make matters worse, the gradient at such a point can be negative, driving the search toward even more negative parameter values. (The reason is that the objective function contains logs, but the gradient does not.)

What are some good (simple) algorithms that deal with this constrained-optimization problem? I'm hoping for a simple fix to my algorithm. Or should I perhaps ignore the gradient and do some kind of line search for the optimal parameter?

+6
optimization function math gradient
7 answers
  • Each time you update your parameter, check whether it has become negative, and if so, clamp it to zero.
  • If clamping to zero is unacceptable, try adding a "log barrier" (Google it). Essentially, it adds a smooth, "soft" wall to your objective function (and a corresponding modification to your gradient) that pushes the search away from regions you don't want it in. You then re-run the optimization, making the wall harder and closer to vertical each time, but starting from the previously found solution. In the limit (in practice only a few iterations are needed), the problem you are solving becomes identical to the original problem with a hard constraint. (A rough sketch follows this list.)
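A rough sketch of the barrier idea in Python (illustrative only; `grad_f` is your own gradient, and the schedule for shrinking the barrier weight `mu` is an arbitrary choice):

```python
def ascend(grad, x, lr=0.01, iters=2000):
    """Plain inner gradient-ascent loop (illustrative)."""
    for _ in range(iters):
        x = x + lr * grad(x)
    return x

def barrier_maximize(grad_f, x0, mu=1.0, shrink=0.1, rounds=6):
    """Maximize f(x) subject to x > 0 by maximizing f(x) + mu*log(x) instead.
    The barrier gradient is f'(x) + mu/x, which blows up near x = 0 and
    pushes the search away from the boundary. Each round shrinks mu
    (hardens the wall) and restarts from the previous solution."""
    x = x0
    for _ in range(rounds):
        x = ascend(lambda t: grad_f(t) + mu / t, x)
        mu *= shrink
    return x
```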
+8

Without knowing more about your problem, it is hard to give specific advice. Your gradient ascent algorithm may not be particularly well suited to your function space. Still, given what you have, here is one tweak that will help.

You keep track of what you believe is the ascending gradient. But when you step forward along that gradient, you land in a pit of negative value. That means there was a local maximum nearby, and also a very sharp negative-gradient cliff. The obvious fix is to backtrack to your previous position and take a smaller step (for example, half the size). If you still overshoot, halve the step and retry again. Repeat until you find the local maximum at the edge of the cliff.
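A sketch of that backtracking step (Python, illustrative; `f` is the objective being maximized and `grad_f` its gradient, and the positivity check reflects the asker's x > 0 constraint):

```python
def ascent_with_backtracking(f, grad_f, x0, lr=0.1, max_iter=1000, tol=1e-12):
    """Gradient ascent that halves the step whenever it would step off the
    'cliff' (leave the region where f is defined, or make f worse)."""
    x = x0
    for _ in range(max_iter):
        step = lr * grad_f(x)
        while x + step <= 0 or f(x + step) < f(x):
            step *= 0.5          # back off: retry with half the step
            if abs(step) < tol:
                return x         # pinned at the edge: local maximum
        x += step
    return x
```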

The problem is that there is no guarantee your local maximum is the global one (unless you know more about your function than you have let on). This is the main limitation of naive gradient ascent - it always latches onto the first local maximum and converges to it. If you do not want to switch to a more robust algorithm, one simple approach that can help is to run n iterations of your code, each time starting from a random position in the function space, and keep the best maximum found overall. This Monte Carlo approach improves your odds of stumbling upon the global maximum, at the cost of increasing your running time by a factor of n. How effective it is depends on the bumpiness of your objective function.
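A sketch of the restart loop (illustrative; `maximize_once` is whatever single-run climber you already have, and the sampling range for the starting points is an assumption):

```python
import random

def random_restarts(f, maximize_once, n=20, low=1e-6, high=100.0):
    """Run the local maximizer from n random starting points and keep
    the best result found overall."""
    best_x, best_val = None, float("-inf")
    for _ in range(n):
        x = maximize_once(random.uniform(low, high))
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```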

+3

A simple trick for constraining a parameter to be positive is to re-parameterize the problem in terms of its logarithm (be sure to transform the gradient accordingly). Of course, with this transformation it is possible that the optimum moves off to -infinity and the search never converges.

+2

Clamp the parameter to be positive at each step. This is (in short) the projected gradient method, which you may want to Google.
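In code it is one extra line per update (a sketch; the small `eps` keeping the parameter strictly positive is an arbitrary choice):

```python
def projected_gradient_ascent(grad_f, x0, lr=0.1, eps=1e-8, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        x = x + lr * grad_f(x)
        x = max(x, eps)   # project back onto the feasible set x > 0
    return x
```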

+2

I have three suggestions, depending on how much thinking and work you want to do.

Firstly, in your gradient descent/ascent you move along the gradient times a certain coefficient, which you call the "learning rate". If, as you describe, this move causes x to become negative, there are two natural interpretations: either the gradient was too big, or the learning rate was too big. Since you cannot control the gradient, take the second interpretation. Check whether your move would make x negative, and if so, halve the learning rate and try again.

Secondly, to elaborate on Aniko's answer, let x be your parameter and f(x) your function. Then define a new function g(x) = f(e^x) and note that although the domain of f is (0, infinity), the domain of g is (-infinity, infinity). So g cannot suffer from the problems f suffers from. Use gradient ascent to find the value x_0 that maximizes g; then e^(x_0), which is positive, maximizes f. To run gradient ascent on g you need its derivative, which by the chain rule is f'(e^x) * e^x.
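A sketch of that reparameterization in code (illustrative; `f_prime` is your existing derivative of f):

```python
import math

def maximize_via_log_reparam(f_prime, x0, lr=0.01, max_iter=5000):
    """Ascend g(y) = f(e^y) over y, so the implied x = e^y is always positive.
    By the chain rule, g'(y) = f'(e^y) * e^y."""
    y = math.log(x0)
    for _ in range(max_iter):
        x = math.exp(y)
        y += lr * f_prime(x) * x   # chain-rule gradient of g
    return math.exp(y)             # back-transform to the positive maximizer of f
```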

Thirdly, it sounds like you are trying to maximize just one particular function, not to write a general maximization routine. You could consider abandoning gradient descent entirely and tailoring an optimization method to the features of your particular function. We would need to know a lot more about the expected behavior of f to help you with that.

+2

Just use Brent's method for minimization. It is stable and fast, and the right choice when you have only one parameter. It is what the R optimize function uses. The link also contains a simple C++ implementation. And yes, you can supply MIN and MAX bounds for the parameter.
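For instance, if SciPy happens to be available, its bounded scalar minimizer (a Brent-style method) does this in a few lines (a sketch; the objective and the bounds below are placeholders):

```python
from math import log
from scipy.optimize import minimize_scalar

def f(x):
    return log(x) - x        # placeholder objective to be MAXIMIZED

# Minimize -f over a positive interval; pick bounds that bracket your maximum.
res = minimize_scalar(lambda x: -f(x), bounds=(1e-9, 100.0), method="bounded")
print(res.x)                 # maximizer of f, guaranteed to stay inside the bounds
```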

+1

You have good answers here; reparameterization is what I would recommend. But have you also considered a different kind of search, such as Metropolis-Hastings? It is actually quite simple once you wade through the scary-looking math, and it gives you standard errors as well as the optimum.
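A minimal random-walk Metropolis sketch for a single positive parameter (illustrative; `log_post` stands in for your log objective/posterior, and the proposal scale is an assumption). Proposing on the log scale keeps every candidate positive; the extra `+ y` terms are the Jacobian of that transform:

```python
import math
import random

def metropolis(log_post, x0, n=10000, step=0.1):
    """Random-walk Metropolis on y = log(x). Returns samples of x; their
    mean and spread give a point estimate and a standard error."""
    y = math.log(x0)
    samples = []
    for _ in range(n):
        y_new = y + random.gauss(0.0, step)
        log_alpha = (log_post(math.exp(y_new)) + y_new) - (log_post(math.exp(y)) + y)
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            y = y_new
        samples.append(math.exp(y))
    return samples
```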

0
