How Hessian computation works when training a neural network

Can someone explain to me, in a simple and not very mathematical way, what the Hessian is and how it is used in practice when optimizing the learning process of a neural network?


To understand the Hessian, you first need to understand the Jacobian, and to understand the Jacobian you need to understand the derivative:

  • A derivative is a measure of how quickly the value of a function changes as its argument changes. So if you have a function f(x) = x^2, you can compute its derivative and learn how fast f(x + t) changes for sufficiently small t. This gives you knowledge of the basic dynamics of the function.
  • The gradient shows you, for multidimensional functions, the direction of the largest change in value (it is built from the directional derivatives). So given a function, e.g. g(x, y) = -x + y^2, you know that to increase its value you should decrease x and (for y > 0) strongly increase y. This is the basis of gradient-based methods such as steepest descent (used in traditional backpropagation techniques).
  • The Jacobian is yet another generalization, since your function can have many output values, e.g. g(x, y) = (x + 1, x*y, x - y); you then have 3 * 2 partial derivatives, one gradient per output value (one for each of the three outputs), which together form a 3 * 2 matrix of 6 values (see the sketch just after this list).
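
To make the three notions concrete, here is a minimal sketch (not part of the original answer) using JAX's automatic differentiation; the functions f, g, and h simply mirror the toy examples in the list above.

```python
# Derivative, gradient, and Jacobian of the toy examples above, via JAX autodiff.
import jax
import jax.numpy as jnp

def f(x):                      # scalar -> scalar
    return x ** 2

def g(v):                      # R^2 -> R, v = (x, y)
    x, y = v
    return -x + y ** 2

def h(v):                      # R^2 -> R^3: three outputs of two variables
    x, y = v
    return jnp.array([x + 1.0, x * y, x - y])

print(jax.grad(f)(3.0))                         # derivative of x^2 at x = 3 -> 6.0
print(jax.grad(g)(jnp.array([1.0, 2.0])))       # gradient -> [-1., 4.]
print(jax.jacobian(h)(jnp.array([1.0, 2.0])))   # Jacobian -> 3x2 matrix (6 entries)
```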

Now, the derivative shows you the dynamics of the function itself. But you can go one step further: if you can use this dynamics to find the optimum of the function, maybe you can do even better if you also know the dynamics of this dynamics, i.e. if you compute second-order derivatives? That is exactly what the Hessian is: the matrix of second-order derivatives of your function. It captures the dynamics of the derivatives, so how quickly (and in what direction) the change itself changes. It may seem a bit complex at first glance, but if you think about it for a while it becomes quite clear. You want to go in the direction of the gradient, but you do not know "how far" (what the correct step size is). So you define a new, smaller optimization problem, where you ask, "OK, I have this gradient, how can I tell how far to go?", and you solve it in an analogous way, using derivatives (and the derivatives of the derivatives form the Hessian).
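
As a hedged sketch of the idea (not a production optimizer, and assuming the Hessian is invertible at the current point), here is one Newton-style step on a toy loss, using jax.hessian for the second-order derivatives:

```python
# One Newton step on a toy loss: use the Hessian to decide how far to move.
import jax
import jax.numpy as jnp

def loss(w):
    # toy "loss"; in a real network this would be the training loss over the weights
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

w = jnp.array([0.0, 0.0])
g = jax.grad(loss)(w)          # first-order information: which direction to go
H = jax.hessian(loss)(w)       # second-order information: how the gradient changes
step = jnp.linalg.solve(H, g)  # solve H * step = g rather than inverting H
w_new = w - step               # Newton step; on this quadratic it lands on the minimum
print(w_new)                   # -> [ 3., -1.]
```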

You can also look at it in a geometric way: gradient-based optimization approximates your function with a line. You simply try to find the line that is closest to your function at the current point, and it defines the direction of change. Now, lines are quite primitive; maybe we could use more complex shapes, like... parabolas? The second derivative, i.e. Hessian-based methods, simply try to fit a parabola (a quadratic function, f(x) = ax^2 + bx + c) to your current position, and based on this approximation choose the actual step.
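
In symbols (standard material, not from the original answer), fitting the parabola is the second-order Taylor approximation, and minimizing that approximation gives the step used in the sketch above:

```latex
% Fitting a parabola = second-order Taylor approximation around the current point x,
% with g = \nabla f(x) (gradient) and H = \nabla^2 f(x) (Hessian):
f(x + d) \approx f(x) + g^\top d + \tfrac{1}{2}\, d^\top H d
% Minimizing the right-hand side over the step d (set its gradient in d to zero):
H d = -g \quad\Longrightarrow\quad d = -H^{-1} g
```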

A fun fact is that adding momentum to your gradient-based optimization is (under sufficient assumptions) an approximation of Hessian-based optimization, and is much less computationally expensive.
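
For reference, the momentum update being referred to looks like this (a generic sketch with assumed hyperparameter values, reusing the toy loss from the Newton sketch above); the velocity term accumulates past gradients, which is what gives the curvature-like behaviour mentioned above:

```python
# Gradient descent with momentum: a generic sketch, not code from the answer.
import jax
import jax.numpy as jnp

def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

lr, mu = 0.02, 0.9             # learning rate and momentum coefficient (assumed values)
w = jnp.array([0.0, 0.0])
v = jnp.zeros_like(w)          # velocity: exponentially decaying sum of past gradients

for _ in range(200):
    g = jax.grad(loss)(w)
    v = mu * v - lr * g        # blend previous velocity with the new gradient
    w = w + v                  # step along the accumulated velocity

print(w)                       # approaches the minimum at [3., -1.]
```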
