Backpropagation is an efficient method of computing gradients in directed computational graphs such as neural networks. It is not a learning method in itself, but rather a neat computational trick that is often used inside learning methods. It is essentially a straightforward application of the chain rule of derivatives, which lets you compute all the necessary partial derivatives in time linear in the size of the graph (whereas a naive gradient computation would scale exponentially with depth).
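To make that concrete, here is a minimal sketch (with made-up toy functions, not any particular library's API) of backprop on a tiny graph y = sin(w*x + b): each node contributes one local derivative, and the chain rule combines them in a single backward sweep, so the cost is linear in the number of nodes.

```python
import math

def forward(x, w, b):
    # Forward pass: record the intermediate value needed for the backward pass.
    z = w * x + b          # linear node
    y = math.sin(z)        # nonlinearity node
    return y, z

def backward(x, z):
    # Backward pass: one local derivative per node, combined by the chain rule.
    dy_dz = math.cos(z)    # d sin(z) / dz
    dy_dw = dy_dz * x      # chain rule: dy/dw = dy/dz * dz/dw
    dy_db = dy_dz * 1.0    # chain rule: dy/db = dy/dz * dz/db
    return dy_dw, dy_db

y, z = forward(x=2.0, w=0.5, b=0.1)
print(backward(2.0, z))    # partial derivatives of y w.r.t. w and b
```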
SGD is one of many optimization methods, namely a first-order optimizer, meaning that it is based on analyzing the gradient of the objective. Consequently, in the context of neural networks it is often used together with backprop to make the updates efficient. You can also apply SGD to gradients obtained in other ways (from sampling, numerical approximation, etc.). Symmetrically, you can use other optimization methods with backprop, as can anything else that works with the gradient/Jacobian.
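A small sketch of that independence, using hypothetical toy names: the SGD update only needs *a* gradient, and it does not care whether that gradient came from backprop-style analytic differentiation or from a numerical finite-difference approximation.

```python
def sgd_step(params, grads, lr=0.1):
    # Plain SGD update: move each parameter against its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

def loss(params):
    w, b = params
    return (w * 3.0 + b - 1.0) ** 2          # toy quadratic objective

def analytic_grad(params):                    # what backprop would give us
    w, b = params
    r = w * 3.0 + b - 1.0
    return [2.0 * r * 3.0, 2.0 * r]

def numerical_grad(params, eps=1e-6):         # finite-difference approximation
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss(bumped) - loss(params)) / eps)
    return grads

params = [0.0, 0.0]
params = sgd_step(params, analytic_grad(params))    # gradient from differentiation
params = sgd_step(params, numerical_grad(params))   # gradient from approximation
```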
This common confusion comes from the fact that, for simplicity, people sometimes say "trained with backprop", which in fact means (if they do not specify the optimizer) "trained with SGD, using backprop to compute the gradient". In addition, in old tutorials you can find things like the "delta rule" and other somewhat confusing terms that describe exactly the same thing (since the neural network community was for a long time somewhat independent of the general optimization community).
So you have two levels of abstraction:
- gradient computation - this is where backprop comes into play
- the optimization level - where methods such as SGD, Adam, Rprop, BFGS, etc. are used, which (if they are first order or higher) consume the gradient computed above (see the sketch below)
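As a sketch of that separation (toy functions with hypothetical names): one routine computes the gradient, and any first-order optimizer can be plugged in on top of it without touching the gradient code.

```python
def grad(w):
    # Level 1: gradient of the toy loss L(w) = (w - 4)^2, as backprop would compute it.
    return 2.0 * (w - 4.0)

def sgd(w, g, lr=0.1):
    # Level 2, option A: plain SGD update.
    return w - lr * g

def sgd_momentum(w, g, state, lr=0.1, mu=0.9):
    # Level 2, option B: another first-order optimizer consuming the same gradient.
    state = mu * state + g
    return w - lr * state, state

w = 0.0
for _ in range(5):
    g = grad(w)       # gradient computation level
    w = sgd(w, g)     # optimization level; sgd_momentum could be swapped in here
print(w)
```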