Note that autoencoders are trying to learn a non-trivial approximation of the identity function, not the identity function itself; otherwise they would not be useful at all. Pre-training helps move the weight vectors toward a good starting region on the error surface. Then the backpropagation algorithm, which essentially performs gradient descent, is used to refine those weights. Keep in mind that plain gradient descent can get stuck in poor local minima.
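
To make this concrete, here is a minimal sketch (not from the original answer; it assumes only NumPy, and the layer sizes and toy data are illustrative) of an undercomplete autoencoder: the 2-unit bottleneck forces it to learn a compressed code rather than the trivial identity mapping, and the training loop is exactly the plain gradient descent on reconstruction error that can get stuck in poor local minima.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # toy data: 500 samples, 8 features

n_in, n_hidden = 8, 2                    # bottleneck (2) < input (8) => identity is impossible
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder weights
b2 = np.zeros(n_in)

lr = 0.1
for step in range(2000):
    # forward pass: encode, then try to reconstruct the input
    H = np.tanh(X @ W1 + b1)             # hidden code, shape (500, 2)
    X_hat = H @ W2 + b2                  # reconstruction, shape (500, 8)
    err = X_hat - X
    loss = np.mean(err ** 2)

    # backpropagation: plain gradient descent on the reconstruction error
    dX_hat = 2 * err / err.size          # d(loss) / d(X_hat)
    dW2 = H.T @ dX_hat
    db2 = dX_hat.sum(axis=0)
    dH = dX_hat @ W2.T
    dZ1 = dH * (1 - H ** 2)              # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 500 == 0:
        print(f"step {step}: reconstruction loss {loss:.4f}")
```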

[Ignore the label "Global Minima" in the linked image and think of it as just another, better, local minimum.]
Intuitively, suppose you are looking for the best way to get from a source A to a destination B. A map with no routes marked on it (the errors you get at the output layer of the neural network) tells you roughly where to go, but the route you end up taking may have many obstacles and ups and downs. Now suppose someone who has already made the trip tells you which direction they went (pre-training) and gives you a new map (the starting point that the pre-training phase provides).
This may be an intuitive reason why starting from random weights and immediately optimizing the model with backpropagation does not necessarily give you the performance you get with a pre-trained model. However, note that many models that achieve state-of-the-art results do not use pre-training at all; instead, they use backpropagation in combination with other optimization methods (for example, Adagrad, RMSProp, Momentum, and others) in the hope of avoiding getting stuck in bad local minima.
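
For reference, here is a minimal sketch (again illustrative and assuming NumPy; the parameter values are arbitrary defaults) of the update rules behind the optimizers named above. Each one rescales or smooths the raw gradient step, which in practice helps training make progress through difficult regions of the error surface.

```python
import numpy as np

def sgd_momentum(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad             # accumulate a running "velocity"
    return w + velocity, velocity

def adagrad(w, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2                          # accumulate squared gradients
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def rmsprop(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2    # exponential moving average of squared gradients
    return w - lr * grad / (np.sqrt(cache) + eps), cache

# usage: start from zero optimizer state and apply one update to a toy parameter vector
w = np.array([1.0, -2.0]); grad = np.array([0.3, -0.1])
w, v = sgd_momentum(w, grad, velocity=np.zeros_like(w))
```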

Amir