Why should neural network weights be initialized with random numbers?

I am trying to build a neural network from scratch. There is consensus throughout the AI literature that the weights should be initialized with random numbers so that the network converges faster.

But why are the initial weights of a neural network initialized to random numbers?

I read somewhere that this is done to "break the symmetry" and that it makes the network learn faster. How does breaking the symmetry make it learn faster?

Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (positive or negative) faster?

Is there some other underlying philosophy behind randomizing the weights, apart from hoping that they will be close to their optimal values at initialization?

+84 | Nov 17 '13 at 5:34
artificial-intelligence machine-learning neural-network mathematical-optimization gradient-descent

5 answers

Breaking the symmetry is essential here, and not just for the sake of performance. Imagine the first 2 layers of a multilayer perceptron (the input and hidden layers):

[figure: an input layer fully connected to a hidden layer of a multilayer perceptron]

During forward propagation, each unit in the hidden layer gets the signal:

[formula: in_j = Σ_i (w_ij · x_i), the weighted sum of the inputs]

That is, each hidden unit gets the sum of the inputs multiplied by the corresponding weights.

Now imagine that you initialize all the weights to the same value (e.g. zero or one). In this case, each hidden unit will get exactly the same signal. E.g. if all the weights are initialized to 1, each unit gets a signal equal to the sum of the inputs (and outputs sigmoid(sum(inputs))). If all the weights are zeros, which is even worse, every hidden unit will get a zero signal. No matter what the input was: if all the weights are the same, all the units in the hidden layer will be the same too. And since their gradients are identical as well, every update keeps them identical, so they never learn different features.

This is the main issue with symmetry and the reason why you should initialize the weights randomly (or, at least, with different values). Note that this issue affects all architectures that use each-to-each (fully connected) connections.
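
Here is a minimal NumPy sketch of that effect (the layer sizes, the constant value 0.5 and the squared-error loss are assumptions made up for the demo): with a constant initialization every hidden unit computes the same activation and receives the same gradient, so the units stay identical after every update, while a random initialization breaks the tie.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)      # one input example (3 features)
    y = 1.0                     # its target

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_and_grads(W1, W2):
        # One forward/backward pass; returns hidden activations and dL/dW1.
        h = sigmoid(W1 @ x)                  # 4 hidden activations
        y_hat = W2 @ h                       # linear output
        d_out = y_hat - y                    # gradient of 0.5*(y_hat - y)**2 w.r.t. y_hat
        d_h = d_out * W2 * h * (1 - h)       # backprop through output weights and sigmoid
        dW1 = np.outer(d_h, x)               # gradient w.r.t. the first-layer weights
        return h, dW1

    # Constant initialization: every hidden unit is identical, and so is its gradient row.
    h_c, dW1_c = forward_and_grads(np.full((4, 3), 0.5), np.full(4, 0.5))
    print(h_c)      # four identical activations
    print(dW1_c)    # four identical gradient rows -> units stay identical after the update

    # Random initialization: units differ and can learn different features.
    W1_r = rng.normal(scale=0.1, size=(4, 3))
    W2_r = rng.normal(scale=0.1, size=4)
    h_r, dW1_r = forward_and_grads(W1_r, W2_r)
    print(h_r)      # four different activations
    print(dW1_r)    # four different gradient rows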

+111 | Nov 17 '13 at 10:55

Analogy:

Hopefully this is a good analogy. I have tried to explain it as simply as possible.

Imagine that someone has dropped you from a helicopter onto an unknown mountain top and you are trapped there. Everything is covered in fog. The only thing you know is that you should somehow get down to sea level. Which direction should you take to get down to the lowest possible point?

If you could not find a way down to sea level, the helicopter would pick you up again and drop you at the same mountain top. You would end up taking the same directions again, because you would be "initializing" yourself to the same starting position.

However, each time the helicopter drops you somewhere random on the mountain, you take different directions and steps. So there is a better chance of you reaching the lowest possible point.

This is what is meant by breaking the symmetry. The initialization is asymmetric (which is different each time), so you can find different solutions to the same problem.

In this analogy, where you land is the weights. So, with different weights, there is a better chance of reaching the lowest (or a lower) point.

In addition, it increases the entropy in the system, so the system can create more information to help you find lower points (local or global minima).


+56 | Nov 10 '16 at 10:53

The answer is pretty simple. The basic training algorithms are greedy in nature - they do not find the global optimum, but rather the "nearest" local solution. As a result, starting from any fixed initialization biases your solution towards one particular set of weights. If you initialize randomly (and possibly do so many times), then it is much less probable that you will get stuck in some weird part of the error surface.

The same argument applies to other algorithms that are unable to find a global optimum (k-means, EM, etc.), and it does not apply to global optimization techniques (e.g. the SMO algorithm for SVMs).
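
As a toy illustration of this greedy behaviour (the 1-D "error surface" below is made up purely for the demo), gradient descent from a fixed starting point always ends up in the same local minimum, while random restarts explore the surface and let you keep the best result:

    import numpy as np

    # A toy non-convex error surface with a shallow and a deep local minimum.
    f  = lambda x: 0.1 * x**4 - 0.5 * x**3 - x**2 + 2 * x
    df = lambda x: 0.4 * x**3 - 1.5 * x**2 - 2 * x + 2   # derivative of f

    def gradient_descent(x0, lr=0.01, steps=2000):
        # Plain (greedy) descent from the fixed starting point x0.
        x = x0
        for _ in range(steps):
            x -= lr * df(x)
        return x

    # A fixed initialization always lands in the same (here: the shallow) minimum.
    print(gradient_descent(0.0))

    # Random restarts: different starting points reach different minima;
    # keeping the best run makes getting stuck in a bad one far less likely.
    rng = np.random.default_rng(0)
    finals = [gradient_descent(rng.uniform(-4, 4)) for _ in range(10)]
    best = min(finals, key=f)
    print("best restart:", round(best, 3), "f(best) =", round(f(best), 3))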

+27 | Nov 17 '13 at 8:52

As you mentioned, the key point is breaking the symmetry. If you initialize all the weights to zero, then all the hidden neurons (units) in your neural network will be doing exactly the same calculations. This is not something we want, because we want different hidden units to compute different functions. That is not possible if you initialize them all to the same value.

0 | Feb 05 '19 at 13:11
  1. Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (positive or negative) faster?

  2. How does breaking the symmetry make it learn faster?

If you initialize all the weights to zero, then all the neurons of all the layers perform the same calculation and give the same output, which makes the whole deep network useless. If the weights are zero, the complexity of the whole deep network would be the same as that of a single neuron, and the predictions would be nothing better than random.

Nodes that sit side by side in a hidden layer and are connected to the same inputs must have different weights for the learning algorithm to update them differently.

By making the weights non-zero (but close to 0, like 0.1, etc.), the algorithm will learn the weights over the next iterations and won't get stuck. In this way the symmetry is broken.

  3. Is there any other underlying philosophy behind randomizing the weights, apart from hoping that they will be close to their optimal values at initialization?

Stochastic optimization algorithms such as stochastic gradient descent use randomness in selecting a starting point for the search and in the progression of the search.

The progression of the search, or the learning, of a neural network is referred to as convergence. Discovering a sub-optimal solution or local optimum leads to premature convergence.

Instead of relying on a single run, if you run your algorithm multiple times with different random weights, there is a better chance of finding the global optimum without getting stuck in a local optimum.

After 2015, thanks to advances in machine learning research, He-et-al initialization was introduced to replace plain (unscaled) random initialization:

    w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(2 / layer_size[l-1])

The weights are still random, but their scale varies depending on the size of the previous layer of neurons.
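
For context, here is a minimal runnable sketch of that line applied layer by layer (the layer sizes and the weights list are hypothetical, chosen only for the demonstration):

    import numpy as np

    np.random.seed(0)
    layer_size = [784, 256, 128, 10]   # hypothetical net: input, two hidden layers, output

    weights = []
    for l in range(1, len(layer_size)):
        # He initialization: a standard normal scaled by sqrt(2 / fan_in),
        # where fan_in = layer_size[l-1] is the number of neurons feeding layer l.
        w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(2 / layer_size[l-1])
        weights.append(w)

    for w in weights:
        print(w.shape, "std ~", round(float(w.std()), 3))   # spread shrinks as the previous layer grows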

So non-zero random weights help us to:

  1. Get out of bad local optima
  2. Break the symmetry
  3. Get closer to the global optimum over further iterations

References:

machinelearningmastery

towardsdatascience

0 | Mar 27 '19 at 18:54


