How can I implement max-norm constraints in an MLP in TensorFlow?

How can I implement max-norm constraints on the weights of an MLP in TensorFlow? The kind that Hinton and Dean describe in their work on dark knowledge. That is, does tf.nn.dropout implement the weight constraints by default, or do we need to do it explicitly, as in

https://arxiv.org/pdf/1207.0580.pdf

“These networks all share weights for the hidden units that are present. We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division.”
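The renormalization the quote describes can be sketched in a few lines. This is a minimal numpy illustration of the math (not TensorFlow or the paper's code); `max_norm` is a hypothetical helper name, and the convention that each *column* of the weight matrix holds the incoming weights of one hidden unit is assumed:

```python
import numpy as np

def max_norm(weights, c):
    """Renormalize each column (the incoming weight vector of one
    hidden unit) so that its L2 norm is at most c."""
    # Column-wise L2 norms, shape (1, out_units) for broadcasting.
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    # Scale is 1 where the constraint already holds, c / norm where violated.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return weights * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])          # column norms: 5.0 and 0.5
W_clipped = max_norm(W, c=1.0)      # first column rescaled, second untouched
```

Note this is a constraint applied to the weights after each update, which is not quite the same thing as clipping gradients (discussed in the answers below).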

Keras seems to have this

http://keras.io/constraints/

2 answers

tf.nn.dropout does not impose any norm constraints. I believe you are looking for "processing gradients before applying them" using tf.clip_by_norm.

For example, instead of simply:

 # Create an optimizer; minimize() implicitly calls
 # compute_gradients() and apply_gradients().
 optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

You can:

 # Create an optimizer.
 optimizer = tf.train.GradientDescentOptimizer(learning_rate)

 # Compute the gradients for a list of variables.
 grads_and_vars = optimizer.compute_gradients(loss, [weights1, weights2, ...])

 # grads_and_vars is a list of (gradient, variable) tuples.
 # Do whatever you need to the 'gradient' part, for example cap the norms.
 capped_grads_and_vars = [(tf.clip_by_norm(gv[0], clip_norm=123.0, axes=0), gv[1])
                          for gv in grads_and_vars]

 # Ask the optimizer to apply the capped gradients.
 optimizer = optimizer.apply_gradients(capped_grads_and_vars)

Hope this helps. Some final remarks on the axes parameter of tf.clip_by_norm:

  • If you compute tf.nn.xw_plus_b(x, weights, biases), or equivalently matmul(x, weights) + biases, where x and weights have shapes (batch, in_units) and (in_units, out_units) respectively, then you probably want to set axes == [0] (because in this usage each column holds all the incoming weights of a specific hidden unit).
  • Pay attention to the shape/rank of your variables above, and to exactly how you want to clip_by_norm each of them! E.g. if some of [weights1, weights2, ...] are matrices and some are not, and you call clip_by_norm() on grads_and_vars with the same axes value for all of them, it does not mean the same thing for every variable! In fact, if you are lucky, this will result in a weird error like ValueError: Invalid reduction dimension 1 for input with 1 dimensions, but otherwise it is a very sneaky, silent bug.
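The difference the axes parameter makes can be illustrated with a small numpy sketch of tf.clip_by_norm's semantics (this is a hypothetical reimplementation for illustration, not TensorFlow's code): with axes=[0] each column is scaled independently, whereas with no axes the whole tensor is scaled as one vector.

```python
import numpy as np

def clip_by_norm(t, clip_norm, axes=None):
    """Numpy sketch of tf.clip_by_norm semantics: scale t so that its
    L2 norm along `axes` (or the whole tensor if axes is None) is at
    most clip_norm. Direction is preserved; only the length shrinks."""
    norm = np.sqrt(np.sum(t * t,
                          axis=None if axes is None else tuple(axes),
                          keepdims=axes is not None))
    return t * np.minimum(1.0, clip_norm / np.maximum(norm, 1e-12))

g = np.array([[3.0, 0.0],
              [4.0, 2.0]])                    # a (in_units, out_units) gradient
per_column = clip_by_norm(g, 1.0, axes=[0])   # each column norm capped at 1
whole      = clip_by_norm(g, 1.0)             # total norm capped at 1
```

With axes=[0] the two columns (norms 5.0 and 2.0) are each rescaled to norm 1; without axes the single scaling factor is determined by the full tensor norm.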

You can use tf.clip_by_value:

https://www.tensorflow.org/versions/r0.10/api_docs/python/train/gradient_clipping

Gradient clipping is also used to prevent exploding gradients in recurrent neural networks.
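For reference, tf.clip_by_value clips element-wise rather than by norm, so unlike clip_by_norm it does not preserve the gradient's direction. A minimal numpy sketch of that behavior (a hypothetical stand-in, not TensorFlow's implementation):

```python
import numpy as np

def clip_by_value(t, clip_value_min, clip_value_max):
    """Numpy sketch of tf.clip_by_value: clamp each element of t
    into the interval [clip_value_min, clip_value_max]."""
    return np.clip(t, clip_value_min, clip_value_max)

grad = np.array([-10.0, -0.5, 0.2, 7.0])
clipped = clip_by_value(grad, -1.0, 1.0)   # -> [-1.0, -0.5, 0.2, 1.0]
```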

