Regularization for LSTM in TensorFlow

TensorFlow offers a nice LSTM wrapper:

 rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None, state_is_tuple=False, activation=tanh) 

I would like to use regularization, say L2 regularization. However, I do not have direct access to the different weight matrices used in the LSTM cell, so I cannot explicitly do something like

 loss = something + beta * tf.reduce_sum(tf.nn.l2_loss(weights)) 

Is there a way to access matrices or somehow use regularization using LSTM?

+6
3 answers

tf.trainable_variables gives you a list of Variable objects that you can use to add an L2 regularization term. Note that this adds regularization to all variables in your model. If you want to restrict the L2 term to only a subset of the weights, you can use a name_scope to create your variables with specific prefixes, and later use that to filter the list returned by tf.trainable_variables .
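As a framework-free sketch of that filtering idea (the variable names, the "lstm/" scope prefix, and the beta value below are all made up for illustration), keeping only variables whose names carry a chosen scope prefix and summing their L2 penalties would look like this:

```python
# Sketch of scope-based L2 filtering; names and values are hypothetical.
# Each entry mimics (tf_var.name, flattened weight values).
trainable_variables = [
    ("lstm/weights:0", [0.5, -0.5]),
    ("lstm/Bias:0", [0.1]),
    ("output/weights:0", [1.0]),
]

def l2_loss(values):
    # Same definition as tf.nn.l2_loss: sum(v ** 2) / 2.
    return sum(v * v for v in values) / 2.0

beta = 0.01
# Keep only variables under the "lstm/" scope, as name_scope filtering would.
penalty = beta * sum(l2_loss(vals)
                     for name, vals in trainable_variables
                     if name.startswith("lstm/"))
```

The same `startswith` test is what a name_scope-based filter reduces to, since TensorFlow variable names are prefixed with their enclosing scopes.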

+9

I like to do the following, though the only thing I know is that some parameters prefer not to be regularized with L2, such as batch normalization parameters and biases. LSTMs contain one Bias tensor (conceptually it has many biases, but they seem to be concatenated, or something, for performance), and for batch normalization I add "noreg" to the variables' names so they get ignored.

 loss = your_regular_output_loss
 l2 = lambda_l2_reg * sum(
     tf.nn.l2_loss(tf_var)
     for tf_var in tf.trainable_variables()
     if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
 )
 loss += l2

Where lambda_l2_reg is a small factor, for example 0.005.

Making this choice (that is, the full if condition in the loop, which excludes some variables from regularization) once made me jump from 0.879 F1 to 0.890 in a single test run of the code, without retuning the config's lambda value. This covered both the batch normalization parameters and the biases, and I had other biases elsewhere in the neural network.
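The name-based exclusion used in the answer above can be sketched without TensorFlow; the variable names here are hypothetical, chosen only to show which ones the filter keeps:

```python
def should_regularize(name):
    # Skip variables tagged "noreg" (e.g. batch norm parameters) and the
    # LSTM's Bias tensor, mirroring the `if` condition in the loss sum.
    return not ("noreg" in name or "Bias" in name)

names = [
    "rnn/lstm_cell/kernel:0",   # regularized
    "rnn/lstm_cell/Bias:0",     # excluded: bias
    "bn_gamma_noreg:0",         # excluded: tagged "noreg"
    "dense/weights:0",          # regularized
]
kept = [n for n in names if should_regularize(n)]
```

Only the kernel and dense weights survive the filter, so only they contribute to the L2 term.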

According to this paper , regularizing the recurrent weights may help with exploding gradients.

Also, according to this other paper , gradient clipping would better be done between stacked cells rather than within cells, if you use it.

Regarding the exploding gradient issue: if you clip the gradients of a loss to which the L2 regularization has already been added, that regularization will also be taken into account during the clipping process.


PS Here is the neural network I was working on: https://github.com/guillaume-chevalier/HAR-stacked-residual-bidir-LSTMs

+6

TensorFlow has some built-in and helper functions that let you apply L2 norms to your model, such as tf.clip_by_global_norm , which clips a list of gradients by their global L2 norm:

  # ^^^ define your LSTM above here ^^^
  params = tf.trainable_variables()
  gradients = tf.gradients(self.losses, params)
  clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
  self.gradient_norms = norm
  opt = tf.train.GradientDescentOptimizer(self.learning_rate)
  self.updates = opt.apply_gradients(
      zip(clipped_gradients, params), global_step=self.global_step)

In your training step:

  outputs = session.run([self.updates, self.gradient_norms, self.losses], input_feed) 
0
