I like to do the following, though the only thing I know for sure is that some parameters are better left out of L2 regularization, such as batch norm parameters and biases. LSTMs contain one Bias tensor (conceptually it holds many biases, but they seem to be concatenated into one tensor for performance), and for batch normalization I add "noreg" to the variables' names so they are ignored too.
    loss = ...  # your regular output loss
    l2 = lambda_l2_reg * sum(
        tf.nn.l2_loss(tf_var)
        for tf_var in tf.trainable_variables()
        if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
    )
    loss += l2
Where lambda_l2_reg is a small multiplier, e.g. float(0.005).
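A minimal sketch of the same filtering logic, using NumPy instead of TensorFlow so it runs standalone; the variable names and values here are made up for illustration, and l2_loss mirrors the definition of tf.nn.l2_loss (sum of squares divided by two):

```python
import numpy as np

lambda_l2_reg = 0.005  # small multiplier, as above

# Stand-ins for tf.trainable_variables(): (name, array) pairs.
trainable_variables = [
    ("lstm/kernel", np.array([1.0, -2.0, 3.0])),
    ("lstm/Bias", np.array([0.5, 0.5])),          # skipped: bias tensor
    ("batch_norm_noreg/gamma", np.array([1.0])),  # skipped: "noreg" marker
]

def l2_loss(x):
    # Same definition as tf.nn.l2_loss: sum(x**2) / 2
    return float(np.sum(x ** 2) / 2.0)

l2 = lambda_l2_reg * sum(
    l2_loss(arr)
    for name, arr in trainable_variables
    if not ("noreg" in name or "Bias" in name)
)
print(l2)  # only the kernel contributes: 0.005 * (1 + 4 + 9) / 2 = 0.035
```

Only the kernel's squared norm ends up in the penalty; the bias and the "noreg"-tagged batch norm parameter are excluded by the if clause.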
Making this choice (that is, the full if in the loop, which discards some variables from the regularization) once made me jump from 0.879 F1 score to 0.890 in one shot of testing the code, without re-tuning the config's lambda value; this included both the changes for batch normalization and for biases (I had other biases elsewhere in the neural network).
According to this paper, regularizing the recurrent weights may help with exploding gradients.
Also, according to this other paper, dropout is better applied between stacked cells rather than inside cells, if you use it.
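A hypothetical illustration of that placement, with NumPy in place of TensorFlow: dropout is applied to a cell's output (i.e. between stacked cells) before feeding the next layer, rather than to the recurrent state inside a cell. The shapes and keep probability are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob):
    # Inverted dropout: zero units with prob (1 - keep_prob),
    # rescale survivors by 1 / keep_prob to preserve the expected value.
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

layer1_output = np.ones((2, 4))      # output of the first stacked cell
between_layers = dropout(layer1_output, keep_prob=0.5)
# between_layers is what the second stacked cell would receive;
# the recurrent state inside each cell is left untouched.
print(between_layers.shape)  # (2, 4)
```

With inputs of 1.0 and keep_prob=0.5, each surviving unit becomes 2.0 and the rest become 0.0.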
Regarding the exploding gradient problem: if you apply gradient clipping to a loss that already has the L2 regularization added to it, that regularization will also be taken into account during the clipping process.
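A small NumPy sketch of why that happens, under made-up values: the gradient of data_loss + lambda * ||w||^2 / 2 with respect to w is grad_data + lambda * w, so clipping the combined gradient by global norm (as tf.clip_by_global_norm would) also scales the regularization component:

```python
import numpy as np

lambda_l2_reg = 0.005
w = np.array([3.0, 4.0])            # hypothetical weight vector
grad_data = np.array([6.0, 8.0])    # hypothetical gradient of the data loss

# L2 term's gradient is already folded into the total gradient:
grad_total = grad_data + lambda_l2_reg * w

# Clip by global norm, like tf.clip_by_global_norm:
clip_norm = 5.0
norm = np.linalg.norm(grad_total)
if norm > clip_norm:
    grad_total = grad_total * (clip_norm / norm)

print(np.linalg.norm(grad_total))  # clipped down to 5.0
```

Both the data term and the regularization term get rescaled together, so the effective regularization strength shrinks whenever clipping kicks in.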
PS Here is the neural network I was working on: https://github.com/guillaume-chevalier/HAR-stacked-residual-bidir-LSTMs