Loss clipping in TensorFlow (on DeepMind's DQN)

I am trying to reproduce the DeepMind DQN paper on my own in TensorFlow, and I am finding it difficult to clip the loss function.

Here is the excerpt from the Nature paper describing the clipping:

We also found it helpful to clip the error term from the update to be between -1 and 1. Because the absolute value loss function |x| has a derivative of -1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between -1 and 1 corresponds to using an absolute value loss function for errors outside of the (-1,1) interval. This form of error clipping further improved the stability of the algorithm.

(Link to the full paper: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html )

I tried using

clipped_loss_vec = tf.clip_by_value(loss, -1, 1) 

to clip the loss I compute to between -1 and +1. With this in place, the agent fails to learn the correct policy. I printed out the network's gradients and realized that once the loss falls below -1, all the gradients suddenly turn to 0!

My explanation is that the clipped loss is a constant function on (-inf, -1) U (1, inf), which means it has zero gradient in those regions. That in turn makes the gradients throughout the network zero (intuitively: whatever input image I feed the network, the loss stays at -1 in a local neighborhood because it has been clipped).
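As a minimal sketch of what I am seeing (TF1-style graph mode, with made-up scalar values), the gradient of the clipped loss is exactly zero once the unclipped loss leaves [-1, 1]:

    import tensorflow as tf

    q_pred = tf.Variable(3.0)             # made-up prediction
    q_target = tf.constant(0.0)           # made-up target
    loss = tf.square(q_pred - q_target)   # squared error = 9.0 here
    clipped_loss = tf.clip_by_value(loss, -1.0, 1.0)  # saturates at 1.0

    grad = tf.gradients(clipped_loss, [q_pred])[0]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(grad))  # prints 0.0 -- the clipped loss is flat here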

So my question has two parts:

  • What exactly does DeepMind mean in the excerpt? Do they mean that a loss below -1 is clipped to -1 and a loss above +1 is clipped to +1? If so, how do they deal with the gradients (i.e., what is the part about the absolute value loss function)?

  • How should I implement the loss clipping in TensorFlow so that the gradients do not vanish outside the range (but can possibly saturate at +1 and -1)? Thanks!

+6
source
4 answers

I suspect they mean that you should clip the gradient to [-1, 1], not the loss function. Thus, you compute the gradient as usual, then clamp each component of the gradient to the range [-1, 1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of the unmodified gradient.

Equivalent: Define function f as follows:

    f(x) = x^2          if x in [-0.5, 0.5]
    f(x) = |x| - 0.25   if x < -0.5 or x > 0.5

Instead of using a loss of the form s^2 (where s is some complicated expression), they suggest using f(s) as the loss. This is a kind of hybrid between the squared loss and the absolute value loss: it behaves like s^2 when s is small, but like the absolute value |s| when s grows larger.

Note that the derivative of f has the nice property of always lying in the range [-1, 1]:

    f'(x) = 2x   if x in [-0.5, 0.5]
    f'(x) = +1   if x > +0.5
    f'(x) = -1   if x < -0.5

Thus, when you take the gradient of this f-based loss, the result is the same as computing the gradient of the squared loss and then clipping it.

So what they are doing is, effectively, replacing the squared loss with the Huber loss. The function f is simply twice the Huber loss for delta = 0.5.
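As a quick check against the standard definition (my arithmetic, using the usual parameterization of the Huber loss with parameter delta):

    H_delta(x) = x^2 / 2                   if |x| <= delta
    H_delta(x) = delta * (|x| - delta/2)   otherwise

Plugging in delta = 0.5 and doubling gives 2*H(x) = x^2 on [-0.5, 0.5] and 2*H(x) = |x| - 0.25 outside it, which is exactly f.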

Now, the point is that the following two alternatives are equivalent:

  • Use the squared loss function. Compute the gradient of this loss, but clip the gradient to [-1, 1] before performing the gradient descent update step.

  • Use the Huber loss function instead of the squared loss function. Compute the gradient of this loss and use it directly (unmodified) in gradient descent.

The former is easy to implement. The latter has nice properties (it improves stability and is better than the absolute value loss, since it avoids oscillating around the minimum). Because the two are equivalent, we get an easily implemented scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss. A sketch of the first alternative follows.
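Here is a minimal TF1-style sketch of the first alternative, assuming loss is your squared-loss tensor; the optimizer and learning rate are illustrative choices, not prescribed here:

    import tensorflow as tf

    optimizer = tf.train.RMSPropOptimizer(learning_rate=0.00025)
    grads_and_vars = optimizer.compute_gradients(loss)
    # Clamp each gradient component to [-1, 1] before the update step.
    clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
               for g, v in grads_and_vars if g is not None]
    train_op = optimizer.apply_gradients(clipped)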

+4
source
  • No. They are actually talking about clipping the error term, not clipping the loss; as far as I can tell the two often get conflated, which causes confusion. They do NOT mean that a loss below -1 is clipped to -1 and a loss above +1 is clipped to +1, because that leads to zero gradients outside the error range [-1, 1], as you worked out. Instead, they suggest using a linear loss in place of the quadratic loss for error values < -1 and error values > 1.

  • Compute the error term r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i). If this error value is within [-1, 1], square it; if it is < -1, multiply it by -1; if it is > 1, leave it as is. If you use this as the loss function, the gradients outside the interval [-1, 1] will not vanish.

To get a "smooth" piecewise loss function, you can also replace the quadratic loss outside the error range [-1, 1] with its first-order Taylor expansion at the boundaries -1 and 1. In that case, if e is your error value, you would square it when e is in [-1, 1], replace it with -2e - 1 when e < -1, and replace it with 2e - 1 when e > 1.
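A sketch of this smooth variant in TensorFlow, with the error written as self.q - self.q_targets_ to borrow the notation from the question (the names are illustrative):

    import tensorflow as tf

    e = self.q - self.q_targets_      # TD error (illustrative names)
    quadratic = tf.square(e)          # e^2 inside [-1, 1]
    linear = 2.0 * tf.abs(e) - 1.0    # 2e - 1 for e > 1, -2e - 1 for e < -1
    loss = tf.reduce_mean(tf.where(tf.abs(e) <= 1.0, quadratic, linear))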

+1
source

First of all, the code for the paper is available online, which is an invaluable reference.

Part 1

If you look at the code, you will see that in nql:getQUpdate ( NeuralQLearner.lua , line 180) they clip the error term of the Q-learning update:

    -- delta = r + (1-terminal) * gamma * max_a Q(s2, a) - Q(s, a)
    if self.clip_delta then
        delta[delta:ge(self.clip_delta)] = self.clip_delta
        delta[delta:le(-self.clip_delta)] = -self.clip_delta
    end

Part 2

In TensorFlow, assuming the last layer of your neural network is called self.output , self.actions is a one-hot encoding of the actions, self.q_targets_ is a placeholder holding the targets, and self.q is your computed Q-values:

    # The loss function
    one = tf.Variable(1.0)
    delta = self.q - self.q_targets_
    absolute_delta = tf.abs(delta)
    delta = tf.where(
        absolute_delta < one,
        tf.square(delta),
        tf.ones_like(delta)  # squared error: (-1)^2 = 1
    )

Or using tf.clip_by_value (which keeps the implementation closer to the original):

    delta = tf.clip_by_value(
        self.q - self.q_targets_,
        -1.0, +1.0
    )
+1
source
  • In the DeepMind paper you link to, they bound the gradient of the loss. This prevents huge gradients and thereby improves robustness. They do it by using a quadratic loss function for errors within a small range and the absolute value loss for larger errors.
  • I suggest using the Huber loss function . Below is a Python TensorFlow implementation.

    import tensorflow as tf

    def huber_loss(y_true, y_pred, max_grad=1.):
        """Calculates the Huber loss.

        Parameters
        ----------
        y_true: np.array, tf.Tensor
            Target value.
        y_pred: np.array, tf.Tensor
            Predicted value.
        max_grad: float, optional
            Positive floating point value. Represents the
            maximum possible gradient magnitude.

        Returns
        -------
        tf.Tensor
            The Huber loss.
        """
        err = tf.abs(y_true - y_pred, name='abs')
        mg = tf.constant(max_grad, name='max_grad')
        lin = mg * (err - 0.5 * mg)
        quad = 0.5 * err * err
        return tf.where(err < mg, quad, lin)
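A hypothetical usage sketch (the placeholder and variable names here are stand-ins for your own target and prediction tensors, and the optimizer choice is illustrative):

    q_targets_ = tf.placeholder(tf.float32, [None], name='q_targets')
    q_pred = tf.Variable([0.0, 0.0], name='q_pred')  # stand-in for the network output
    loss = tf.reduce_mean(huber_loss(q_targets_, q_pred, max_grad=1.0))
    train_op = tf.train.RMSPropOptimizer(0.00025).minimize(loss)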
0
source
