I suspect that they mean you should clip the gradient to [-1,1], not the loss function. Thus, you compute the gradient as usual, but then clamp each component of the gradient to the range [-1,1] (so if it is greater than +1, you replace it with +1; if it is less than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of the unmodified gradient.
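For concreteness, here is a minimal sketch of such a clipped update step (plain NumPy; the function name `clipped_gradient_step` and the learning rate are my own choices for illustration):

```python
import numpy as np

def clipped_gradient_step(params, grad, lr=0.01):
    """Clamp each gradient component to [-1, 1], then take an SGD step."""
    clipped = np.clip(grad, -1.0, 1.0)  # component-wise clamp
    return params - lr * clipped

params = np.array([0.3, -2.0, 5.0])
grad = np.array([0.4, -7.0, 3.5])           # raw gradient; some components fall outside [-1, 1]
print(clipped_gradient_step(params, grad))  # effective gradient used: [0.4, -1.0, 1.0]
```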
Equivalently: define a function f as follows:
f(x) = x^2        if x in [-0.5, 0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
Instead of using a loss function of the form s^2 (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between the squared loss and the absolute-value loss: it behaves like s^2 when s is small, but like the absolute value |s| when s becomes larger.
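As an illustration, here is a straightforward NumPy implementation of f (a sketch of mine, not code from the original paper):

```python
import numpy as np

def f(x):
    """Quadratic near zero, linear (absolute-value-like) in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 0.5, x**2, np.abs(x) - 0.25)

print(f(0.1))  # ~0.01 -- behaves like x^2 for small x
print(f(3.0))  # 2.75  -- behaves like |x| - 0.25 for large x
```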
Note that f has the nice property that its derivative always lies in the range [-1,1]:
f'(x) = 2x if x in [-0.5, 0.5]
f'(x) = +1 if x > 0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of the loss function based on f, the result is the same as computing the gradient of the squared loss and then clipping it to [-1,1].
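This can be checked numerically, e.g. by comparing a finite-difference gradient of f against the clipped gradient of the squared loss (a NumPy sketch; `numerical_grad` is a hypothetical helper of mine):

```python
import numpy as np

def f(x):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 0.5, x**2, np.abs(x) - 0.25)

def numerical_grad(fn, x, eps=1e-6):
    """Central finite-difference approximation of the derivative."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

s = np.array([-3.0, -0.4, 0.1, 0.7, 5.0])
grad_f = numerical_grad(f, s)                        # gradient of the f-based loss
grad_sq_clipped = np.clip(2 * s, -1.0, 1.0)          # gradient of s^2, then clipped
print(np.allclose(grad_f, grad_sq_clipped, atol=1e-5))  # True
```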
Thus, what they are effectively doing is replacing the squared loss with the Huber loss. The function f is just twice the Huber loss for delta = 0.5.
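Assuming the standard definition of the Huber loss, L_delta(x) = 0.5*x^2 for |x| <= delta and delta*(|x| - 0.5*delta) otherwise, this relationship is easy to verify numerically:

```python
import numpy as np

def f(x):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 0.5, x**2, np.abs(x) - 0.25)

def huber(x, delta):
    """Standard Huber loss: 0.5*x^2 for |x| <= delta, delta*(|x| - 0.5*delta) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta, 0.5 * x**2, delta * (np.abs(x) - 0.5 * delta))

x = np.linspace(-3, 3, 101)
print(np.allclose(f(x), 2 * huber(x, delta=0.5)))  # True
```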
The point is that the following two alternatives are equivalent:
1. Use the squared loss function. Compute the gradient of this loss, but clip the gradient to [-1,1] before performing the gradient descent update step.
2. Use the Huber loss function instead of the squared loss. Compute the gradient of this loss directly (without modification) in gradient descent.
The former is easy to implement. The latter has good properties (it improves stability, and it is better than the absolute-value loss because it avoids oscillating around the minimum). Since the two are equivalent, we get an easily implemented scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss. A sketch of this equivalence follows.
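If PyTorch is available, the equivalence of the two alternatives can be demonstrated directly with autograd (a sketch of mine; the tensor values are arbitrary):

```python
import torch

def f(x):
    # Twice the Huber loss with delta = 0.5, as defined above.
    return torch.where(x.abs() <= 0.5, x**2, x.abs() - 0.25)

s = torch.tensor([-2.0, -0.3, 0.2, 1.5], requires_grad=True)

# Alternative 1: squared loss, then clamp the gradient components to [-1, 1].
loss1 = (s**2).sum()
g1 = torch.autograd.grad(loss1, s)[0].clamp(-1.0, 1.0)

# Alternative 2: f-based (Huber-style) loss, unmodified gradient.
loss2 = f(s).sum()
g2 = torch.autograd.grad(loss2, s)[0]

print(torch.allclose(g1, g2))  # True: both alternatives take the same descent step
```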