What is the right weight decay method for the Adam optimizer?

Since the Adam optimizer keeps a pair of running averages of the gradients (the mean and the uncentered variance), I wonder how it should handle weight decay correctly. I have seen two ways to implement it.

  • Update the running mean/variance from the gradient of the objective loss only, and decay the weights explicitly in each mini-batch. (the following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py )

    weight[:] -= lr*mean/(sqrt(variance) + self.epsilon)
    
    wd = self._get_wd(index)
    if wd > 0.:
        weight[:] -= (lr * wd) * weight
    
  • Update the running mean/variance from the gradient of the objective loss plus the regularization loss, and update the weights as usual. (the following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210 )

    grad = scalar<DType>(param.rescale_grad) * grad +
           scalar<DType>(param.wd) * weight;
    // stuff
    Assign(out, req[0],
           weight -
           scalar<DType>(param.lr) * mean /
           (F<square_root>(var) + scalar<DType>(param.epsilon)));
    

Which of the two is correct, or are they equivalent? Caffe and the old version of mxnet follow the first approach, while the new version of mxnet follows the second.
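To make the difference concrete, here is a stripped-down NumPy sketch of a single Adam step under both schemes. This is my own simplification, not code from either framework: the function name and the `decoupled` flag are invented for illustration, and it includes bias correction, which the snippets above omit.

    import numpy as np

    def adam_step(weight, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, wd=0.0, decoupled=True):
        # decoupled=True  -> approach 1: decay the weights explicitly
        # decoupled=False -> approach 2: fold wd*weight into the gradient
        if not decoupled:
            grad = grad + wd * weight           # L2 term passes through mean/variance
        m = beta1 * m + (1 - beta1) * grad      # running mean of the gradient
        v = beta2 * v + (1 - beta2) * grad**2   # running uncentered variance
        m_hat = m / (1 - beta1**t)              # bias correction (t starts at 1)
        v_hat = v / (1 - beta2**t)
        weight = weight - lr * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled:
            weight = weight - lr * wd * weight  # explicit decay outside the adaptive step
        return weight, m, v

The only difference between the two branches is whether wd * weight is divided by sqrt(v_hat) or applied to the weights directly.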

Thanks!


Weight decay is not the same thing as L2 regularization when you use Adam. In other words, the following two calls are not equivalent:

optimize(objective_loss, weight_decay=0.1)

optimize(objective_loss + 0.1*L2_of_weights, weight_decay=0)

A regularization term in the loss can be anything (L1, L2 or L0), and it is rescaled by Adam's running mean/variance together with the rest of the gradient, whereas the weight_decay step shrinks the weights directly; the two only coincide for plain SGD with an L2 penalty. For Adam they produce different updates, so the two code paths you quote are not interchangeable.
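A toy calculation of my own (single weight, single step, where Adam's bias-corrected averages reduce to the current gradient, so the step is lr * g / (|g| + eps)) shows the difference:

    # Toy numbers: one weight w, objective gradient g, lr = wd = 0.1.
    w, g, lr, wd, eps = 1.0, 0.5, 0.1, 0.1, 1e-8

    # Plain SGD: weight decay and an L2 penalty in the loss give the same update.
    sgd_decay = w - lr * g - lr * wd * w   # optimize(objective_loss, weight_decay=0.1)
    sgd_l2    = w - lr * (g + wd * w)      # optimize(objective_loss + 0.1*L2_of_weights, weight_decay=0)
    print(sgd_decay, sgd_l2)               # both ~0.94: mathematically identical

    # First Adam step: m_hat = g, sqrt(v_hat) = |g|, so the L2 term is largely
    # normalized away by the adaptive denominator and the updates differ.
    adam_decay = w - lr * g / (abs(g) + eps) - lr * wd * w
    g_reg      = g + wd * w
    adam_l2    = w - lr * g_reg / (abs(g_reg) + eps)
    print(adam_decay, adam_l2)             # ~0.89 vs ~0.90: different

Intuitively, whatever penalty is folded into the loss gets divided by sqrt(v_hat), so weights with large gradient variance are decayed less; the explicit weight_decay step shrinks every weight at the same relative rate.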

