Since the Adam optimizer keeps a pair of running averages of the gradients (the mean and the uncentered variance), I wonder how it should properly handle weight decay. I have seen two ways of implementing it.
1. Only update the mean/variance from the gradients of the objective loss, and decay the weights explicitly at each mini-batch. (The following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py )
    weight[:] -= lr * mean / (sqrt(variance) + self.epsilon)
    wd = self._get_wd(index)
    if wd > 0.:
        weight[:] -= (lr * wd) * weight
2. Update the mean/variance from the gradients of the objective loss + regularization loss, and update the weights as usual. (The following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210 )
    grad = scalar<DType>(param.rescale_grad) * grad +
           scalar<DType>(param.wd) * weight;
    // stuff
    Assign(out, req[0],
           weight -
           scalar<DType>(param.lr) * mean /
           (F<square_root>(var) + scalar<DType>(param.epsilon)));
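To make the difference concrete, here is a minimal NumPy sketch of a single Adam step under each convention. The names and defaults are illustrative rather than mxnet's actual code, and bias correction is included even though the fragments above omit it:

    import numpy as np

    def adam_step(weight, grad, mean, var, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                  wd=0.0, decoupled=True):
        # decoupled=True: decay the weights directly (approach 1).
        # decoupled=False: fold the L2 gradient into the moment estimates (approach 2).
        if not decoupled:
            grad = grad + wd * weight
        mean = beta1 * mean + (1.0 - beta1) * grad
        var = beta2 * var + (1.0 - beta2) * grad * grad
        m_hat = mean / (1.0 - beta1 ** t)  # bias correction
        v_hat = var / (1.0 - beta2 ** t)
        weight = weight - lr * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled and wd > 0.0:
            # decay applied outside the adaptive rescaling
            weight = weight - lr * wd * weight
        return weight, mean, var

    # Example: one step on a 3-parameter "model".
    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    w, m, v = adam_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1, wd=0.01)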
These two approaches sometimes differ noticeably in the training results, and I actually think the first one makes more sense (I find that training gives different results from time to time). Caffe and the old version of mxnet follow the first approach, while the new version of mxnet follows the second one.
Appreciate your help!
For plain SGD, weight decay is just L2 regularization added to the loss. In other words, the following two calls give the same update:
    optimize(objective_loss, weight_decay=0.1)
    optimize(objective_loss + 0.1*L2_of_weights, weight_decay=0)
Adding the regularizer to the loss works for any penalty (L1, L2, or even L0), whereas a weight_decay parameter that shrinks the weights directly corresponds only to L2. With Adam, however, the two stop being equivalent: an L2 term added to the loss has its gradient rescaled by the running variance estimate, while a direct decay of the weights bypasses that rescaling entirely.
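A quick sanity check of that last point, with made-up numbers (plain NumPy; the "Adam-style" step below just divides by sqrt(var) and skips momentum and bias correction for brevity):

    import numpy as np

    w = np.array([1.0, -2.0, 0.5])
    g = np.array([0.3, 0.1, -0.2])   # gradient of the objective loss only
    lr, wd, eps = 0.1, 0.01, 1e-8

    # Plain SGD: explicit decay vs. L2 folded into the gradient -> identical updates.
    sgd_decay = w - lr * g - lr * wd * w
    sgd_l2 = w - lr * (g + wd * w)
    print(np.allclose(sgd_decay, sgd_l2))    # True

    # Adam-style rescaling by a running variance estimate: the folded-in L2 gradient
    # is divided by sqrt(var), the explicit decay is not -> different updates.
    var = np.array([0.09, 0.0001, 0.04])
    adam_decay = w - lr * g / (np.sqrt(var) + eps) - lr * wd * w
    adam_l2 = w - lr * (g + wd * w) / (np.sqrt(var) + eps)
    print(np.allclose(adam_decay, adam_l2))  # False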