Since the Adam optimizer keeps a pair of running averages of the gradients (the mean and the uncentered variance), I wonder how it should properly handle weight decay. I have seen two ways of implementing it.
1. Only update the mean/variance from the gradients of the objective loss, and decay the weights explicitly at each mini-batch. (The following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py )
    weight[:] -= lr * mean / (sqrt(variance) + self.epsilon)
    wd = self._get_wd(index)
    if wd > 0.:
        weight[:] -= (lr * wd) * weight
2. Update the mean/variance from the gradients of the objective loss + regularization loss, and update the weights as usual. (The following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210 )
    grad = scalar<DType>(param.rescale_grad) * grad +
           scalar<DType>(param.wd) * weight;
    // stuff
    Assign(out, req[0],
           weight -
           scalar<DType>(param.lr) * mean /
           (F<square_root>(var) + scalar<DType>(param.epsilon)));
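To make the difference concrete, here is a minimal NumPy sketch of a single Adam step under each convention. The names and defaults are illustrative rather than mxnet's actual code, and bias correction is included even though the fragments above omit it:

    import numpy as np

    def adam_step(weight, grad, mean, var, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                  wd=0.0, decoupled=True):
        # decoupled=True: decay the weights directly (approach 1).
        # decoupled=False: fold the L2 gradient into the moment estimates (approach 2).
        if not decoupled:
            grad = grad + wd * weight
        mean = beta1 * mean + (1.0 - beta1) * grad
        var = beta2 * var + (1.0 - beta2) * grad * grad
        m_hat = mean / (1.0 - beta1 ** t)  # bias correction
        v_hat = var / (1.0 - beta2 ** t)
        weight = weight - lr * m_hat / (np.sqrt(v_hat) + eps)
        if decoupled and wd > 0.0:
            # decay applied outside the adaptive rescaling
            weight = weight - lr * wd * weight
        return weight, mean, var

    # Example: one step on a 3-parameter "model".
    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    w, m, v = adam_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1, wd=0.01)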
These two approaches sometimes differ noticeably in the training results, and I actually think the first one makes more sense (I find that training gives different results from time to time). Caffe and the old version of mxnet follow the first approach, while the new version of mxnet follows the second one.
Appreciate your help!
For plain SGD, weight decay is just L2 regularization added to the loss. In other words, the following two calls give the same update:
    optimize(objective_loss, weight_decay=0.1)
    optimize(objective_loss + 0.1*L2_of_weights, weight_decay=0)
Adding the regularizer to the loss works for any penalty (L1, L2, or even L0), whereas a weight_decay parameter that shrinks the weights directly corresponds only to L2. With Adam, however, the two stop being equivalent: an L2 term added to the loss has its gradient rescaled by the running variance estimate, while a direct decay of the weights bypasses that rescaling entirely.
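A quick sanity check of that last point, with made-up numbers (plain NumPy; the "Adam-style" step below just divides by sqrt(var) and skips momentum and bias correction for brevity):

    import numpy as np

    w = np.array([1.0, -2.0, 0.5])
    g = np.array([0.3, 0.1, -0.2])   # gradient of the objective loss only
    lr, wd, eps = 0.1, 0.01, 1e-8

    # Plain SGD: explicit decay vs. L2 folded into the gradient -> identical updates.
    sgd_decay = w - lr * g - lr * wd * w
    sgd_l2 = w - lr * (g + wd * w)
    print(np.allclose(sgd_decay, sgd_l2))    # True

    # Adam-style rescaling by a running variance estimate: the folded-in L2 gradient
    # is divided by sqrt(var), the explicit decay is not -> different updates.
    var = np.array([0.09, 0.0001, 0.04])
    adam_decay = w - lr * g / (np.sqrt(var) + eps) - lr * wd * w
    adam_l2 = w - lr * (g + wd * w) / (np.sqrt(var) + eps)
    print(np.allclose(adam_decay, adam_l2))  # False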