Short answer: do not use Adadelta
Very few people use it today; instead, stick to:
- tf.train.MomentumOptimizer with a momentum of 0.9 is very standard and works well. The disadvantage is that you have to search for the best learning rate yourself.
- tf.train.RMSPropOptimizer: results are less dependent on a good learning rate. This algorithm is very similar to Adadelta, but in my opinion it works better.
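For reference, a minimal sketch of how these two would be constructed; the learning rates below are just common starting points, not values from the question:

    # Momentum: standard and solid, but you have to tune the learning rate yourself
    momentum_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

    # RMSProp: less sensitive to the exact learning rate
    rmsprop_opt = tf.train.RMSPropOptimizer(learning_rate=0.001)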
If you really want to use Adadelta, use the parameters from the paper: learning_rate=1., rho=0.95, epsilon=1e-6. A larger epsilon helps at the start, but be prepared to wait a bit longer than with other optimizers to see convergence.
Note that in the paper they do not even use a learning rate, which is equivalent to keeping it equal to 1.
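For clarity, the same settings spelled out with the keyword arguments of tf.train.AdadeltaOptimizer:

    # Adadelta with the hyperparameters from the paper
    optimizer = tf.train.AdadeltaOptimizer(learning_rate=1., rho=0.95, epsilon=1e-6)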
Long answer
Adadelta has a very slow start. The full algorithm is Algorithm 1 in the ADADELTA paper (Zeiler, 2012).

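In case the figure from the paper does not come through, here is a rough sketch of one Adadelta step for a single scalar parameter, following the paper's update rules (the variable names are my own):

    # rho and epsilon as in the paper: 0.95 and 1e-6
    def adadelta_step(x, g, accum, accum_update, rho=0.95, epsilon=1e-6):
        # 1. Accumulate a running average of squared gradients
        accum = rho * accum + (1 - rho) * g * g
        # 2. Scale the gradient by the ratio of the two running RMS values
        update = -((accum_update + epsilon) ** 0.5 / (accum + epsilon) ** 0.5) * g
        # 3. Accumulate a running average of squared updates
        accum_update = rho * accum_update + (1 - rho) * update * update
        # 4. Apply the update
        x = x + update
        return x, accum, accum_update

Both accumulators start at zero.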
The problem is the running average of squared updates (accum_update in the sketch above):
- At step 0, this running average is zero, so the first update is very small.
- Since the first update is very small, the running average of updates stays very small at the beginning, which is something of a vicious circle.
I think Adadelta works better with larger networks than yours, and after some iterations it should match the performance of RMSProp or Adam.
Here is my code to experiment a bit with the Adadelta optimizer:
    import tensorflow as tf

    v = tf.Variable(10.)
    loss = v * v

    optimizer = tf.train.AdadeltaOptimizer(1., 0.95, 1e-6)
    train_op = optimizer.minimize(loss)

    # Internal state the optimizer keeps for v
    accum = optimizer.get_slot(v, "accum")                # running average of squared gradients
    accum_update = optimizer.get_slot(v, "accum_update")  # running average of squared updates

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(train_op)
        print(sess.run([v, accum, accum_update]))
The first 10 lines:
    v        accum      accum_update
    9.994    20.000     0.000001
    9.988    38.975     0.000002
    9.983    56.979     0.000003
    9.978    74.061     0.000004
    9.973    90.270     0.000005
    9.968    105.648    0.000006
    9.963    120.237    0.000006
    9.958    134.077    0.000007
    9.953    147.205    0.000008
    9.948    159.658    0.000009
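To see how slow this start really is, you can swap the optimizer in the same script for plain gradient descent (the learning rate 0.1 below is just an arbitrary choice for this toy loss):

    # Same toy problem with plain gradient descent for comparison
    optimizer = tf.train.GradientDescentOptimizer(0.1)
    train_op = optimizer.minimize(loss)

With that, v drops toward 0 within a few dozen steps, while Adadelta with the settings above has barely moved after 10 iterations.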