Adding an intermediate hidden layer makes the TensorFlow optimizer stop working

This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:

import tensorflow as tf
import numpy as np

initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1

I = tf.placeholder(tf.float32, shape=[None, 1], name='I')  # input
W = tf.get_variable('W', shape=[1, DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True)  # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True)  # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O')  # activation / output

#W1 = tf.get_variable('W1', shape=[DEPTH_1, DEPTH_1], initializer=initia, dtype=tf.float32)  # weights
#b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32)  # biases
#O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1, OUT_DEPTH], initializer=initia, dtype=tf.float32)  # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32)  # biases
O2 = tf.matmul(O, W2) + b2

O2_0 = tf.gather_nd(O2, [[0, 0]])
estimate0 = 2.0 * O2_0
eval_inp = tf.gather_nd(I, [[0, 0]])

k = 1e-5
L = 5.0
distance = tf.reduce_sum(tf.square(eval_inp - estimate0))

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b,
                                                  #W1, b1,
                                                  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(10000):
        print(sess.run([train_op, I, W, distance],
                       feed_dict={I: 2.0 * np.random.rand(1, 1) - 1.0}))
    for i in range(10):
        print(sess.run([eval_inp, W, estimate0],
                       feed_dict={I: 2.0 * np.random.rand(1, 1) - 1.0}))

However, when I uncomment the intermediate hidden layer and train the resulting network, I see that the weights no longer evolve:

import tensorflow as tf
import numpy as np

initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1

I = tf.placeholder(tf.float32, shape=[None, 1], name='I')  # input
W = tf.get_variable('W', shape=[1, DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True)  # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True)  # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O')  # activation / output

W1 = tf.get_variable('W1', shape=[DEPTH_1, DEPTH_1], initializer=initia, dtype=tf.float32)  # weights
b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32)  # biases
O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1, OUT_DEPTH], initializer=initia, dtype=tf.float32)  # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32)  # biases
O2 = tf.matmul(O1, W2) + b2

O2_0 = tf.gather_nd(O2, [[0, 0]])
estimate0 = 2.0 * O2_0
eval_inp = tf.gather_nd(I, [[0, 0]])
distance = tf.reduce_sum(tf.square(eval_inp - estimate0))

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, W1, b1, W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(10000):
        print(sess.run([train_op, I, W, distance],
                       feed_dict={I: 2.0 * np.random.rand(1, 1) - 1.0}))
    for i in range(10):
        print(sess.run([eval_inp, W, estimate0],
                       feed_dict={I: 2.0 * np.random.rand(1, 1) - 1.0}))

The estimate estimate0 quickly converges to some fixed value that is independent of the input signal. I have no idea why this is happening.

Question:

Any idea what could be wrong with the second example?

python deep-learning machine-learning tensorflow autoencoder
1 answer

TL;DR: the deeper the neural network gets, the more attention you should pay to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.


Problem analysis

I added tensor summaries for the variables and gradients to both of your scripts and got the following:

[Histograms: two-layer network]

[Histograms: three-layer network]

The charts show the distributions of the first-layer variable W:0 and how they change from step 0 to step 1000. Indeed, we can see that the rate of change is much higher in the two-layer network. But I'd like to draw attention to the gradient distribution, which is much more tightly concentrated around 0 in the three-layer network (the first variance is around 0.005, the second is around 0.000002, i.e. about 1000 times smaller). This is the vanishing gradient problem.
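The order of magnitude of that shrinkage can be reproduced with a back-of-envelope numpy sketch (illustrative only, not the answer's code): backpropagating through one extra 16x16 hidden layer multiplies the upstream gradient by the transpose of its weight matrix (the ReLU mask only zeroes some entries), and with weights drawn at the question's scale of 1e-3 that factor is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, std = 16, 1e-3  # same width and init scale as in the question

g = rng.normal(size=depth)                      # upstream gradient
W1 = rng.normal(0.0, std, size=(depth, depth))  # the extra hidden layer
g_back = W1.T @ g                               # gradient after that layer

# Each pass through such a layer scales the gradient norm by roughly
# std * sqrt(depth), i.e. on the order of 4e-3 here.
shrink = np.linalg.norm(g_back) / np.linalg.norm(g)
print(shrink)
```

So every additional layer initialized this way cuts the gradient reaching W:0 by another factor of a few hundred, which matches the histograms above.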

Here is the helper code if you're interested:

for g, v in grads_and_vars:
    tf.summary.histogram(v.name, v)
    tf.summary.histogram(v.name + '_grad', g)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

_, summary = sess.run([train_op, merged], feed_dict={I: 2 * np.random.rand(1, 1) - 1})
if i % 10 == 0:
    writer.add_summary(summary, global_step=i)

Solution

All deep networks suffer from this to some extent, and there is no universal solution that will automatically fix any network. But there are techniques that can push it in the right direction. Initialization is one of them.

I replaced your normal initialization with the following:

W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)

There are many tutorials on Xavier initialization; see, for example, this one. Note that I set the bias initialization slightly positive to make sure that the ReLU outputs are positive for most neurons, at least in the beginning.
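The effect of the initialization scale on the forward signal can be sketched in a few lines of numpy (an illustrative toy, not the answer's code): push random inputs through three ReLU layers and compare the output spread for the question's std=1e-3 initializer versus a Xavier-style std of sqrt(2 / (fan_in + fan_out)).

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n_layers = 16, 3  # same width and depth scale as in the question

def output_std(weight_std):
    """Spread of activations after n_layers ReLU layers with the given init."""
    x = rng.normal(size=(1000, depth))  # batch of random inputs
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(depth, depth))
        x = np.maximum(x @ W, 0.0)      # linear layer + ReLU
    return x.std()

tiny = output_std(1e-3)                              # question's initializer
xavier = output_std(np.sqrt(2.0 / (depth + depth)))  # Xavier-style scale
print(tiny, xavier)  # the tiny init collapses the signal toward zero
```

With std=1e-3 the activations (and therefore the gradients) shrink by orders of magnitude per layer, while the Xavier-style scale keeps them at a workable magnitude.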

This immediately changed the image:

[Histograms: three-layer network with Xavier initialization]

The weights still don't move as fast as before, but they do move (note the scale of the W:0 values), and the gradient distribution is much less peaked at 0, which is much better.

Of course, this is not the end. To improve it further, you should implement a complete autoencoder, because currently only the reconstruction of the [0,0] element affects the loss, so most of the outputs are not used in the optimization. You can also play with different optimizers (Adam would be my choice) and learning rates.
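To illustrate why an adaptive optimizer like Adam helps in a small-gradient regime, here is a minimal numpy sketch on a hypothetical toy problem (not the networks above): the loss has tiny gradients because the parameter is scaled by 1e-3, and plain gradient descent barely moves, while Adam's per-parameter step normalization still makes progress.

```python
import numpy as np

def loss_grad(w):
    """Toy loss f(w) = (1e-3 * w - 1)^2 with deliberately tiny gradients."""
    r = 1e-3 * w - 1.0
    return r * r, 2e-3 * r

def sgd(lr=1e-3, steps=10000):
    w = 0.0
    for _ in range(steps):
        _, g = loss_grad(w)
        w -= lr * g  # step is proportional to the tiny raw gradient
    return loss_grad(w)[0]

def adam(lr=0.5, steps=10000, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        _, g = loss_grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # normalized step
    return loss_grad(w)[0]

l_sgd, l_adam = sgd(), adam()
print(l_sgd, l_adam)  # SGD stays near the initial loss; Adam drives it down
```

This is the same reason Adam is a reasonable default for the three-layer network above: its update size depends on the gradient's direction rather than its raw magnitude.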
