TL;DR: The deeper the neural network gets, the more attention you should pay to gradient flow (see this discussion of “vanishing gradients”). One particular thing to watch is variable initialization.
Problem analysis
I added tensor summaries for variables and gradients to both of your scenarios and got the following:
Two-layer network: [variable and gradient histograms]

Three-layer network: [variable and gradient histograms]
The charts show the distribution of the first-layer variable W:0 and how it changes from step 0 to step 1000. Indeed, the rate of change is much higher in the two-layer network. But I'd like to draw your attention to the gradient distribution, which is concentrated much more tightly around 0 in the three-layer network (the first variance is around 0.005, the second around 0.000002, i.e. 1000 times smaller). This is the vanishing gradient problem.
Here is the helper code if you're interested:
```python
# Histogram summaries for every variable and its gradient
for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)
  tf.summary.histogram(v.name + '_grad', g)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

# Inside the training loop: evaluate the summaries together with the train op
_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)
```
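To view the histograms, point TensorBoard at the same log directory (the name here matches the `FileWriter` above; adjust it if you log somewhere else):

```
tensorboard --logdir train_log_layer2
```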
Solution
All deep networks suffer from this to some extent, and there is no universal solution that will automatically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.
I replaced your plain initialization with the following:
```python
W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)
```
There are many tutorials on Xavier initialization; see, for example, this one. Note that I set the bias initialization slightly positive to make sure that the ReLU outputs are positive for most neurons, at least at the beginning.
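For context, this is roughly how the two initializers would be wired into a layer. The variable names and shapes below are placeholders, since I don't know your exact layer sizes:

```python
# Hypothetical first layer; names and shapes are illustrative, not your exact model.
W1 = tf.get_variable('W', shape=[1, 6], initializer=W_init)
b1 = tf.get_variable('b', shape=[6], initializer=b_init)
hidden = tf.nn.relu(tf.matmul(I, W1) + b1)
```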
This immediately changed the picture:

[variable and gradient histograms after Xavier initialization]
The weights still don't change as fast as before, but they do change (note the scale of the W:0 values), and the gradient distribution is much less peaked at 0, which is much better.
Of course, this is not the end. To improve further, you should implement a full autoencoder, because right now the loss is affected only by the reconstruction of the [0,0] element, so most of the outputs are not used in optimization. You can also play with different optimizers (Adam would be my choice) and the learning rate.
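For example, switching to Adam while still exposing `grads_and_vars` for the histogram summaries could look roughly like this (the learning rate is just an illustrative starting value, and `loss` is whatever loss tensor you already have):

```python
# Sketch only: swap in Adam and keep grads_and_vars available for the summaries above.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)
```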