MonitoredTrainingSession writes more than one metagraph event per run

When writing checkpoint files with tf.train.MonitoredTrainingSession, it somehow writes multiple metagraphs. What am I doing wrong?

I stripped it down to the following code:

    import tensorflow as tf

    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
    train = tf.assign(global_step, global_step + 1)
    saver = tf.train.Saver()
    hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                          save_steps=10,
                                          saver=saver)]

    with tf.train.MonitoredTrainingSession(master='',
                                           is_chief=True,
                                           checkpoint_dir=None,
                                           hooks=hooks,
                                           save_checkpoint_secs=None,
                                           save_summaries_steps=None,
                                           save_summaries_secs=None) as mon_sess:
        for i in range(30):
            if mon_sess.should_stop():
                break
            try:
                gs, _ = mon_sess.run([global_step, train])
                print(gs)
            except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
                break
            finally:
                pass

Running this produces duplicate metagraph events, as evidenced by the TensorBoard warning:

 $ tensorboard --logdir ../train/test1/ --port=6006 

    WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
    Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)

This is on TensorFlow 1.2.0 (and I cannot upgrade).

Doing the same thing without a monitored session gives the correct checkpoint output:

    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
    train = tf.assign(global_step, global_step + 1)
    saver = tf.train.Saver()
    init_op = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init_op)
        for i in range(30):
            gs, _ = sess.run([global_step, train])
            print(gs)
            if i % 10 == 0:
                saver.save(sess, output_path + '/test2/my-model', global_step=gs)
                print("Saved ckpt")

The result, with no TensorBoard warnings:

 $ tensorboard --logdir ../train/test2/ --port=6006 

    Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)

I would like to fix this, because I suspect I am missing something, and the warning may be connected to other problems I have in distributed mode. I also have to restart TensorBoard any time I want to refresh the data, and TensorBoard becomes very slow over time when it emits many of these warnings.
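To confirm what TensorBoard is complaining about, the events file written next to the checkpoints can be inspected directly. Here is a minimal sketch, assuming the events file sits in the test1/ckpt/ directory used above (the glob pattern is my guess at its filename):

    import glob
    import tensorflow as tf

    # Assumed location: the events file written alongside the checkpoints above.
    events_file = glob.glob(output_path + "test1/ckpt/events.out.tfevents.*")[0]

    graph_events, metagraph_events = 0, 0
    for event in tf.train.summary_iterator(events_file):
        # Each Event proto carries at most one payload; count the graph-related ones.
        if event.graph_def:
            graph_events += 1
        if event.meta_graph_def:
            metagraph_events += 1

    print("graph events:", graph_events, "metagraph events:", metagraph_events)

With a single clean run I would expect at most one graph event and one metagraph event here.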

There is a related question: tensorflow Found more than one graph event per run. In that case, the errors were caused by multiple runs (with different parameters) written to the same output directory. Here, it is a single run writing to a clean output directory.

Running the MonitoredTrainingSession version in distributed mode gives the same errors.

Oct-12 Update

@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:

    import tensorflow as tf

    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
    train = tf.assign(global_step, global_step + 1)
    saver = tf.train.Saver()
    hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                          save_steps=10,
                                          saver=saver)]

    chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                                master='',
                                                config=None,
                                                checkpoint_dir=None,
                                                checkpoint_filename_with_path=None)

    with tf.train.MonitoredSession(session_creator=chiefsession,
                                   hooks=hooks,
                                   stop_grace_period_secs=120) as mon_sess:
        for i in range(30):
            if mon_sess.should_stop():
                break
            try:
                gs, _ = mon_sess.run([global_step, train])
                print(gs)
            except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
                break
            finally:
                pass

Unfortunately, this still produces the same TensorBoard warnings:

 $ tensorboard --logdir ../train/test3/ --port=6006 

    WARNING:tensorflow:Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
    Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)

By the way, each code block is self-contained: copy-paste it into a Jupyter notebook to reproduce the problem.

python tensorflow google-cloud-ml
1 answer

I wonder if this is because every node in your cluster is running the same code, declaring itself the chief, and saving out graphs and checkpoints.
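As a rough sketch (the cluster spec and task index below are made up; in practice each node gets its own values from TF_CONFIG or command-line flags), only one node should end up with is_chief = True:

    import tensorflow as tf

    # Hypothetical cluster definition; each node would receive its own
    # job_name / task_index from TF_CONFIG or flags.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })
    job_name = "worker"
    task_index = 0

    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    # Only one worker declares itself chief; if every node sets is_chief=True,
    # every node writes its own graph/metagraph events.
    is_chief = (job_name == "worker" and task_index == 0)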

I don't know whether is_chief = True is just illustrative for the post here on Stack Overflow or exactly what you are using... so guessing a bit here.

I personally used MonitoredSession instead of MonitoredTrainingSession and built the hooks list based on whether the code is running on the master/chief node or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
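Roughly, the pattern looks like the sketch below. It reuses the global_step/train ops and output_path from the question and the is_chief/server names from the snippet above, so treat the exact arguments as illustrative:

    import tensorflow as tf

    saver = tf.train.Saver()
    hooks = []
    if is_chief:
        # Only the chief gets the saving hook, so only one node writes
        # checkpoints and the accompanying metagraph event.
        hooks.append(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "ckpt/",
                                                  save_steps=10,
                                                  saver=saver))
        session_creator = tf.train.ChiefSessionCreator(master=server.target)
    else:
        session_creator = tf.train.WorkerSessionCreator(master=server.target)

    with tf.train.MonitoredSession(session_creator=session_creator,
                                   hooks=hooks) as sess:
        for i in range(30):
            if sess.should_stop():
                break
            gs, _ = sess.run([global_step, train])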

