When writing checkpoint files with tf.train.MonitoredTrainingSession, the graph metadata somehow gets written more than once. What am I doing wrong?
I reduced the problem to the following code:
import tensorflow as tf

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                      save_steps=10,
                                      saver=saver)]

with tf.train.MonitoredTrainingSession(master='',
                                       is_chief=True,
                                       checkpoint_dir=None,
                                       hooks=hooks,
                                       save_checkpoint_secs=None,
                                       save_summaries_steps=None,
                                       save_summaries_secs=None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Running this produces duplicate graph metadata, as evidenced by the TensorBoard warning:
$ tensorboard --logdir ../train/test1/ --port=6006
WARNING:tensorflow: Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at http://localhost:6006 (Press CTRL+C to quit)
This is with TensorFlow 1.2.0 (I cannot upgrade).
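To see concretely what the warning refers to, here is a minimal diagnostic sketch (assuming TF 1.x and the same output_path as above) that counts how many events in the test1 events file carry a graph_def or a meta_graph_def:

import glob
import tensorflow as tf

# Count graph and metagraph events in every events file written by the run above.
for path in glob.glob(output_path + "test1/ckpt/events.out.tfevents.*"):
    graphs, metas = 0, 0
    for event in tf.train.summary_iterator(path):
        if event.HasField("graph_def"):
            graphs += 1
        elif event.HasField("meta_graph_def"):
            metas += 1
    print(path, "graph events:", graphs, "meta graph events:", metas)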
Performing the same task without a monitored session gives the correct checkpoint output:
import tensorflow as tf

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i % 10 == 0:
            saver.save(sess, output_path + '/test2/my-model', global_step=gs)
            print("Saved ckpt")
Output without TensorBoard warnings:
$ tensorboard --logdir ../train/test2/ --port=6006
Starting TensorBoard 54 at http://localhost:6006 (Press CTRL+C to quit)
I would like to fix this, as I suspect I am missing something, and this error may be connected with other problems I have in distributed mode. I have to restart TensorBoard any time I want to refresh the data. Moreover, TensorBoard becomes very slow over time when it emits many of these warnings.
There is a related question: tensorflow Found more than one graph event per run. In that case, the warnings were caused by multiple runs (with different parameters) written to the same output directory. Here, we are dealing with a single run into a clean output directory.
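For reference, a minimal sketch (assuming output_path as in the snippets above) that wipes the run directory before a run, to rule out leftover event files from earlier runs:

import os
import shutil

# Start every run from an empty directory so stale event files cannot trigger the warning.
run_dir = output_path + "test1/ckpt/"
if os.path.isdir(run_dir):
    shutil.rmtree(run_dir)
os.makedirs(run_dir)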
Running the MonitoredTrainingSession version in distributed mode gives the same warnings.
Oct-12 Update
@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:
import tensorflow as tf

global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                      save_steps=10,
                                      saver=saver)]
chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)

with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError) as e:
            break
        finally:
            pass
Unfortunately, this still gives the same warning in TensorBoard:
$ tensorboard --logdir ../train/test3/ --port=6006
WARNING:tensorflow: Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Starting TensorBoard 54 at http://localhost:6006 (Press CTRL+C to quit)
By the way, each code block is self-contained; copy-paste it into a Jupyter notebook to reproduce the problem.