Multiple sessions and graphs in TensorFlow (in the same process)

I am training a model whose input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()), all in the same process.

There is a significant amount of code and abstraction, so I am only pasting the relevant sections here.

Here is the restore code:

    self.variables = [var for var in all_vars if var.name.startswith(self.name)]
    self.saver = tf.train.Saver(self.variables, max_to_keep=3)
    self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))

    if should_restore:
        self.saver.restore(self.sess, self.save_path)
    else:
        self.sess.run(tf.initialize_variables(self.variables))

Each model is encapsulated in its own graph and session, like this:

    self.graph = tf.Graph()
    self.sess = tf.Session(graph=self.graph)

    with self.sess.graph.as_default():
        # Create variables and ops.

All variables within each model are created inside a variable_scope context manager.
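For context, here is a minimal sketch (an assumed structure, not the asker's actual code) of how the per-model graph, session, and variable scope fit together; the Model class and the single w variable are hypothetical placeholders:

    import tensorflow as tf

    class Model(object):
        def __init__(self, name):
            self.name = name
            self.graph = tf.Graph()
            self.sess = tf.Session(graph=self.graph)
            # Build everything inside this model's own graph and variable scope.
            with self.graph.as_default(), tf.variable_scope(self.name):
                # Variable names are prefixed with `self.name`, which is what
                # the startswith() filter in the restore code relies on.
                self.w = tf.get_variable("w", shape=[128, 10])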

Feeding works as follows:

  • A background thread calls sess.run(inference_op) on input = scipy.misc.imread(X) and puts the result into a blocking, thread-safe queue.
  • The main training loop reads from the queue and calls sess.run(train_op) on the second model (see the sketch after this list).
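A minimal sketch of this producer/consumer feeding pattern. It assumes hypothetical first_model and second_model objects (each with its own sess and an input placeholder), plus inference_op, train_op, loss_op, and image_paths names that are not in the original post:

    import threading
    import scipy.misc
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2

    q = queue.Queue(maxsize=10)  # blocking, thread-safe

    def producer(paths):
        for path in paths:
            image = scipy.misc.imread(path)
            # The first model's session runs on the background thread.
            features = first_model.sess.run(
                inference_op, feed_dict={first_model.input: image})
            q.put(features)  # blocks while the queue is full

    threading.Thread(target=producer, args=(image_paths,)).start()

    while True:
        features = q.get()  # blocks until the producer delivers an item
        # The second model's session runs on the main thread.
        _, loss = second_model.sess.run(
            [train_op, loss_op], feed_dict={second_model.input: features})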

Problem:
I observe that the loss values, even in the very first training iteration of the second model, keep changing across runs (and become NaN within a few iterations). I confirmed that the output of the first model is exactly the same every time. When I comment out the sess.run of the first model and replace it with identical input read from a pickled file, this behavior does not occur.

This is train_op:

    loss_op = tf.nn.sparse_softmax_cross_entropy(network.feedforward())

    # Apply gradients.
    with tf.control_dependencies([loss_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(loss_op)
        apply_gradient_op = opt.apply_gradients(grads)
    return apply_gradient_op

I know this is vague, but I am happy to provide more detailed information. Any help is appreciated!

1 answer

The problem is indeed caused by the concurrent execution of different session objects. I moved the first model's sess.run from the background thread to the main thread, repeated the controlled experiment several times (training ran for more than 24 hours and reached convergence), and never observed NaN. In contrast, the concurrent execution diverges the model within a few minutes.

I have modified my code to use a single, shared session object for all models.
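A minimal, self-contained sketch of that fix, assuming both models can be built into a single graph (each under its own variable scope) and driven by one session on the main thread; the tiny placeholder networks below are hypothetical stand-ins for the real models:

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        with tf.variable_scope("first_model"):
            x = tf.placeholder(tf.float32, [None, 4], name="input")
            w1 = tf.get_variable("w", shape=[4, 8])
            inference_op = tf.matmul(x, w1)
        with tf.variable_scope("second_model"):
            labels = tf.placeholder(tf.int64, [None], name="labels")
            w2 = tf.get_variable("w", shape=[8, 3])
            logits = tf.matmul(inference_op, w2)
            loss_op = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(
                    logits=logits, labels=labels))
            train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss_op)
        init_op = tf.initialize_all_variables()

    # One graph, one session, every sess.run call on the main thread.
    sess = tf.Session(graph=graph)
    sess.run(init_op)

Alternatively, keeping two separate graphs but funneling both sess.run calls through the main thread, as described above, also avoids the concurrent execution.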
