Transfer learning with the tf.estimator.Estimator framework

I am trying to do transfer learning with an Inception-ResNet v2 model pretrained on ImageNet, using my own dataset and classes. My original code base was a modification of a tf.slim sample that I can no longer find, and now I'm trying to rewrite the same code with the tf.estimator.* framework.

However, I am running into the problem of loading only some of the weights from the pretrained checkpoint, while initializing the remaining layers with their default initializers.

Researching the problem, I found this GitHub issue and this question, both mentioning the need to use tf.train.init_from_checkpoint in my model_fn. I tried, but given the lack of examples in both, I think I'm getting something wrong.

This is my minimal example:

    import sys
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf
    import numpy as np

    import inception_resnet_v2

    NUM_CLASSES = 900
    IMAGE_SIZE = 299


    def input_fn(mode, num_classes, batch_size=1):
        # some code that loads images, reshapes them to 299x299x3 and batches them
        return (tf.constant(np.zeros([batch_size, 299, 299, 3], np.float32)),
                tf.one_hot(tf.constant(np.zeros([batch_size], np.int32)), NUM_CLASSES))


    def model_fn(images, labels, num_classes, mode):
        with tf.contrib.slim.arg_scope(inception_resnet_v2.inception_resnet_v2_arg_scope()):
            logits, end_points = inception_resnet_v2.inception_resnet_v2(
                images, num_classes, is_training=(mode == tf.estimator.ModeKeys.TRAIN))
        predictions = {
            'classes': tf.argmax(input=logits, axis=1),
            'probabilities': tf.nn.softmax(logits, name='softmax_tensor')
        }

        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

        exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
        variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)
        scopes = {os.path.dirname(v.name) for v in variables_to_restore}
        tf.train.init_from_checkpoint('inception_resnet_v2_2016_08_30.ckpt',
                                      {s + '/': s + '/' for s in scopes})

        tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
        total_loss = tf.losses.get_total_loss()  # obtain the regularization losses as well

        # Configure the training op
        if mode == tf.estimator.ModeKeys.TRAIN:
            global_step = tf.train.get_or_create_global_step()
            optimizer = tf.train.AdamOptimizer(learning_rate=0.00002)
            train_op = optimizer.minimize(total_loss, global_step)
        else:
            train_op = None

        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=predictions,
            loss=total_loss,
            train_op=train_op)


    def main(unused_argv):
        # Create the Estimator
        classifier = tf.estimator.Estimator(
            model_fn=lambda features, labels, mode: model_fn(features, labels, NUM_CLASSES, mode),
            model_dir='model/MCVE')

        # Train the model
        classifier.train(
            input_fn=lambda: input_fn(tf.estimator.ModeKeys.TRAIN, NUM_CLASSES, batch_size=1),
            steps=1000)

        # Evaluate the model and print results
        eval_results = classifier.evaluate(
            input_fn=lambda: input_fn(tf.estimator.ModeKeys.EVAL, NUM_CLASSES, batch_size=1))
        print()
        print('Evaluation results:\n %s' % eval_results)


    if __name__ == '__main__':
        tf.app.run(main=main, argv=[sys.argv[0]])

where inception_resnet_v2 is the model implementation from the TensorFlow models repository.

If I run this script, I get a bunch of informational logs from init_from_checkpoint, but then, at session creation time, it seems to attempt to load the Logits weights from the checkpoint and fails with a shape incompatibility. This is the full traceback:

    Traceback (most recent call last):
      File "<ipython-input-6-06fadd69ae8f>", line 1, in <module>
        runfile('C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py', wdir='C:/Users/1/Desktop/transfer_learning_tutorial-master')
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
        execfile(filename, namespace)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
      File "C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py", line 77, in <module>
        tf.app.run(main=main, argv=[sys.argv[0]])
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
        _sys.exit(main(_sys.argv[:1] + flags_passthrough))
      File "C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py", line 68, in main
        steps=1000)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 302, in train
        loss = self._train_model(input_fn, hooks, saving_listeners)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 780, in _train_model
        log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 368, in MonitoredTrainingSession
        stop_grace_period_secs=stop_grace_period_secs)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 673, in __init__
        stop_grace_period_secs=stop_grace_period_secs)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 493, in __init__
        self._sess = _RecoverableSession(self._coordinated_creator)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 851, in __init__
        _WrappedSession.__init__(self, self._create_session())
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 856, in _create_session
        return self._sess_creator.create_session()
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 554, in create_session
        self.tf_sess = self._session_creator.create_session()
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 428, in create_session
        init_fn=self._scaffold.init_fn)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\session_manager.py", line 279, in prepare_session
        sess.run(init_op, feed_dict=init_feed_dict)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
        run_metadata_ptr)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
        feed_dict_tensor, options, run_metadata)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
        options, run_metadata)
      File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
        raise type(e)(node_def, op, message)
    InvalidArgumentError: Assign requires shapes of both tensors to match.
    lhs shape= [900] rhs shape= [1001]
         [[Node: Assign_1145 = Assign[T=DT_FLOAT, _class=["loc:@InceptionResnetV2/Logits/Logits/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](InceptionResnetV2/Logits/Logits/biases, checkpoint_initializer_1145)]]
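For diagnosis, here is a sketch of how the shapes stored in the checkpoint can be inspected (this assumes tf.train.list_variables is available in the installed TF version and that the checkpoint file sits in the working directory). The [1001] dimension is the checkpoint's original ImageNet classification head, which cannot be assigned to the new 900-class Logits layer:

    import tensorflow as tf

    # Print the (name, shape) pairs of the Logits variables stored in the checkpoint.
    for name, shape in tf.train.list_variables('inception_resnet_v2_2016_08_30.ckpt'):
        if 'Logits' in name:
            print(name, shape)
    # Expected output includes e.g.:
    #   InceptionResnetV2/Logits/Logits/biases [1001]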

What am I doing wrong when using init_from_checkpoint? How exactly is it meant to be used in a model_fn? And why does the Estimator try to load the Logits weights from the checkpoint when I explicitly tell it not to?

Update:

Following the suggestion in the comments, I tried alternative ways of calling tf.train.init_from_checkpoint.

Using {v.name: v.name}

If, as suggested in the comment, I replace the assignment map with {v.name: v.name for v in variables_to_restore}, I get this error:

 ValueError: Assignment map with scope only name InceptionResnetV2/Conv2d_2a_3x3 should map to scope only InceptionResnetV2/Conv2d_2a_3x3/weights:0. Should be 'scope/': 'other_scope/'. 

Using {v.name: v}

If instead I try a name-to-variable mapping, I get the following error:

 ValueError: Tensor InceptionResnetV2/Conv2d_2a_3x3/weights:0 is not found in inception_resnet_v2_2016_08_30.ckpt checkpoint {'InceptionResnetV2/Repeat_2/block8_4/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_mean': [256], 'InceptionResnetV2/Repeat/block35_9/Branch_0/Conv2d_1x1/BatchNorm/beta': [32], ... 

The error goes on to enumerate what I believe are all the variable names in the checkpoint (or could they be the mappings instead?).
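The tf.train.list_variables sketch above can settle this; each entry turns out to be a (name, shape) pair, i.e. the checkpoint's variable names with their shapes, not mappings:

    import tensorflow as tf

    # Each entry is a (name, shape) pair: a checkpoint variable name plus its shape.
    names = [name for name, shape in tf.train.list_variables('inception_resnet_v2_2016_08_30.ckpt')]
    print(len(names), names[:3])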

Update (2)

Inspecting the last error above, I see that InceptionResnetV2/Conv2d_2a_3x3/weights is in the list of checkpoint variables. The problem is the :0 at the end! I'll now verify whether this indeed solves the problem and post the answer if it does.

python tensorflow tensorflow-estimator
1 answer

Thanks to @KathyWu's comment, I got on the right track and found the problem.

Indeed, the way I was computing the scopes included the InceptionResnetV2/ scope itself, which triggers a load of all variables "under" that scope (that is, all the variables in the network). Replacing this with the correct dictionary, however, was not trivial.

Of the possible assignment maps init_from_checkpoint accepts, the one I had to use was the 'scope_variable_name': variable one, but without using the actual variable.name attribute.
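For reference, a short sketch of that form, paraphrasing the assignment-map variants listed in the TF 1.x documentation; the toy variable and its [3, 3, 32, 32] shape here are assumptions made only for illustration:

    import tensorflow as tf

    # Assignment-map forms accepted by tf.train.init_from_checkpoint
    # (paraphrased from the TF 1.x docs):
    #   {'ckpt_scope/': 'scope/'}            # all variables under a scope, matched by prefix
    #   {'ckpt_var_name': 'scope/var_name'}  # a single variable, by name
    #   {'ckpt_var_name': variable}          # a single tf.Variable object
    #   {'/': 'scope/'}                      # everything from the checkpoint root

    # The 'ckpt_var_name': variable form, with a toy stand-in variable
    # (the shape is assumed to match the checkpoint entry):
    with tf.variable_scope('my_scope'):
        w = tf.get_variable('weights', shape=[3, 3, 32, 32])
    tf.train.init_from_checkpoint(
        'inception_resnet_v2_2016_08_30.ckpt',
        {'InceptionResnetV2/Conv2d_2a_3x3/weights': w})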

variable.name looks like 'some_scope/variable_name:0'. That :0 is not part of the checkpoint variable's name, so using scopes = {v.name: v.name for v in variables_to_restore} results in a "Variable not found" error.
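A minimal sketch of the difference (the scope and variable names here are toy placeholders, not from the actual network):

    import tensorflow as tf

    with tf.variable_scope('some_scope'):
        v = tf.get_variable('variable_name', shape=[1])

    print(v.name)                # some_scope/variable_name:0  <- graph tensor name
    print(v.name.split(':')[0])  # some_scope/variable_name    <- name as stored in a checkpoint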

The trick to make it work was stripping the tensor index off the name:

    tf.train.init_from_checkpoint('inception_resnet_v2_2016_08_30.ckpt',
                                  {v.name.split(':')[0]: v for v in variables_to_restore})
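For completeness, in the model_fn of the question the relevant block then becomes (a sketch; the exclude list is unchanged, only the assignment map differs):

    exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
    variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)
    # Map checkpoint variable names (no ':0' suffix) to the tf.Variable objects.
    tf.train.init_from_checkpoint(
        'inception_resnet_v2_2016_08_30.ckpt',
        {v.name.split(':')[0]: v for v in variables_to_restore})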
