Is the sample code for processing gradients in TensorFlow only applicable to gradient-descent-style optimizers, or can it be used with any optimizer?

I was looking at the sample code TensorFlow provides for processing gradients:

    # Create an optimizer.
    opt = GradientDescentOptimizer(learning_rate=0.1)

    # Compute the gradients for a list of variables.
    grads_and_vars = opt.compute_gradients(loss, <list of variables>)

    # grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
    # need to the 'gradient' part, for example cap them, etc.
    capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

    # Ask the optimizer to apply the capped gradients.
    opt.apply_gradients(capped_grads_and_vars)

However, I noticed that apply_gradients comes from GradientDescentOptimizer. Does that mean the sample above can only apply the gradient descent update rule (note that we could swap opt = GradientDescentOptimizer for Adam or any other optimizer)? In particular, what does apply_gradients actually do? I did check the code on the TF GitHub page, but it was a lot of Python with no obvious connection to the mathematical update equations, so it was hard to tell what it does and how it changes from optimizer to optimizer.

For example, if I wanted to implement my own custom optimizer that might use gradients (or might not, perhaps changing the weights directly with some rule, say a more biologically plausible one), is that impossible with the code sample above?


Specifically, I wanted to implement a version of gradient descent that is artificially restricted to a compact domain. That is, I wanted to implement the following update equation:

 w := (w - mu*grad + eps) mod B 

in TensorFlow. I realized that the mod distributes over the terms, provided the sum of the residues is reduced once more:

 w := (w mod B - (mu*grad) mod B + eps mod B) mod B 
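A quick sanity check of this identity in plain Python (integer values chosen by me for exactness, with `step` standing in for mu*grad) shows why the outer mod is essential:

```python
# Check: (w - step + eps) mod B == (w mod B - step mod B + eps mod B) mod B.
# Values are arbitrary; Python's % always returns a result in [0, B) for B > 0.
w, step, eps, B = 3, 4, 0, 5

lhs = (w - step + eps) % B                 # combined update: (3 - 4) mod 5 = 4
rhs = (w % B - step % B + eps % B) % B     # residues summed, then reduced again
no_outer_mod = w % B - step % B + eps % B  # without the outer mod: -1, not in [0, B)

print(lhs, rhs, no_outer_mod)  # 4 4 -1
```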

so I thought I could just implement it by doing:

    def Process_grads(g, mu_noise, stddev_noise, B):
        return (g + tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)) % B

and then just:

    processed_grads_and_vars = [(Process_grads(gv[0], mu_noise, stddev_noise, B), gv[1])
                                for gv in grads_and_vars]
    # Ask the optimizer to apply the processed gradients.
    opt.apply_gradients(processed_grads_and_vars)

However, I realized that this is not enough, because at that point I don't actually have access to w, so I can't implement:

 w mod B 

at least not the way I tried. Is there any way to do this, i.e. to actually change the update rule itself, at least along the lines I tried?

I know my update rule is hacky, but my point is changing the update equation, not this particular rule (so don't get hung up on it if it seems a little weird).


I came up with a very hacky solution:

    def manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise):
        with tf.variable_scope(arg.mdl_scope_name, reuse=True):
            W_var = tf.get_variable(name='W')
            eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
            # apply w := (w - mu*g + eps) mod B  (here with B = 20)
            W_new = tf.mod(W_var - learning_rate*g + eps, 20)
            sess.run(W_var.assign(W_new))  # assumes a session `sess` in scope

    def manual_GDL(arg, loss, learning_rate, mu_noise, stddev_noise, compact, B):
        # Compute the gradients for a list of variables
        # (assumes an optimizer `opt` in scope).
        grads_and_vars = opt.compute_gradients(loss)
        # Process the gradients.
        processed_grads_and_vars = [(manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise), v)
                                    for g, v in grads_and_vars]

I'm not sure it works, but something along these lines should. The idea is to write out the equation you want to use for the update (in TensorFlow) and then apply it to the weights manually using a session.

Unfortunately, such a solution means we have to take care of annealing ourselves (decaying the learning rate manually, which seems annoying). This solution probably has many other problems; feel free to point them out (and suggest fixes if you can).


For this very simple update rule, I realized one can just run a normal optimizer step and then take the mod of the weights and reassign them:

    sess.run(fetches=train_step)
    if arg.compact:
        # apply w := (w - mu*g + eps) mod B
        W_val = W_var.eval()
        W_new = tf.mod(W_var, arg.B).eval()
        W_var.assign(W_new).eval()
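The reason a step-then-wrap scheme like this can work at all: an ordinary SGD step followed by a mod gives exactly the combined update. A small NumPy sketch (values are my own, noise set to zero for determinism):

```python
import numpy as np

# Hypothetical weights, gradient, and hyperparameters.
B, mu = 5.0, 0.1
w0 = np.array([4.9, 0.2, 3.7])
grad = np.array([2.0, -1.0, 0.5])
eps = np.zeros_like(w0)  # noise omitted so both paths are deterministic

combined = (w0 - mu * grad + eps) % B  # w := (w - mu*grad + eps) mod B in one go

stepped = w0 - mu * grad + eps         # ordinary SGD step first...
wrapped = stepped % B                  # ...then wrap the result into [0, B)

assert np.allclose(combined, wrapped)
```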

but in this case it is a coincidence that such a simple solution exists (and unfortunately it bypasses the whole point of my question).

Also, these solutions slow the code down significantly. For now, this is the best I have.


For reference, I have seen this question: How to create an optimizer in Tensorflow, but it does not directly answer my question.

2 answers

Indeed, you are somewhat limited and cannot do just anything. However, what you want is easy to achieve by creating a child class of the TensorFlow Optimizer class.

All you have to do is write an _apply_dense method for your class. The _apply_dense method takes grad and var (the variable w) as arguments, so whatever you want to do with them, you can do there.

Look here for an example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py This is the implementation of Adam in TensorFlow; all you have to do is change _apply_dense on line 131, as well as _prepare and _finish.

So for example:

    def _apply_dense(self, grad, var):
        B = math_ops.cast(self.B, var.dtype.base_dtype)
        eps = math_ops.cast(self.eps, var.dtype.base_dtype)
        mu = math_ops.cast(self.mu, var.dtype.base_dtype)

        var_update = state_ops.assign(var,
                                      tf.floormod(var - mu*grad + eps, B),
                                      use_locking=self._use_locking)
        return var_update
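To make that concrete, here is a sketch of how such a method might be wired into a full subclass. The class name `CompactGDOptimizer` and its constructor arguments are my own invention, not part of TensorFlow; this assumes the TF1-style `tf.train.Optimizer` base class:

```python
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_ops, state_ops
from tensorflow.python.training import optimizer

class CompactGDOptimizer(optimizer.Optimizer):
    """Sketch of w := (w - mu*grad + eps) mod B; names are hypothetical."""

    def __init__(self, mu=0.1, eps_stddev=0.01, B=20.0,
                 use_locking=False, name="CompactGD"):
        super(CompactGDOptimizer, self).__init__(use_locking, name)
        self._mu = mu
        self._eps_stddev = eps_stddev
        self._B = B

    def _prepare(self):
        # Convert Python-number hyperparameters to tensors once per minimize().
        self._mu_t = ops.convert_to_tensor(self._mu, name="mu")
        self._B_t = ops.convert_to_tensor(self._B, name="B")

    def _apply_dense(self, grad, var):
        mu = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        B = math_ops.cast(self._B_t, var.dtype.base_dtype)
        eps = tf.random.normal(tf.shape(grad), stddev=self._eps_stddev,
                               dtype=var.dtype.base_dtype)
        # The whole custom update rule lives here: wrap the step into [0, B).
        return state_ops.assign(var,
                                tf.math.floormod(var - mu * grad + eps, B),
                                use_locking=self._use_locking)
```

It would then be used like any other optimizer, e.g. `train_step = CompactGDOptimizer(mu=0.1, B=20.0).minimize(loss)`.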

Your solution slows the code down because you use sess.run and .eval() calls while creating train_step. Instead, you should build the train_step graph using only internal TensorFlow ops (no sess.run or .eval()). Afterwards, you only evaluate train_step in the loop.

If you don't want to use any standard optimizer, you can write your own "apply gradients" graph. Here is one possible solution:

    learning_rate = tf.Variable(tf.constant(0.1))
    mu_noise = 0.
    stddev_noise = 0.01

    # add all your W variables here when you have more than one:
    train_w_vars_list = [W]
    grad = tf.gradients(some_loss, train_w_vars_list)

    assign_list = []
    for g, v in zip(grad, train_w_vars_list):
        eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
        assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))

    # also update the learning rate here if you want to:
    assign_list.append(learning_rate.assign(learning_rate - 0.001))

    train_step = tf.group(*assign_list)

You can also use one of the standard optimizers to create the grads_and_vars list, and then iterate over it instead of zip(grad, train_w_vars_list).

Here is a simple MNIST example using your update rule:

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    from tensorflow.examples.tutorials.mnist import input_data

    import tensorflow as tf

    # Import data
    mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)

    # Create the model
    x = tf.placeholder(tf.float32, [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    y = tf.matmul(x, W)

    # Define loss and optimizer
    y_ = tf.placeholder(tf.float32, [None, 10])

    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

    learning_rate = tf.Variable(tf.constant(0.1))
    mu_noise = 0.
    stddev_noise = 0.01

    # add all your W variables here when you have more than one:
    train_w_vars_list = [W]
    grad = tf.gradients(cross_entropy, train_w_vars_list)

    assign_list = []
    for g, v in zip(grad, train_w_vars_list):
        eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
        assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))

    # also update the learning rate here if you want to:
    assign_list.append(learning_rate.assign(learning_rate - 0.001))

    train_step = tf.group(*assign_list)

    sess = tf.InteractiveSession()
    tf.global_variables_initializer().run()

    # Train
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    # Test trained model
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))
