TensorFlow or Theano: how do they know the derivative of the loss function based on the neural network graph?

In TensorFlow or Theano, you only tell the library how your neural network is defined, that is, how the forward pass should be computed.

For example, in TensorFlow you would write:

    with graph.as_default():
        _X = tf.constant(X)
        _y = tf.constant(y)
        hidden = 20
        w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
        b0 = tf.Variable(tf.truncated_normal([hidden]))
        h = tf.nn.softmax(tf.matmul(_X, w0) + b0)
        w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
        b1 = tf.Variable(tf.truncated_normal([1]))
        yp = tf.nn.softmax(tf.matmul(h, w1) + b1)
        loss = tf.reduce_mean(0.5 * tf.square(yp - _y))
        optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

I use the L2 loss, C = 0.5 * sum((y - yp)^2), so at the backpropagation stage its derivative with respect to the network output, dC/dyp = yp - y, is supposed to be calculated. See equation (30) in this book.
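
As a sanity check on that derivative (plain NumPy, nothing TensorFlow-specific; the names loss_fn, y, yp below are just illustrative), the analytic gradient yp - y can be compared against a central finite difference:

    import numpy as np

    def loss_fn(yp, y):
        # C = 0.5 * sum((yp - y)^2)
        return 0.5 * np.sum((yp - y) ** 2)

    rng = np.random.default_rng(0)
    y = rng.normal(size=5)
    yp = rng.normal(size=5)

    analytic = yp - y  # dC/dyp
    eps = 1e-6
    numeric = np.array([
        (loss_fn(yp + eps * e, y) - loss_fn(yp - eps * e, y)) / (2 * eps)
        for e in np.eye(5)
    ])

    print(np.allclose(analytic, numeric))  # True: the analytic form is exact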

My question is: how does TensorFlow (or Theano) know the analytic derivatives it needs for backpropagation? Do they compute an approximation instead? Or do they somehow avoid using the derivative at all?

I took the Udacity deep learning course that uses TensorFlow, but I still can't figure out how these libraries actually work.

1 answer

Differentiation occurs on the last line:

  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss) 

When you call the minimize() method, TensorFlow identifies the set of variables that loss depends on and computes a gradient for each of them. The differentiation code lives in ops/gradients.py and uses reverse accumulation (reverse-mode automatic differentiation). In essence, it walks backwards through the dataflow graph from the loss tensor to the variables, applying the chain rule at each operator. TensorFlow ships "gradient functions" for most differentiable operators, and you can see examples of how they are implemented in ops/math_grad.py. A gradient function takes the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs, and produces the gradients for each of its inputs.
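
If it helps to see those phases spelled out: minimize() is essentially compute_gradients() followed by apply_gradients(). Here is a rough sketch against the TF 1.x graph API, assuming the loss and the variables w0, b0, w1, b1 from the question are in scope:

    opt = tf.train.GradientDescentOptimizer(0.5)

    # Phase 1: reverse-mode autodiff. TensorFlow walks the graph backwards from
    # `loss`, calling the registered gradient function of each op it encounters.
    grads_and_vars = opt.compute_gradients(loss, var_list=[w0, b0, w1, b1])
    # (tf.gradients(loss, [w0, b0, w1, b1]) returns the same gradients directly.)

    # Phase 2: apply the gradients as one descent step.
    train_op = opt.apply_gradients(grads_and_vars)

    # The per-op gradient functions mentioned above look roughly like the entries
    # in ops/math_grad.py; e.g. for the Square op (illustrative sketch only):
    #
    #   @ops.RegisterGradient("Square")
    #   def _SquareGrad(op, grad):
    #       x = op.inputs[0]
    #       return grad * (2.0 * x)  # chain rule: d(x^2)/dx = 2x, scaled by the upstream gradient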

Page 7 of Ilya Sutskever's Ph.D. thesis has a good explanation of how this process works in general.
