I have two implementations of a function that, for each channel vector of a 4D tensor x, forms the outer product of the vector with itself, squares it elementwise, and sums the off-diagonal entries (the squared Frobenius norm minus the diagonal). The function is applied to every length-`channels` vector along dimension 3 of x, and all the results are summed. I use this as part of a convolutional network. My TensorFlow version is 0.9.
My first implementation uses the `tf.batch_*` functions:
```python
def test1(x):
    """x: [batch, height, width, channels]"""
    s = x.get_shape().as_list()
    a = tf.reshape(x, [-1, s[3], 1])
    c = tf.batch_matmul(a, a, adj_y=True)
    c2 = tf.square(c)
    diag = tf.batch_matrix_diag_part(c2)
    return tf.reduce_sum(c2) - tf.reduce_sum(diag)
```
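For reference, here is a NumPy sketch (my own illustration, not part of the original code) of the per-pixel quantity test1 computes. The function name is hypothetical. Note that algebraically the result equals (Σ vᵢ²)² − Σ vᵢ⁴, which is handy for checking either implementation:

```python
import numpy as np

def off_diag_sq_sum(v):
    """Sum of squared off-diagonal entries of the outer product v v^T."""
    c = np.outer(v, v)            # [channels, channels]
    c2 = c ** 2
    return c2.sum() - np.trace(c2)

v = np.array([1.0, 2.0, 3.0])
a = off_diag_sq_sum(v)                       # via the explicit outer product
b = (v ** 2).sum() ** 2 - (v ** 4).sum()     # via the closed form
# the two forms agree
```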
This works, but the intermediate tensor c is `channels` times larger than the tensor x, which limits the batch size. So I tried a map_fn-based approach:
```python
def fn(x):
    x1 = tf.reshape(x, [-1, 1])
    c1 = tf.matmul(x1, x1, transpose_b=True)
    c2 = tf.square(c1)
    t1 = tf.trace(c2)
    return tf.reduce_sum(c2) - t1

def test2(x):
    """x: [batch, height, width, channels]"""
    s = x.get_shape().as_list()
    a = tf.reshape(x, [-1, s[3]])
    return tf.reduce_sum(tf.map_fn(fn, a))
```
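To make the intent of test2 concrete, here is a NumPy sketch (names are my own, for illustration only): fn is applied independently to each row of `a` (one pixel's channel vector) and the per-row results are summed. The same total can also be computed without any per-row mapping, using the identity noted above:

```python
import numpy as np

def fn_np(v):
    """NumPy analogue of fn: off-diagonal sum of squares of v v^T."""
    c = np.outer(v, v)
    c2 = c ** 2
    return c2.sum() - np.trace(c2)

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 4))   # 5 "pixels", 4 channels (toy sizes)

# what map_fn does: apply fn to each row, then sum
total = sum(fn_np(row) for row in a)

# fully vectorized equivalent: (sum v_i^2)^2 - sum v_i^4, per row
total_vec = ((a ** 2).sum(axis=1) ** 2 - (a ** 4).sum(axis=1)).sum()
```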
When I run the second function, I get many (50+) log messages like:
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 16084 get requests, put_count=20101 evicted_count=4000 eviction_rate=0.198995 and unsatisfied allocation rate=0
The execution time of test2 is approximately 45 times longer than the execution time of test1.
With parallel_iterations=10, the memory usage of map_fn should be only about 10 * channels * channels, which is much lower than what test1 needs.
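The memory argument can be sketched with rough arithmetic (my own estimate, assuming float32 and illustrative shapes, which are not from the original post):

```python
# Hypothetical shapes for illustration
batch, height, width, channels = 32, 64, 64, 64
bytes_per_float = 4  # float32

# test1 materializes c with shape [batch*height*width, channels, channels]
test1_c = batch * height * width * channels * channels * bytes_per_float

# map_fn with parallel_iterations=10 should only need ~10 outer products
# of shape [channels, channels] in flight at once
map_fn_c = 10 * channels * channels * bytes_per_float

print(test1_c / 2**20, "MiB vs", map_fn_c / 2**20, "MiB")
```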
So the question is: why does the map_fn approach take so much more time, and why does it appear to use more memory rather than less?