TensorFlow: GPU utilization is almost always 0%

I use TensorFlow with Titan X GPUs, and I noticed that when I run the CIFAR10 example, the Volatile GPU-Utilization is fairly constant around 30%, whereas when I train my own model it is far from steady: it is almost always at 0%, spikes to 80/90%, and then drops back to 0%, over and over again.

I thought this behavior was due to the way I fed data to the network (fetching the data after each step, which took some time). But even after implementing a queue to feed the data and avoid this latency between steps, the problem persisted (see the queuing system below).
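For reference, this is roughly what the original per-step feeding loop looked like (a sketch from memory; data.get_next_batch, batch, x, y and train_op are the same names as in the queue-based code below):

    # Naive loop: the GPU is idle while get_next_batch runs on the CPU,
    # and only busy during the sess.run call.
    for _ in xrange(max_iter):
        X, Y = data.get_next_batch(batch)
        sess.run(train_op, feed_dict={x: X, y: Y})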

Any ideas?

    import threading
    import tensorflow as tf

    batch = 128  # size of the batch

    x = tf.placeholder("float32", [None, n_steps, n_input])
    y = tf.placeholder("float32", [None, n_classes])

    # With a capacity of 100 batches, the bottleneck should not be the data feeding.
    queue = tf.RandomShuffleQueue(capacity=100*batch,
                                  min_after_dequeue=80*batch,
                                  dtypes=[tf.float32, tf.float32],
                                  shapes=[[n_steps, n_input], [n_classes]])
    enqueue_op = queue.enqueue_many([x, y])
    X_batch, Y_batch = queue.dequeue_many(batch)

    sess = tf.Session()

    # Background thread that keeps filling the queue with new batches.
    def load_and_enqueue(data):
        while True:
            X, Y = data.get_next_batch(batch)
            sess.run(enqueue_op, feed_dict={x: X, y: Y})

    train_thread = threading.Thread(target=load_and_enqueue, args=(data,))
    train_thread.daemon = True
    train_thread.start()

    for _ in xrange(max_iter):
        sess.run(train_op)
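One way to check whether the queue ever fills up is to log its size during training (a sketch, assuming the same sess and queue as above; the logging interval is arbitrary):

    queue_size_op = queue.size()  # number of elements currently in the queue
    for step in xrange(max_iter):
        sess.run(train_op)
        if step % 100 == 0:
            print("step %d, queue size: %d" % (step, sess.run(queue_size_op)))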
1 answer

After some experimentation, I found the answer, so I am posting it here since it might be useful to someone else.

First, get_next_batch is approximately 15 times slower than train_op (thanks to Eric Plato for pointing this out).
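This kind of ratio can be measured with a rough timing along the following lines (a sketch; it assumes the queue already holds enough examples for train_op to run without blocking, and a single measurement is only indicative):

    import time

    t0 = time.time()
    X, Y = data.get_next_batch(batch)   # CPU-side data preparation
    t1 = time.time()
    sess.run(train_op)                   # one training step, fed from the queue
    t2 = time.time()
    print("get_next_batch: %.3fs, train_op: %.3fs" % (t1 - t0, t2 - t1))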

However, I thought that the queue would be filled up to capacity and that training would only start after that. So I figured that even if get_next_batch was much slower, the queue would hide this latency, at least at the beginning, since it would hold capacity examples and would only have to wait for new data once it dropped down to min_after_dequeue, which is lower than capacity, and that this would result in a fairly steady GPU utilization.

But in fact, training begins as soon as the queue holds min_after_dequeue examples. Thus, the queue starts being dequeued for train_op as soon as it reaches min_after_dequeue examples, and since feeding the queue is 15 times slower than train_op, each iteration removes a full batch while the producer only has time to put back about a fifteenth of it. So the number of elements in the queue drops below min_after_dequeue right after the first iteration of train_op, and train_op then has to wait until the queue gets back up to min_after_dequeue examples.

When I force train_op to wait until the queue is filled up to capacity (with capacity=100*batch) instead of starting automatically once it reaches min_after_dequeue (with min_after_dequeue=80*batch), the GPU utilization is stable for about 10 seconds before going back to 0%, which is understandable since the queue drains down to min_after_dequeue in less than 10 seconds.
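Concretely, this waiting can be done by polling the queue size before entering the training loop (a sketch; the one-second polling interval is arbitrary, not from the original setup):

    import time

    size_op = queue.size()
    # Block until the queue is filled to capacity before the first train_op.
    while sess.run(size_op) < 100 * batch:
        time.sleep(1)

    for _ in xrange(max_iter):
        sess.run(train_op)

Of course, this only postpones the stall: as long as feeding the queue stays about 15 times slower than train_op, the queue will drain again once training starts.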
