I use TensorFlow with Titan X GPUs, and I noticed that when I run the CIFAR-10 example, the volatile GPU utilization is fairly constant at around 30%, whereas when I train my own model, the volatile GPU utilization is far from stable: it sits at 0% almost all the time, spikes to 80-90%, and then drops back to 0%, over and over.
I thought this behavior was caused by the way I fed data to the network (I fetched the data after each step, which takes some time). But even after implementing a queue to feed the data and avoid this delay between steps, the problem persisted (see the queuing code below).
Any idea?
import threading
import tensorflow as tf

batch = 128  # size of the batch

x = tf.placeholder("float32", [None, n_steps, n_input])
y = tf.placeholder("float32", [None, n_classes])

# With a capacity of 100 batches, the bottleneck should not be the data feeding.
queue = tf.RandomShuffleQueue(capacity=100 * batch,
                              min_after_dequeue=80 * batch,
                              dtypes=[tf.float32, tf.float32],
                              shapes=[[n_steps, n_input], [n_classes]])
enqueue_op = queue.enqueue_many([x, y])
X_batch, Y_batch = queue.dequeue_many(batch)

sess = tf.Session()

def load_and_enqueue(data):
    # Background thread: keeps pushing batches into the queue.
    while True:
        X, Y = data.get_next_batch(batch)
        sess.run(enqueue_op, feed_dict={x: X, y: Y})

train_thread = threading.Thread(target=load_and_enqueue, args=(data,))
train_thread.daemon = True
train_thread.start()

for _ in xrange(max_iter):
    sess.run(train_op)
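To check whether the queue really stays full during training, here is a minimal diagnostic sketch (reusing the queue, sess, train_op, and max_iter names from the snippet above; the logging interval of 100 steps is an arbitrary choice). If the reported size stays near 100 * batch, the input pipeline is keeping up; if it hovers near the min_after_dequeue floor, the enqueue thread is too slow.

queue_size_op = queue.size()  # scalar op reporting how many elements are buffered

for i in xrange(max_iter):
    sess.run(train_op)
    if i % 100 == 0:
        print("step %d, queue size: %d" % (i, sess.run(queue_size_op)))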