The problem here is that each session.run call has a fixed overhead cost, so filling the queue with many small examples one at a time will be slow.
In particular, each session.run takes about 100-200 usec, so you can only make roughly 5k-10k session.run calls per second.
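To make that overhead concrete, here is a minimal micro-benchmark (my own sketch against the TF 1.x graph-mode API, not part of the original benchmark) that times tiny session.run calls:

import time
import tensorflow as tf

# Fetch a trivial constant so the measured time is almost entirely per-call dispatch overhead.
x = tf.constant(1.0)
with tf.Session() as sess:
    n = 10000
    start = time.time()
    for _ in range(n):
        sess.run(x)
    elapsed = time.time() - start
    print("%.1f usec per session.run, ~%d calls/sec" % (elapsed / n * 1e6, n / elapsed))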
This problem is obvious if you profile the Python side (python -m cProfile), but it is hard to see if you start from a timeline profile or a CPU profile.
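For reference, a typical invocation would be something like the following (the script name is just a placeholder):

python -m cProfile -s cumtime your_benchmark.py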
A work-around is to use enqueue_many to add things to the queue in batches. I took your benchmark from https://gist.github.com/ericyue/7705407a88e643f7ab380c6658f641e8 and modified it to enqueue many elements per .run call, which gives a speed-up.
The modification is to change the tf.train.shuffle_batch call as follows:
if enqueue_many:
    reader = tf.TFRecordReader(options=tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB))
    queue_batch = []
    for i in range(enqueue_many_size):
        _, serialized_example = reader.read(filename_queue)
        queue_batch.append(serialized_example)
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)
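With enqueue_many=True, shuffle_batch treats the first dimension of the input as the example dimension, so the whole list of enqueue_many_size serialized records is pushed into the queue by a single enqueue op. This amortizes the fixed per-run overhead over many examples instead of paying it once per record.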
For the full source, check here: https://github.com/yaroslavvb/stuff/blob/master/ericyue-slowreader/benchmark.py
It is hard to make it go much faster, since most of the time is now spent on queue operations. Looking at a stripped-down version that simply adds integers to a queue, you get a similar speed, and looking at the timeline, the time is spent on the dequeue ops.
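A rough sketch of such a stripped-down test might look like the following (my own reconstruction, not the author's code; the refill interval and sizes are arbitrary):

import time
import tensorflow as tf

# A queue of scalar int64 values; dequeue one element per session.run call.
queue = tf.FIFOQueue(capacity=10000, dtypes=[tf.int64], shapes=[[]])
refill = queue.enqueue_many([tf.range(1000, dtype=tf.int64)])
dequeue = queue.dequeue()

with tf.Session() as sess:
    n = 5000
    start = time.time()
    for i in range(n):
        if i % 1000 == 0:
            sess.run(refill)    # bulk refill so dequeue never blocks
        sess.run(dequeue)       # the per-element queue cost we are measuring
    elapsed = time.time() - start
    print("~%d dequeues/sec" % (n / elapsed))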

Each dequeue op takes about 60 usec, but on average 5 of them run in parallel, so the effective cost is about 12 usec per dequeue. That means you will get fewer than 200,000 examples per second at best.
Yaroslav Bulatov