Parallelism does not reduce the runtime of Dataset.map()

The TensorFlow Dataset.map() function supports concurrent calls, but I see no improvement from passing num_parallel_calls to map(). With num_parallel_calls=1 and num_parallel_calls=10, the runtime does not improve. Here is a simple code example:

    import time

    import tensorflow as tf


    def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                             batch_size=1, repeat=1,
                                             num_iterations=10):
        tf.reset_default_graph()
        start = time.time()
        dataset_x = tf.data.Dataset.range(1000).map(
            lambda x: tf.py_func(squarer, [x], [tf.int64]),
            num_parallel_calls=num_parallel_calls).repeat(repeat)
        if batch:
            dataset_x = dataset_x.batch(batch_size)
        dataset_y = tf.data.Dataset.range(1000).map(
            lambda x: tf.py_func(squarer, [x], [tf.int64]),
            num_parallel_calls=num_parallel_calls).repeat(repeat)
        if batch:
            dataset_y = dataset_y.batch(batch_size)
        X = dataset_x.make_one_shot_iterator().get_next()
        Y = dataset_y.make_one_shot_iterator().get_next()
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            i = 0
            while True:
                try:
                    res = sess.run([X, Y])
                    i += 1
                    if i == num_iterations:
                        break
                except tf.errors.OutOfRangeError:
                    break

Below are the timings:

    %timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=2, batch_size=2, batch=True)
    # 370 ms

    %timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=5, batch_size=2, batch=True)
    # 372 ms

    %timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=10, batch_size=2, batch=True)
    # 384 ms

I used %timeit in a Jupyter notebook. What am I doing wrong?

tensorflow tensorflow-datasets
1 answer

The problem is that the only operation in your Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter, running the function in the same process. Increasing num_parallel_calls increases the number of TensorFlow threads that try to call back into Python concurrently. However, Python has the "Global Interpreter Lock" (GIL), which prevents more than one thread from executing Python bytecode at a time. As a result, all but one of these concurrent calls will be blocked waiting to acquire the GIL, so there is almost no parallel speedup (and possibly even a slight slowdown).
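You can reproduce the GIL effect outside TensorFlow. The sketch below (a hypothetical stand-alone demo, not part of the question's code) runs a CPU-bound squarer across several Python threads: the results come out correct, but because only one thread can execute Python bytecode at a time, the wall-clock time barely improves over the sequential run.

```python
import threading
import time


def squarer(x):
    # CPU-bound pure-Python work: holds the GIL the whole time it runs.
    total = 0
    for _ in range(50_000):
        total += x * x
    return total // 50_000  # equals x * x


def run_sequential(values):
    return [squarer(v) for v in values]


def run_threaded(values, num_threads=4):
    results = [None] * len(values)

    def worker(indices):
        for i in indices:
            results[i] = squarer(values[i])

    # Split the indices round-robin across the threads.
    chunks = [range(t, len(values), num_threads) for t in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    values = list(range(100))

    start = time.time()
    seq = run_sequential(values)
    t_seq = time.time() - start

    start = time.time()
    par = run_threaded(values)
    t_par = time.time() - start

    assert seq == par == [v * v for v in values]
    # On CPython, t_par is usually close to (or worse than) t_seq,
    # because the GIL serializes the CPU-bound work.
    print(f"sequential: {t_seq:.3f}s, 4 threads: {t_par:.3f}s")
```

This mirrors what happens inside Dataset.map() with tf.py_func(): more threads, but the same single stream of Python execution.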

Your code example did not include the definition of the squarer() function, but you could replace tf.py_func() with pure TensorFlow operations, which are implemented in C++ and can run in parallel. For example, and just guessing based on the name, you could replace it with a call to tf.square(x), and then you might see some parallel speedup.

Note that if the amount of work in the function is small, such as squaring a single integer, the speedup may not be very large. Parallel Dataset.map() is more useful for heavier operations, such as parsing a TFRecord with tf.parse_single_example() or performing image distortions as part of a data augmentation pipeline.
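The contrast with ops that release the GIL can also be shown in plain Python. In the sketch below (hypothetical names, not TensorFlow code), time.sleep() stands in for a heavy native kernel that releases the GIL while it runs, much as TensorFlow's C++ ops do: with that kind of work, adding threads really does cut wall-clock time.

```python
import threading
import time


def gil_releasing_work(duration=0.1):
    # time.sleep() releases the GIL, standing in for a native C++ kernel
    # (e.g. image decoding) that does its work outside the interpreter.
    time.sleep(duration)


def run(num_tasks, num_threads):
    per_thread = num_tasks // num_threads
    threads = [
        threading.Thread(
            target=lambda: [gil_releasing_work() for _ in range(per_thread)])
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    start = time.time()
    run(num_tasks=8, num_threads=1)
    t_one = time.time() - start  # roughly 0.8 s: 8 sleeps back to back

    start = time.time()
    run(num_tasks=8, num_threads=4)
    t_four = time.time() - start  # roughly 0.2 s: sleeps overlap

    print(f"1 thread: {t_one:.2f}s, 4 threads: {t_four:.2f}s")
    assert t_four < t_one
```

This is the regime where num_parallel_calls pays off: the per-element work happens outside the GIL, so the map threads can genuinely overlap.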

