Why does this Python script run 4 times slower on multiple cores than on one core?

I am trying to understand how the CPython GIL works and what the differences are between the GIL in CPython 2.7.x and CPython 3.4.x. I am using this code for benchmarking:

    from __future__ import print_function
    import argparse
    import resource
    import sys
    import threading
    import time


    def countdown(n):
        while n > 0:
            n -= 1


    def get_time():
        stats = resource.getrusage(resource.RUSAGE_SELF)
        total_cpu_time = stats.ru_utime + stats.ru_stime
        return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


    def get_time_diff(start_time, end_time):
        return tuple((end - start) for start, end in zip(start_time, end_time))


    def main(total_cycles, max_threads, no_headers=False):
        header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
                  ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
                   "par_r", "par_c", "par_u", "par_s"))
        row_format = ("%(threads)4d "
                      "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                      "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
        if not no_headers:
            print(header)
        for thread_count in range(1, max_threads + 1):
            # We don't care about a few lost cycles
            cycles = total_cycles // thread_count

            # Sequential run: each thread is started and joined before the next.
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for i in range(thread_count)]
            start_time = get_time()
            for thread in threads:
                thread.start()
                thread.join()
            end_time = get_time()
            sequential = get_time_diff(start_time, end_time)

            # Parallel run: all threads are started first, then joined.
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for i in range(thread_count)]
            start_time = get_time()
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            end_time = get_time()
            parallel = get_time_diff(start_time, end_time)

            print(row_format % {"threads": thread_count,
                                "seq_r": sequential[0], "seq_c": sequential[1],
                                "seq_u": sequential[2], "seq_s": sequential[3],
                                "par_r": parallel[0], "par_c": parallel[1],
                                "par_u": parallel[2], "par_s": parallel[3]})


    if __name__ == "__main__":
        arg_parser = argparse.ArgumentParser()
        arg_parser.add_argument("max_threads", nargs="?", type=int, default=5)
        arg_parser.add_argument("total_cycles", nargs="?", type=int,
                                default=50000000)
        arg_parser.add_argument("--no-headers", action="store_true")
        args = arg_parser.parse_args()
        sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When I run this script on my i5-2500 quad-core processor under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r is real/wall-clock time, _c is total CPU time, _u is user-mode time, _s is kernel-mode time):

      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
       3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
       4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
       5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now, if I bind all threads to one core, the results will be different:

    $ taskset -c 0 python countdown.py
      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
       3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
       4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
       5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question arises: why is running this Python code on multiple cores 1.5x-2x slower by wall-clock time and 4x-5x slower by CPU time than running it on a single core?

From asking around and searching, I came up with two hypotheses:

  • When running on multiple cores, a thread can be rescheduled onto a different core, which invalidates the local cache and therefore slows things down.
  • The overhead of suspending a thread on one core and resuming it on another core is greater than suspending and resuming the thread on the same core.

Are there any other reasons? I would like to understand what is happening and to be able to back up my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).
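
One way to gather such numbers on Linux is to run the benchmark under perf stat (assuming the perf tool is installed; cache-misses, cpu-migrations and context-switches are perf's generic event names) and compare the unpinned and pinned runs:

    perf stat -e cache-misses,cpu-migrations,context-switches python countdown.py
    perf stat -e cache-misses,cpu-migrations,context-switches taskset -c 0 python countdown.py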

python multithreading cpython gil
2 answers

This is because of GIL thrashing when multiple native threads compete for the GIL. David Beazley's material on this subject will tell you everything you want to know.

See here for a nice graphical representation of what is happening.

Python 3.2 introduced changes to the GIL that help address this problem, so you should see improved performance with 3.2 and later.
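
The nature of the change is visible from Python itself: the old GIL attempted a release every N bytecode instructions, while the new GIL in 3.2+ hands the lock over on a timed interval. A minimal sketch (assuming CPython) for inspecting and tuning the relevant knob in each version:

    import sys

    if sys.version_info >= (3, 2):
        # New GIL: a waiting thread requests a handoff after a time interval.
        print(sys.getswitchinterval())   # default: 0.005 seconds
        sys.setswitchinterval(0.05)      # longer interval -> fewer handoffs
    else:
        # Old GIL: a release is attempted every N bytecode instructions.
        print(sys.getcheckinterval())    # default: 100
        sys.setcheckinterval(1000)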

It should also be noted that the GIL is an implementation detail of CPython, the reference implementation of the language. Other implementations, such as Jython, do not have a GIL and do not suffer from this particular problem.

David Beazley's other material on the GIL will also be useful to you.

To answer your question about why performance is so much worse when multiple cores are involved, see slides 29-41 of the Inside the Python GIL presentation. It goes into detail about GIL contention between multiple cores, as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to the GIL signaling overhead goes through the roof as you add cores. This is because the threads now run simultaneously on different cores, which allows them to engage in a true GIL battle, unlike multiple threads sharing a single CPU. A good summary bullet from that presentation is:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
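
You can watch this signaling traffic yourself (assuming Linux, where CPython's GIL handoff ultimately goes through futex system calls via pthread condition variables) by counting those calls with strace and comparing the unpinned and pinned runs:

    strace -f -c -e trace=futex python countdown.py --no-headers
    strace -f -c -e trace=futex taskset -c 0 python countdown.py --no-headers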


The GIL prevents multiple Python threads from executing simultaneously. This means that whenever a thread needs to execute Python bytecode (the interpreter's internal code representation), it has to acquire the lock, which effectively stalls the threads on the other cores. For this to work, the CPU has to flush its cache lines; otherwise the active thread would be working on stale data.

When you run the threads on the same CPU, no such cache flushing is necessary.

This should explain most of the slowdown. If you want to run Python code in parallel, you have to use processes and IPC (sockets, semaphores, memory-mapped I/O). But this can be slow for various reasons (memory has to be copied between processes).
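
As a minimal sketch of the process-based approach (reusing the countdown from the question with an arbitrary worker count; an illustration, not a drop-in replacement for the benchmark):

    from __future__ import print_function
    import time
    from multiprocessing import Pool

    def countdown(n):
        while n > 0:
            n -= 1

    if __name__ == "__main__":
        total_cycles, workers = 50000000, 4
        # Each worker is a separate process with its own interpreter and GIL,
        # so the countdowns really do run in parallel on multiple cores.
        pool = Pool(workers)
        start = time.time()
        pool.map(countdown, [total_cycles // workers] * workers)
        pool.close()
        pool.join()
        print("parallel countdown: %.2fs" % (time.time() - start))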

Another approach is to move more code into C libraries that do not hold the GIL while they run. That would allow more code to execute in parallel.
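
For instance, CPython's standard zlib module releases the GIL while the C compression routine runs, so threads doing this kind of work can genuinely overlap (a small demonstration; the data size is arbitrary):

    import threading
    import time
    import zlib

    data = b"x" * (32 * 1024 * 1024)   # 32 MiB of highly compressible input

    def compress():
        zlib.compress(data)

    for n in (1, 4):
        threads = [threading.Thread(target=compress) for _ in range(n)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Since the GIL is released inside the C code, 4 threads should take
        # much less than 4x the single-thread time on a multi-core machine.
        print("%d thread(s): %.2fs" % (n, time.time() - start))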

