Multiprocessing in Python: Summing with numpy vectors → huge slowdown

Please do not despair at the long post. I have tried to provide as much data as possible, and I really need help solving this problem :S I will update daily as new tips or ideas come in.

Problem:

I am trying to run my Python code in parallel on a dual-core machine using separate processes (to avoid the GIL), but the code slows down dramatically. For example, a run on a single-core machine takes 600 seconds per workload, but a run on a dual-core machine takes 1600 seconds (800 seconds per workload).

What I already tried:

  • I measured the memory usage, and memory is not the problem [only about 20% at the peak].

  • I used "htop" to check whether I am really running the program on different cores, or whether the core affinity is broken. No problem there: my program runs on all of my cores.

  • It could have been a CPU throttling problem, so I checked and confirmed that my code runs at 100% CPU on all cores most of the time.

  • I checked the process IDs, and I am indeed creating two different processes.

  • I swapped the function I submit to the executor [e.submit(function, [...])] for a calculate-pi function and observed a huge speedup. So the problem is probably in my process_function(...) that I submit to the executor, not in the code before it.

  • I am currently using futures from concurrent.futures to parallelize the task. I also tried the Pool class from multiprocessing, but the result did not change.

Code:

  • Starting the processes:

        result = [None] * psutil.cpu_count()
        e = futures.ProcessPoolExecutor(max_workers=psutil.cpu_count())
        for i in range(psutil.cpu_count()):
            result[i] = e.submit(process_function, ...)
  • process_function:

        from math import floor
        from math import ceil
        import numpy
        import MySQLdb
        import time

        db = MySQLdb.connect(...)
        cursor = db.cursor()
        query = "SELECT ...."
        cursor.execute(query)

        [...] #save db results into the variable db_matrix (30 columns, 5.000 rows)
        [...] #save db results into the variable bp_vector (3 columns, 500 rows)
        [...] #save db results into the variable option_vector (3 columns, 4000 rows)

        cursor.close()
        db.close()

        counter = 0
        for i in range(4000):
            for j in range(500):
                helper[:] = ((1 - bp_vector[j,0] - bp_vector[j,1] - bp_vector[j,2]) * db_matrix[:,0]
                             + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                             + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                             + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])
                result[counter,0] = (helper < -7.55).sum()
                counter = counter + 1
        return result
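For reference, here is how the submit-and-collect pattern above looks as a complete, runnable sketch. The worker below is a trivial stand-in (it just sums a list); the real process_function queries MySQL and runs the double loop:

```python
from concurrent.futures import ProcessPoolExecutor

def process_function(chunk):
    # Trivial stand-in for the real worker, which queries MySQL
    # and runs the nested loop over the matrices.
    return sum(chunk)

if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5, 6]]
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(process_function, c) for c in chunks]
        # .result() blocks until the corresponding worker has finished
        results = [f.result() for f in futures]
    print(results)
```

The with block also shuts the executor down at the end, which my original snippet does not do explicitly.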

My suggestion:

  • My guess is that, for some reason, the weighted vector multiplication that creates the vector helper is causing the problem. [I believe the time measurements below support this assumption.]

  • Could numpy be creating these problems? Is numpy compatible with multiprocessing? If not, what can I do? [Already answered in the comments]

  • Could this be due to the CPU cache? I read about it on the forum, but honestly I did not really understand it. If the problem is rooted there, though, I would familiarize myself with the topic.

Time measurements: (edit)

  • Single core: time to get the data from the DB: 8 seconds.

  • Two cores: time to get the data from the DB: 12 seconds.

  • Single core: runtime of the double loop in process_function: ~640 seconds.

  • Two cores: runtime of the double loop in process_function: ~1600 seconds.

Update: (edit)

When I print the elapsed time every 100 iterations of i with two processes running, I see that it is roughly 220% of the time I observe when I measure the same thing running in only one process. But even more mysteriously: if I kill one process while they are running, the other process speeds up! It then accelerates to the same level it has during a solo run. So there must be some dependency between the processes that I just do not see at the moment :S

Update 2: (edit)

So I did some more test runs and measurements. For the test runs, I used either a single-core Linux machine (n1-standard-1, 1 vCPU, 3.75 GB of memory) or a dual-core Linux machine (n1-standard-2, 2 vCPUs, 7.5 GB of memory) from Google Compute Engine as the compute instance. I also ran the tests on my local computer and observed roughly the same results. (-> so the virtualized environment should be fine.) Here are the results:

PS: The times here differ from the measurements above because I limited the loop a bit and tested on Google Cloud instead of my home computer.

1-core machine, 1 process started:

time: 225 seconds, CPU load: ~100%

1-core machine, 2 processes started:

time: 557 seconds, CPU load: ~100%

1-core machine, 1 process started, limited to a max. CPU usage of 50%:

time: 488 seconds, CPU load: ~50%

.

2-core machine, 2 processes started:

time: 665 seconds, CPU-1 load: ~100%, CPU-2 load: ~100%

the processes did not jump between the cores; each used one core

(at least that is what htop displayed in its processor column)

2-core machine, 1 process started:

time: 222 seconds, CPU-1 load: ~100% (0%), CPU-2 load: ~0% (100%)

however, the process sometimes jumped between the cores

2-core machine, 1 process started, limited to a max. CPU usage of 50%:

time: 493 seconds, CPU-1 load: ~50% (0%), CPU-2 load: ~0% (100%)

however, the process jumped between the cores very often

I used htop and Python's time module to obtain these results.

Update 3: (edit)

I used cProfile to profile my code:

 python -m cProfile -s cumtime fun_name.py 

The files are too long to post here in full, but I believe that if they contain any valuable information at all, it is probably in the first lines of the output. So I will post only the top lines of the results here:
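As a side note, the same top-lines view can be produced programmatically with pstats, which avoids posting the whole file. This is a self-contained sketch with a trivial stand-in workload (busy_sum is not part of my code):

```python
import cProfile
import io
import pstats

def busy_sum(n):
    # Stand-in workload so the profile has something to report
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
busy_sum(200000)
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats("cumulative").print_stats(10)  # keep only the top 10 rows
report = buf.getvalue()
print(report)
```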

1-core machine, 1 process started:

         623158 function calls (622735 primitive calls) in 229.286 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.371    0.371  229.287  229.287 20_with_multiprocessing.py:1(<module>)
            3    0.000    0.000  225.082   75.027 threading.py:309(wait)
            1    0.000    0.000  225.082  225.082 _base.py:378(result)
           25  225.082    9.003  225.082    9.003 {method 'acquire' of 'thread.lock' objects}
            1    0.598    0.598    3.081    3.081 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.877    0.959 cursors.py:164(execute)
            3    0.000    0.000    2.877    0.959 cursors.py:353(_query)
            3    0.000    0.000    1.958    0.653 cursors.py:315(_do_query)
            3    0.000    0.000    1.943    0.648 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.943    0.648 cursors.py:351(_get_result)
            3    1.943    0.648    1.943    0.648 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    0.919    0.306 cursors.py:358(_post_get_result)
            3    0.000    0.000    0.917    0.306 cursors.py:324(_fetch_row)
            3    0.917    0.306    0.917    0.306 {built-in method fetch_row}
       591314    0.161    0.000    0.161    0.000 {range}

1-core machine, 2 processes started:

         626052 function calls (625616 primitive calls) in 578.086 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.310    0.310  578.087  578.087 20_with_multiprocessing.py:1(<module>)
           30  574.310   19.144  574.310   19.144 {method 'acquire' of 'thread.lock' objects}
            2    0.000    0.000  574.310  287.155 _base.py:378(result)
            3    0.000    0.000  574.310  191.437 threading.py:309(wait)
            1    0.544    0.544    2.854    2.854 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.563    0.854 cursors.py:164(execute)
            3    0.000    0.000    2.563    0.854 cursors.py:353(_query)
            3    0.000    0.000    1.715    0.572 cursors.py:315(_do_query)
            3    0.000    0.000    1.701    0.567 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.701    0.567 cursors.py:351(_get_result)
            3    1.701    0.567    1.701    0.567 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    0.848    0.283 cursors.py:358(_post_get_result)
            3    0.000    0.000    0.847    0.282 cursors.py:324(_fetch_row)
            3    0.847    0.282    0.847    0.282 {built-in method fetch_row}
       591343    0.152    0.000    0.152    0.000 {range}

.

2-core machine, 1 process started:

         623164 function calls (622741 primitive calls) in 235.954 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.246    0.246  235.955  235.955 20_with_multiprocessing.py:1(<module>)
            3    0.000    0.000  232.003   77.334 threading.py:309(wait)
           25  232.003    9.280  232.003    9.280 {method 'acquire' of 'thread.lock' objects}
            1    0.000    0.000  232.003  232.003 _base.py:378(result)
            1    0.593    0.593    3.104    3.104 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.774    0.925 cursors.py:164(execute)
            3    0.000    0.000    2.774    0.925 cursors.py:353(_query)
            3    0.000    0.000    1.981    0.660 cursors.py:315(_do_query)
            3    0.000    0.000    1.970    0.657 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.969    0.656 cursors.py:351(_get_result)
            3    1.969    0.656    1.969    0.656 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    0.794    0.265 cursors.py:358(_post_get_result)
            3    0.000    0.000    0.792    0.264 cursors.py:324(_fetch_row)
            3    0.792    0.264    0.792    0.264 {built-in method fetch_row}
       591314    0.144    0.000    0.144    0.000 {range}

2-core machine, 2 processes started:

         626072 function calls (625636 primitive calls) in 682.460 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.334    0.334  682.461  682.461 20_with_multiprocessing.py:1(<module>)
            4    0.000    0.000  678.231  169.558 threading.py:309(wait)
           33  678.230   20.552  678.230   20.552 {method 'acquire' of 'thread.lock' objects}
            2    0.000    0.000  678.230  339.115 _base.py:378(result)
            1    0.527    0.527    2.974    2.974 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.723    0.908 cursors.py:164(execute)
            3    0.000    0.000    2.723    0.908 cursors.py:353(_query)
            3    0.000    0.000    1.749    0.583 cursors.py:315(_do_query)
            3    0.000    0.000    1.736    0.579 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.736    0.579 cursors.py:351(_get_result)
            3    1.736    0.579    1.736    0.579 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    0.975    0.325 cursors.py:358(_post_get_result)
            3    0.000    0.000    0.973    0.324 cursors.py:324(_fetch_row)
            3    0.973    0.324    0.973    0.324 {built-in method fetch_row}
            5    0.093    0.019    0.304    0.061 __init__.py:1(<module>)
            1    0.017    0.017    0.275    0.275 __init__.py:106(<module>)
            1    0.005    0.005    0.198    0.198 add_newdocs.py:10(<module>)
       591343    0.148    0.000    0.148    0.000 {range}

I personally do not know what to make of these results. I would be glad to receive tips, tricks or any other help. Thanks :)

Reply to answer-1: (edit)

Roland Smith looked at the data and suggested that multiprocessing may hurt performance more than it helps. Therefore I made another measurement without multiprocessing (using the code he proposed):

Did I conclude correctly that this is not the case here, since the measured times look similar to the times measured with multiprocessing before?

1-core machine:

Database access took 2.53 seconds

Matrix manipulation took 236.71 seconds

        1842384 function calls (1841974 primitive calls) in 241.114 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1  219.036  219.036  241.115  241.115 20_with_multiprocessing.py:1(<module>)
       406000    0.873    0.000   18.097    0.000 {method 'sum' of 'numpy.ndarray' objects}
       406000    0.502    0.000   17.224    0.000 _methods.py:31(_sum)
       406001   16.722    0.000   16.722    0.000 {method 'reduce' of 'numpy.ufunc' objects}
            1    0.587    0.587    3.222    3.222 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.964    0.988 cursors.py:164(execute)
            3    0.000    0.000    2.964    0.988 cursors.py:353(_query)
            3    0.000    0.000    1.958    0.653 cursors.py:315(_do_query)
            3    0.000    0.000    1.944    0.648 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.944    0.648 cursors.py:351(_get_result)
            3    1.944    0.648    1.944    0.648 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    1.006    0.335 cursors.py:358(_post_get_result)
            3    0.000    0.000    1.005    0.335 cursors.py:324(_fetch_row)
            3    1.005    0.335    1.005    0.335 {built-in method fetch_row}
       591285    0.158    0.000    0.158    0.000 {range}

2-core machine:

Database access took 2.32 seconds

Matrix manipulation took 242.45 seconds

        1842390 function calls (1841980 primitive calls) in 246.535 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1  224.705  224.705  246.536  246.536 20_with_multiprocessing.py:1(<module>)
       406000    0.911    0.000   17.971    0.000 {method 'sum' of 'numpy.ndarray' objects}
       406000    0.526    0.000   17.060    0.000 _methods.py:31(_sum)
       406001   16.534    0.000   16.534    0.000 {method 'reduce' of 'numpy.ufunc' objects}
            1    0.617    0.617    3.113    3.113 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
            3    0.000    0.000    2.789    0.930 cursors.py:164(execute)
            3    0.000    0.000    2.789    0.930 cursors.py:353(_query)
            3    0.000    0.000    1.938    0.646 cursors.py:315(_do_query)
            3    0.000    0.000    1.920    0.640 cursors.py:142(_do_get_result)
            3    0.000    0.000    1.920    0.640 cursors.py:351(_get_result)
            3    1.920    0.640    1.920    0.640 {method 'store_result' of '_mysql.connection' objects}
            3    0.001    0.000    0.851    0.284 cursors.py:358(_post_get_result)
            3    0.000    0.000    0.849    0.283 cursors.py:324(_fetch_row)
            3    0.849    0.283    0.849    0.283 {built-in method fetch_row}
       591285    0.160    0.000    0.160    0.000 {range}
1 answer

Your program seems to spend most of its time acquiring locks. That seems to indicate that, in your case, multiprocessing hurts more than it helps.

Remove all the multiprocessing machinery and measure how long it takes without it. For example, like this:

        from math import floor
        from math import ceil
        import numpy
        import MySQLdb
        import time

        start = time.clock()
        db = MySQLdb.connect(...)
        cursor = db.cursor()
        query = "SELECT ...."
        cursor.execute(query)
        stop = time.clock()
        print "Database access took {:.2f} seconds".format(stop - start)

        start = time.clock()
        [...] #save db results into the variable db_matrix (30 columns, 5.000 rows)
        [...] #save db results into the variable bp_vector (3 columns, 500 rows)
        [...] #save db results into the variable option_vector (3 columns, 4000 rows)
        stop = time.clock()
        print "Creating matrices took {:.2f} seconds".format(stop - start)
        cursor.close()
        db.close()

        counter = 0
        start = time.clock()
        for i in range(4000):
            for j in range(500):
                helper[:] = ((1 - bp_vector[j,0] - bp_vector[j,1] - bp_vector[j,2]) * db_matrix[:,0]
                             + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                             + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                             + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])
                result[counter,0] = (helper < -7.55).sum()
                counter = counter + 1
        stop = time.clock()
        print "Matrix manipulation took {:.2f} seconds".format(stop - start)

Edit-1

Based on your measurements, I stand by my conclusion (in slightly paraphrased form) that on a multi-core machine, using multiprocessing as you are doing now greatly degrades your performance. On a dual-core machine, the program with multiprocessing takes much longer than without it!

That there is no difference between using multiprocessing or not on a single-core machine is not very relevant, I think. A single-core machine will not see much benefit from multiprocessing anyway.

The new measurements show that most of the time is spent manipulating the matrices. This is logical, since you are using an explicit nested for loop, which is not very fast.

There are basically four possible solutions:

First, rewrite the nested loop in numpy operations. Numpy operations have implicit loops (written in C) instead of explicit loops in Python and are therefore faster. (A rare case where explicit is worse than implicit. ;-) ) The downside is that this will probably use a significant amount of memory.
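To illustrate the first option, here is a self-contained sketch with synthetic stand-in data (your real matrices come from MySQL; the sizes are shrunk here). It vectorizes the inner j loop into one matrix product per i and checks the result against the original loop; vectorizing the outer loop as well is possible but costs even more memory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the data loaded from MySQL; sizes shrunk for the demo
db_matrix = rng.normal(scale=5.0, size=(50, 30))   # real shape: (5000, 30)
bp_vector = rng.random(size=(5, 3))                # real shape: (500, 3)
option_vector = rng.integers(1, 30, size=(8, 3))   # real shape: (4000, 3)
threshold = -7.55

# Original nested-loop version
result_loop = np.empty(len(option_vector) * len(bp_vector))
counter = 0
for i in range(len(option_vector)):
    for j in range(len(bp_vector)):
        helper = ((1 - bp_vector[j, 0] - bp_vector[j, 1] - bp_vector[j, 2]) * db_matrix[:, 0]
                  + db_matrix[:, option_vector[i, 0]] * bp_vector[j, 0]
                  + db_matrix[:, option_vector[i, 1]] * bp_vector[j, 1]
                  + db_matrix[:, option_vector[i, 2]] * bp_vector[j, 2])
        result_loop[counter] = (helper < threshold).sum()
        counter += 1

# Vectorized version: the whole j loop becomes one matrix product per i
base = 1 - bp_vector.sum(axis=1)                   # shape (n_bp,)
n_bp = len(bp_vector)
result_vec = np.empty_like(result_loop)
for i in range(len(option_vector)):
    cols = db_matrix[:, option_vector[i]]          # shape (n_rows, 3)
    # helper_all[:, j] equals the loop version's helper for weight row j
    helper_all = np.outer(db_matrix[:, 0], base) + cols @ bp_vector.T
    result_vec[i * n_bp:(i + 1) * n_bp] = (helper_all < threshold).sum(axis=0)

print(np.array_equal(result_loop, result_vec))
```

The temporary helper_all array has shape (rows, weight rows), which is where the extra memory goes.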

The second option is to split up the calculation of helper, which consists of four parts. Run each part in a separate process and add the results together at the end. This incurs some overhead: each process has to fetch all the data from the database and transfer its partial result back to the main process (perhaps also through the database?).

A third option would be to use pypy instead of CPython. It can be significantly faster.

A fourth option would be to rewrite the critical matrix manipulations in Cython or C.
