Python multiprocessing code is slower with 32 cores than with 16 cores on AWS EC2

I do not understand why my calculations take longer when I use 28-30 cores than when I use 12-16 cores on an AWS EC2 c3.8xlarge. I ran some tests, and the results are shown in the table below:

https://www.dropbox.com/s/8u32jttxmkvnacd/Slika%20zaslona%202015-01-11%20u%2018.33.20.png?dl=0

The calculation is fastest when I use 13 cores. If I use the maximum number of cores, it takes about the same time as when I use 8 cores of the c3.8xlarge:

https://www.dropbox.com/s/gf3bevbi8dwk5vh/Slika%20zaslona%202015-01-11%20u%2018.32.53.png?dl=0

This is the simplified code that I use.

    import random
    import multiprocessing as mp
    import threading as th
    import numpy as np

    x=mp.Value('f',0)
    y=mp.Value('f',0)
    arr=[]
    tasks=[]
    nesto=[]

    def calculation2(some_array):
        global x, y, arr
        p=False
        a = np.sum(some_array)*random.random()
        b = a **(random.random())
        if a > x.value:
            x.value=a
            y.value=b
            arr=some_array
            p=True
        if p:
            return x.value, y.value, arr

    def calculation1(number_of_pool):
        global tasks
        pool=mp.Pool(number_of_pool)
        for i in range(1,500):
            some_array=np.random.randint(100, size=(1, 4))
            tasks+=[pool.apply_async(calculation2,args=(some_array,))]

    def exec_activator():
        global x, y, arr
        while tasks_gen.is_alive() or len(tasks)>0:
            try:
                task=tasks.pop(0)
                x.value, y.value, arr = task.get()
            except:
                pass

    def results(task_act):
        while task_act.is_alive():
            pass
        else:
            print x.value
            print y.value
            print arr

    tasks_gen=th.Thread(target=calculation1,args=(4,))
    task_act=th.Thread(target=exec_activator)
    result_print=th.Thread(target=results,args=(task_act,))

    tasks_gen.start()
    task_act.start()
    result_print.start()

The code consists of two calculations:

  • calculation1 - computes an array and creates a task for calculation2 with that array
  • calculation2 - performs some calculations on the array and compares the results

The purpose of the code is to find the array that produces the maximum x and to return its y. The two calculations run at the same time (in threads), because otherwise there are sometimes too many arrays, which take up too much RAM.

My goal is to make the calculation as fast as possible. I need advice on how to use all the cores, if that is possible.

Sorry for my bad English. If you need more information, please ask.

1 answer

The c3.8xlarge is a 16-core Ivy Bridge system. It uses hyper-threading, so its 32 vCPUs are not 32 independent hardware processing units.

It often makes no sense to parallelize CPU-bound tasks across more OS processes than you have CPUs in hardware. In fact, doing so is quite often harmful because of resource starvation and context switching (which is what you are seeing).
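As a minimal sketch of that advice (Python 3 syntax, unlike the question's Python 2 code), the pool size can be capped at an estimate of the physical core count. The helper name `estimated_physical_cores` and the stand-in task `cpu_bound` are hypothetical, and the default of two hardware threads per core is an assumption about the machine, not something the code detects:

```python
import multiprocessing as mp
import os


def estimated_physical_cores(threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    Assumes every core exposes `threads_per_core` hardware threads
    (2 with hyper-threading). This is an assumption, not a detection.
    """
    logical = os.cpu_count() or 1
    return max(1, logical // threads_per_core)


def cpu_bound(n):
    # Stand-in for a CPU-bound task: sum of squares below n.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # Use one worker process per estimated physical core.
    workers = estimated_physical_cores()
    with mp.Pool(processes=workers) as pool:
        results = pool.map(cpu_bound, [10_000] * 8)
    print(workers, len(results))
```

On a machine that reports 32 logical CPUs this would start 16 workers. Whether that is actually the fastest setting still has to be measured.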

It probably depends on your specific application, and experimenting will help you find the sweet spot (which it sounds like you have already done).
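Such an experiment could look roughly like the following sketch (Python 3 syntax). It times the same batch of CPU-bound tasks with different pool sizes and reports the fastest one. The function names, the stand-in workload, and the task counts are all illustrative assumptions; a real test should use the actual calculation:

```python
import multiprocessing as mp
import time


def cpu_bound(n):
    # Stand-in for the real CPU-bound work: sum of squares below n.
    return sum(i * i for i in range(n))


def time_pool(workers, n_tasks=64, task_size=200_000):
    """Wall-clock seconds to run n_tasks tasks on `workers` processes.

    The measurement includes pool start-up and shutdown.
    """
    start = time.perf_counter()
    with mp.Pool(processes=workers) as pool:
        pool.map(cpu_bound, [task_size] * n_tasks)
    return time.perf_counter() - start


def find_sweet_spot(candidates, timer=time_pool):
    """Time each candidate worker count; return (fastest, all timings)."""
    timings = {w: timer(w) for w in candidates}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = find_sweet_spot(range(1, mp.cpu_count() + 1))
    for w, t in sorted(timings.items()):
        print(f"{w:2d} workers: {t:.3f} s")
    print("fastest:", best)
```

Single runs are noisy, so repeating each measurement a few times and comparing the minimum or median per worker count gives a more reliable picture.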


Source: https://habr.com/ru/post/1210862/

