You should update your split_matrix method, since it returns one range fewer than you want (setting cpu_cnt=4 returns only 3 tuples, not 4):
def split_matrix(k, n):
    split_points = [round(i * k / n) for i in range(n + 1)]
    return [(split_points[i], split_points[i + 1]) for i in range(len(split_points) - 1)]
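As a quick check, here is a small sketch that runs the function with k=10 and n=4 (values picked only for illustration). It repeats the definition so the snippet runs on its own:

```python
def split_matrix(k, n):
    # n + 1 split points produce exactly n (start, end) ranges.
    split_points = [round(i * k / n) for i in range(n + 1)]
    return [(split_points[i], split_points[i + 1])
            for i in range(len(split_points) - 1)]


ranges = split_matrix(10, 4)
print(len(ranges))  # 4
print(ranges)       # [(0, 2), (2, 5), (5, 8), (8, 10)]
```

The ranges are half-open and contiguous, so every index from 0 to k - 1 lands in exactly one of them. Note that Python's round() rounds halves to the nearest even number, which is why 2.5 becomes 2 here.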
Edit: if your data locality is not that linear (the work does not split cleanly into contiguous ranges), you can try this: create a task queue and add every index/record for which the calculation should be performed. Then initialize your parallel workers (for example, using multiprocessing) and start them. Each worker takes an element from the queue, calculates the result, saves it (for example, in another queue), and continues with the next element, and so on.
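A minimal sketch of that queue approach could look like the following. The names compute, worker, and run_parallel are made up for illustration, and compute is only a placeholder for your real per-index calculation:

```python
import multiprocessing as mp


def compute(index):
    # Hypothetical placeholder: replace with the real calculation.
    return index * index


def worker(tasks, results):
    # Take indexes from the task queue until the None sentinel arrives.
    for index in iter(tasks.get, None):
        results.put((index, compute(index)))


def run_parallel(indexes, cpu_cnt=4):
    indexes = list(indexes)
    tasks, results = mp.Queue(), mp.Queue()
    for index in indexes:
        tasks.put(index)
    for _ in range(cpu_cnt):
        tasks.put(None)  # one sentinel per worker so each one stops

    procs = [mp.Process(target=worker, args=(tasks, results))
             for _ in range(cpu_cnt)]
    for proc in procs:
        proc.start()

    # Read all results before joining, so no worker blocks on a full queue.
    collected = dict(results.get() for _ in indexes)
    for proc in procs:
        proc.join()
    return collected
```

You would call it as run_parallel(range(10)), from under an `if __name__ == "__main__":` guard, since multiprocessing needs that on platforms that spawn new interpreters instead of forking. The sketch has no error handling: if compute raises inside a worker, the parent will wait forever for the missing result.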
If this does not work for your data, I do not think you can improve it.
Flashtek