Fork optimization

What I want

I want to work on optimizing the fork / join algorithm. By optimization I mean simply calculating the optimal number of threads, or, if you prefer, calculating SEQUENTIAL_THRESHOLD (see the code below).

    // PSEUDOCODE
    Result solve(Problem problem) {
        if (problem.size < SEQUENTIAL_THRESHOLD)
            return solveSequentially(problem);
        else {
            Result left, right;
            INVOKE-IN-PARALLEL {
                left = solve(extractLeftHalf(problem));
                right = solve(extractRightHalf(problem));
            }
            return combine(left, right);
        }
    }
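For concreteness, the pseudocode above maps almost directly onto Java's ForkJoinPool and RecursiveTask. A minimal runnable sketch, summing an int array; the threshold value of 1000 here is an arbitrary placeholder, which is exactly the number I want to compute instead of guessing:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal, runnable version of the pseudocode: sums an int array.
// SEQUENTIAL_THRESHOLD = 1000 is an arbitrary placeholder value.
public class SumTask extends RecursiveTask<Long> {
    static final int SEQUENTIAL_THRESHOLD = 1000;
    private final int[] data;
    private final int start, end;

    SumTask(int[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Long compute() {
        if (end - start < SEQUENTIAL_THRESHOLD) {
            // solveSequentially(problem)
            long sum = 0;
            for (int i = start; i < end; i++) sum += data[i];
            return sum;
        }
        int mid = (start + end) >>> 1;
        SumTask left = new SumTask(data, start, mid);
        SumTask right = new SumTask(data, mid, end);
        left.fork();                       // INVOKE-IN-PARALLEL
        long rightResult = right.compute();
        return left.join() + rightResult;  // combine(left, right)
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // 100000
    }
}
```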

How do I imagine that

For example, suppose I want to calculate the product of a large array. Then I would simply evaluate all the components of a formula and get the optimal threshold:

SEQUENTIAL_THRESHOLD = PC * IS / MC (example only)

PC is the number of processor cores; IS is a constant giving the optimal array size for one processor core performing the simplest operation on the data (for example, a read); MC is the cost of the actual operation (here, multiplication).

Suppose that MC = 15, PC = 4, and IS = 10000; then SEQUENTIAL_THRESHOLD = 2667. Any subtask larger than 2667 elements would be split further; anything smaller would be computed sequentially.
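The example formula, written as code. `Runtime.availableProcessors()` stands in for PC; the IS and MC values are the hypothetical constants from the text. Note that plain integer division yields 2666 rather than the rounded 2667 above:

```java
// The asker's example formula as code. IS and MC are hypothetical
// constants from the text, not measured values.
public class ThresholdFormula {
    static long sequentialThreshold(int pc, long is, long mc) {
        return pc * is / mc; // integer division truncates
    }

    public static void main(String[] args) {
        int pc = Runtime.getRuntime().availableProcessors();
        System.out.println(sequentialThreshold(pc, 10_000, 15));
        // With PC = 4, IS = 10000, MC = 15 this prints 2666
        // (the text rounds up to 2667).
    }
}
```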

Broad questions

  • Is it possible to derive SEQUENTIAL_THRESHOLD from a formula in this way?
  • Is it possible to do the same for more complex calculations, not just operations on arrays / collections and sorting?

The narrow question is:

Are there already studies on calculating SEQUENTIAL_THRESHOLD for arrays / collections / sorting? How do they approach it?

Updated March 7, 2014:

  • If it is not possible to write a single formula for calculating the threshold, could I write a utility that runs predefined benchmarks on the target machine and derives the optimal threshold from the results? Or is that also impossible?
  • How does the Java 8 Streams API fit in? Can it help me? Does the Java 8 Streams API eliminate the need for fork / join?
java multithreading concurrency java-8 fork-join
3 answers

Absolutely, positively impossible to calculate a correct threshold unless you are intimately familiar with the execution environment. I maintain a fork / join project on sourceforge.net, and this is the code I use in most of the built-in functions:

    private int calcThreshold(int nbr_elements, int passed_threshold) {
        // total threads in session
        int threads = getNbrThreads();
        // total elements in array
        int count = nbr_elements + 1;
        // When there is only one thread, it doesn't pay to decompose the work:
        // force the threshold over the array length
        if (threads == 1)
            return count;
        int threshold = passed_threshold;
        if (threshold > 0) {
            // When the caller suggests a value, just go with the
            // caller's suggestion (or adjust it here)
        } else {
            // Otherwise aim for about 8 times as many tasks as threads,
            // with a default of 32k
            int temp = count / (threads << 3);
            threshold = (temp < 32768) ? 32768 : temp;
        }
        return threshold;
    }

Edit March 9th:

How can a general utility know not only the processor speed, available memory, number of processors, and so on (the physical environment), but also the intent of the software? The answer is that it cannot. That is why you need to develop a routine for each environment. The method above is what I use for basic arrays (vectors). I use another for matrix processing:

    // When very small, just spread every row
    if (count < 6)
        return 1;
    // When small, spread a little
    if (count < 30)
        return ((count / (threads << 2) == 0) ? threads : (count / (threads << 2)));
    // this works well for now
    return ((count / (threads << 3) == 0) ? threads : (count / (threads << 3)));

As for Java 8 streams: they use the F / J framework under the hood, and you cannot specify a threshold.
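To illustrate the point: in a parallel stream reduction like the one below, the splitting granularity is decided entirely by the framework; nothing in the public API lets you set a threshold. A minimal sketch computing a factorial:

```java
import java.util.stream.IntStream;

public class StreamExample {
    // Factorial via a parallel stream. The stream runs on the common
    // ForkJoinPool and picks its own split sizes; there is no knob
    // equivalent to SEQUENTIAL_THRESHOLD.
    static long factorial(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .asLongStream()
                .reduce(1L, (a, b) -> a * b);
    }

    public static void main(String[] args) {
        System.out.println(factorial(10)); // 3628800
    }
}
```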


You cannot boil this down to a simple formula, for several reasons:

  • Each PC will have very different parameters, depending not only on the CPU cores but also on other factors such as timing behavior or background tasks.

  • Java itself optimizes loops on the fly at run time. So a momentarily perfect setting may no longer be optimal a few seconds later. Or worse: the tuning itself can prevent the JIT from fully optimizing the code.

The only way I can see is to adjust the values dynamically with some form of AI or genetic algorithm. However, that means the program must regularly try suboptimal settings just to determine whether the current setting is still the best. So it is doubtful whether the speed gained would actually exceed the speed lost trying other settings. In the end, this is probably only viable as an initial training phase, with later executions using the learned values as fixed numbers.

Since this not only takes time but also significantly increases the complexity of the code, I do not think it is an option for most programs. It is often more beneficial not to use fork / join in the first place, as there are many other parallelization options that may be better suited to the problem.

The idea for a "genetic" algorithm would be to measure the loop efficiency on each run, keep a constantly updated hash map of loop parameters -> execution time in the background, and for most runs select the fastest setting seen so far.
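A minimal sketch of that idea, assuming a fixed candidate set and a simple probe-then-exploit policy; both the candidate thresholds and the policy here are illustrative choices, not a tested design:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "learning" idea: remember the best execution time measured
// for each candidate threshold, and keep picking the fastest one seen.
// Candidate values and probe policy are hypothetical.
public class ThresholdTuner {
    private final Map<Integer, Long> timings = new ConcurrentHashMap<>();
    private final int[] candidates = {100, 1_000, 10_000, 100_000};
    private int next = 0;

    // Record how long a run with the given threshold took.
    public void record(int threshold, long nanos) {
        timings.merge(threshold, nanos, Math::min);
    }

    // First try each untested candidate once, then exploit the best so far.
    public int pickThreshold() {
        if (next < candidates.length) return candidates[next++];
        return timings.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(candidates[0]);
    }
}
```

The caller would wrap each fork/join invocation with `System.nanoTime()`, feed the measurement into `record`, and ask `pickThreshold` before the next run.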


This is a very interesting problem to study. I wrote this simple code to test for the optimal value of the sequential threshold. I could not come to any definite conclusions, most likely because I am running it on an old dual-core laptop. The only consistent observation after many runs was that the time taken drops quickly and then stabilizes once the threshold reaches about 100. Try running this code and let me know what you find. At the bottom I have added a Python script to plot the results so that we can see the trend visually.

    import java.io.FileWriter;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    public class Testing {

        static int SEQ_THRESHOLD;

        public static void main(String[] args) throws Exception {
            int size = 100000;
            int[] v1 = new int[size];
            int[] v2 = new int[size];
            int[] v3 = new int[size];
            for (int i = 0; i < size; i++) {
                v1[i] = i;     // Arbitrary initialization
                v2[i] = 2 * i; // Arbitrary initialization
            }
            FileWriter fileWriter = new FileWriter("OutTime.dat");
            // Increment SEQ_THRESHOLD and record the average time per run
            for (SEQ_THRESHOLD = 10; SEQ_THRESHOLD < size; SEQ_THRESHOLD += 50) {
                double avgTime = 0.0;
                int samples = 5;
                for (int i = 0; i < samples; i++) {
                    long startTime = System.nanoTime();
                    ForkJoinPool fjp = new ForkJoinPool();
                    fjp.invoke(new VectorAddition(0, size, v1, v2, v3));
                    long endTime = System.nanoTime();
                    double secsTaken = (endTime - startTime) / 1.0e9;
                    avgTime += secsTaken;
                }
                fileWriter.write(SEQ_THRESHOLD + " " + (avgTime / samples) + "\n");
            }
            fileWriter.close();
        }
    }

    class VectorAddition extends RecursiveAction {

        int[] v1, v2, v3;
        int start, end;

        VectorAddition(int start, int end, int[] v1, int[] v2, int[] v3) {
            this.start = start;
            this.end = end;
            this.v1 = v1;
            this.v2 = v2;
            this.v3 = v3;
        }

        int SEQ_THRESHOLD = Testing.SEQ_THRESHOLD;

        @Override
        protected void compute() {
            if (end - start < SEQ_THRESHOLD) {
                // Simple vector addition
                for (int i = start; i < end; i++) {
                    v3[i] = v1[i] + v2[i];
                }
            } else {
                int mid = (start + end) / 2;
                invokeAll(new VectorAddition(start, mid, v1, v2, v3),
                          new VectorAddition(mid, end, v1, v2, v3));
            }
        }
    }

and here is the Python script to plot the results:

    from pylab import *

    threshold = loadtxt("./OutTime.dat", delimiter=" ", usecols=(0,))
    timeTaken = loadtxt("./OutTime.dat", delimiter=" ", usecols=(1,))
    plot(threshold, timeTaken)
    show()
