Strange behavior when scaling processing across multiple processors

I am studying how Java code performance scales across many processors. To do this, I wrote a simple program that runs 50,000 Fibonacci computations on one thread, then 2 * 50,000 on two threads, 3 * 50,000 on three threads, and so on, until the number of CPUs of the target node is reached.

Here is my code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MultiThreadScalability {

    static final int MAX_THREADS = 4;
    static final int NB_RUN_PER_THREAD = 50000;
    static final int FIBO_VALUE = 25;

    public static void main(String[] args) {
        MultiThreadScalability multiThreadScalability = new MultiThreadScalability();
        multiThreadScalability.runTest();
    }


    private void runTest() {
        int availableProcs = Runtime.getRuntime().availableProcessors();
        System.out.println(availableProcs + " processors available");

        for (int i = 1 ; i <= availableProcs ; i++) {
            System.out.println("Running scalability test for " + i + " threads");
            long timeInMillisecs = runTestForThreads(i);
            System.out.println("=> " + timeInMillisecs + " milli-seconds");
        }
    }


    private long runTestForThreads(int threadsNumber) {
        final int nbRun = NB_RUN_PER_THREAD * threadsNumber;
        ExecutorService executor = Executors.newFixedThreadPool(threadsNumber);

        long startTime = System.currentTimeMillis();

        for (int i = 0 ; i < nbRun ; i++) {
            Runnable worker = new Runnable()
            {
                public void run()
                {
                    fibo(FIBO_VALUE);
                }
            };

            executor.execute(worker);
        }

        executor.shutdown();

        // Busy-wait until all submitted tasks have finished
        // (executor.awaitTermination would be a cleaner way to block here)
        while (!executor.isTerminated())
        {}

        return (System.currentTimeMillis() - startTime);
    }


    // Naive recursive Fibonacci: pure CPU work, no I/O and no shared state
    private static long fibo(int n) {
        if (n < 2) {
            return (n);
        }

        return (fibo(n - 1) + fibo(n - 2));
    }

}

Under these conditions, I expected the runtime to remain constant regardless of the number of threads.

I ran it on a node with 48 logical CPUs, and I got the following output:

48 processors available
Running scalability test for 1 threads
=> 34199 milli-seconds
Running scalability test for 2 threads
=> 34141 milli-seconds
Running scalability test for 3 threads
=> 34009 milli-seconds
Running scalability test for 4 threads
=> 34000 milli-seconds
Running scalability test for 5 threads
=> 34034 milli-seconds
Running scalability test for 6 threads
=> 34086 milli-seconds
Running scalability test for 7 threads
=> 34094 milli-seconds
Running scalability test for 8 threads
=> 34673 milli-seconds
Running scalability test for 9 threads
=> 35297 milli-seconds
Running scalability test for 10 threads
=> 35486 milli-seconds
Running scalability test for 11 threads
=> 35913 milli-seconds
Running scalability test for 12 threads
=> 36324 milli-seconds
Running scalability test for 13 threads
=> 35722 milli-seconds
Running scalability test for 14 threads
=> 35750 milli-seconds
Running scalability test for 15 threads
=> 35634 milli-seconds
Running scalability test for 16 threads
=> 35970 milli-seconds
Running scalability test for 17 threads
=> 37914 milli-seconds
Running scalability test for 18 threads
=> 36560 milli-seconds
Running scalability test for 19 threads
=> 36720 milli-seconds
Running scalability test for 20 threads
=> 37028 milli-seconds
Running scalability test for 21 threads
=> 37381 milli-seconds
Running scalability test for 22 threads
=> 37529 milli-seconds
Running scalability test for 23 threads
=> 37632 milli-seconds
Running scalability test for 24 threads
=> 39942 milli-seconds
Running scalability test for 25 threads
=> 40090 milli-seconds
Running scalability test for 26 threads
=> 41238 milli-seconds
Running scalability test for 27 threads
=> 42336 milli-seconds
Running scalability test for 28 threads
=> 43377 milli-seconds
Running scalability test for 29 threads
=> 44394 milli-seconds
Running scalability test for 30 threads
=> 46245 milli-seconds
Running scalability test for 31 threads
=> 45928 milli-seconds
Running scalability test for 32 threads
=> 47490 milli-seconds
Running scalability test for 33 threads
=> 47674 milli-seconds
Running scalability test for 34 threads
=> 48775 milli-seconds
Running scalability test for 35 threads
=> 56456 milli-seconds
Running scalability test for 36 threads
=> 50557 milli-seconds
Running scalability test for 37 threads
=> 51393 milli-seconds
Running scalability test for 38 threads
=> 52971 milli-seconds
Running scalability test for 39 threads
=> 53077 milli-seconds
Running scalability test for 40 threads
=> 54015 milli-seconds
Running scalability test for 41 threads
=> 55924 milli-seconds
Running scalability test for 42 threads
=> 55560 milli-seconds
Running scalability test for 43 threads
=> 56554 milli-seconds
Running scalability test for 44 threads
=> 57073 milli-seconds
Running scalability test for 45 threads
=> 65193 milli-seconds
Running scalability test for 46 threads
=> 58549 milli-seconds
Running scalability test for 47 threads
=> 59302 milli-seconds
Running scalability test for 48 threads
=> 60662 milli-seconds

The time stays roughly constant up to about 24 threads; beyond that it gets slower and slower. You can see it on this graph.

I would appreciate help understanding why such a “gap” occurs.

In case it helps, here is the CPU information of the node:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 46
model name      : Intel(R) Xeon(R) CPU           E7540  @ 2.00GHz
stepping        : 6
cpu MHz         : 1997.885
cache size      : 18432 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips        : 3995.77
clflush size    : 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management: [8]

As you can see, each CPU has 6 cores. Runtime.getRuntime().availableProcessors() does not return the number of physical CPUs but the number of "logical" ones: 48.

, "", 24 ?

+4

You have 4 Intel E7540 CPUs, each with 6 cores and 12 hardware threads, so 4 * 6 = 24 physical cores and 24 * 2 = 48 hardware threads in total. The break you see at 24 threads matches that.

The 48 "processors" come from hyperthreading: the second hardware thread on a core is not a full extra core, it only uses execution units that the first thread leaves idle. So there are really only 24 cores' worth of raw compute.

Beyond 24 threads you are no longer adding cores, only sharing them, which is the slowdown you observe.

+4

Ad-hoc benchmarks like this one are tricky. A few remarks:

  • How do you launch the java process, and with which JVM and options?
  • The task is purely CPU-bound, always computes the same value, fibo(25), and throws the result away. A sufficiently clever JVM could optimise part of that work away, so you may not be measuring what you think you are measuring. Make the work depend on something it cannot predict, for example an input derived from scrypt over /dev/urandom, or at least consume the results (see the sketch after this list).
  • How many measurements do you take per thread count? A single run tells you very little; repeat each measurement 10 or 20 times and look at the spread (min, max, mean, median, standard deviation).
  • Hyperthreading! Two logical processors on the same core share its ALUs, caches and other execution resources (on Intel, a hyperthread is not a second full core), so do not expect them to double the throughput.
  • Finally, the JVM and the OS have work of their own going on in the background (GC, JIT compilation, other processes), and that shows up in the timings too.
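
To make the last points concrete, here is a minimal sketch of such a hardened variant (the MeasuredScalability class, the REPEATS constant and the AtomicLong sink are my own choices, not taken from the question): every fibo() result is accumulated and printed so the JIT cannot treat the calls as dead code, each thread count is measured several times, and the busy-wait is replaced by awaitTermination:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class MeasuredScalability {

    static final int REPEATS = 10;              // measurements per thread count
    static final int NB_RUN_PER_THREAD = 50000;
    static final int FIBO_VALUE = 25;

    public static void main(String[] args) throws InterruptedException {
        int procs = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= procs; threads++) {
            for (int rep = 1; rep <= REPEATS; rep++) {
                System.out.println(threads + " threads, run " + rep + ": "
                        + runOnce(threads) + " ms");
            }
        }
    }

    private static long runOnce(int threads) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        final AtomicLong sink = new AtomicLong();   // consumes every result
        long start = System.currentTimeMillis();

        for (int i = 0; i < NB_RUN_PER_THREAD * threads; i++) {
            executor.execute(() -> sink.addAndGet(fibo(FIBO_VALUE)));
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);   // block instead of spinning

        long elapsed = System.currentTimeMillis() - start;
        System.out.println("  checksum " + sink.get()); // keeps the work observable
        return elapsed;
    }

    private static long fibo(int n) {
        return n < 2 ? n : fibo(n - 1) + fibo(n - 2);
    }
}

With the checksum printed and several runs per thread count, you can at least see whether the numbers are stable before drawing conclusions about scalability.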
+1

The fibo computation is pure CPU work: the threads never block on I/O or locks, so they never give up the CPU voluntarily.

Once there are more runnable threads than cores able to execute them, the threads simply compete for the same hardware.

So adding threads beyond the number of physical cores does not add computing power, it only adds contention for the CPU.

0

A physical core executes one thread at a time; with hyperthreading it presents itself as two "logical" processors, but those two hardware threads share the core.

Your machine has 24 physical cores and 48 hardware threads (you can check this with lscpu).

Up to 24 threads, every software thread can get a physical core to itself.

Beyond that, threads have to share cores.

Each hardware thread has its own register state, but the two siblings share the core's execution units and L1 cache, so two compute-heavy threads on the same core slow each other down.

Typically, hyperthreading only gives an advantage if the threads spend a good part of their time waiting on memory, so that the second hardware thread can use execution units that would otherwise sit idle.

The OS tries to avoid putting load on a core's hyperthread sibling as long as there are still whole idle cores it can use instead.

In your example the processor is fully loaded with compute, and as soon as the OS has to schedule two software threads on the same physical core, those threads fight over the same execution units, which is exactly where your scalability breaks down.
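
If you want to see that sharing from code, a small sketch along these lines works on Linux (it only reads sysfs; the SiblingList class name is made up for the example) and prints, for each physical core, the logical CPUs that are hyperthread siblings on it:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

public class SiblingList {

    public static void main(String[] args) throws IOException {
        // Each logical CPU lists the hardware threads that share its
        // physical core in .../topology/thread_siblings_list (Linux sysfs).
        Set<String> cores = new TreeSet<String>();
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int cpu = 0; cpu < cpus; cpu++) {
            Path path = Paths.get("/sys/devices/system/cpu/cpu" + cpu
                    + "/topology/thread_siblings_list");
            cores.add(new String(Files.readAllBytes(path)).trim());
        }
        // One line per physical core, listing the logical CPUs that share it
        for (String siblings : cores) {
            System.out.println("logical CPUs sharing one core: " + siblings);
        }
    }
}

On a 24-core / 48-thread box you should end up with 24 distinct lines, each listing two logical CPU numbers.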

0