Java - multithreaded code does not run faster on more cores

I just ran multi-threaded code on a 4-core machine, hoping it would be faster than on a single-core machine. The idea: I have a fixed number of threads (in my case, one thread per core). Each thread executes a Runnable of the form:

    private static int[] data; // data shared across all threads

    public void run() {
        int i = 0;
        while (i++ < 5000) {
            // do some work
            for (int j = 0; j < 10000 / numberOfThreads; j++) {
                // each thread performs calculations and reads from and
                // writes to a different part of the data array
            }
            // wait for the other threads
            barrier.await();
        }
    }

On a quad-core machine, this code runs slower with 4 threads than with 1 thread. Even with the CyclicBarrier overhead, I would have thought the code should run at least 2 times faster. Why is it slower?

EDIT: Here is my attempt at a busy-wait implementation of the waiting. Unfortunately, it makes the program run even slower on multiple cores (also discussed in a separate question here):

    public void run() {
        // do work
        synchronized (this) {
            if (atomicInt.decrementAndGet() == 0) {
                atomicInt.set(numberOfOperations);
                for (int i = 0; i < threads.length; i++)
                    threads[i].interrupt();
            }
        }
        while (!Thread.interrupted()) {}
    }
+4
5 answers

Adding more threads is not guaranteed to improve performance. There are several possible reasons why performance can degrade with additional threads:

  • A coarse-grained lock can over-serialize execution: if only one thread can make progress at a time, you get all the overhead of multiple threads but none of the benefit. Try to reduce the time spent holding locks.
  • The same applies to overly frequent barriers and other synchronization constructs. If the inner j loop completes quickly, you can spend most of your time at the barrier. Try to do more work between synchronization points.
  • If your code runs too briefly, there may be no time to migrate threads to other CPU cores. This is usually not a problem unless you create a lot of very short-lived threads. Using thread pools, or just giving each thread more work, can help. If your threads run for more than a second or so, this is unlikely to be the problem.
  • If your threads read and write a lot of shared data, cache-line ping-ponging (false sharing) can hurt performance. Although this often leads to poor performance, it is unlikely to make it worse than single-threaded. Try to make sure the data each thread writes is separated from other threads' data by at least a cache line (usually about 64 bytes). In particular, don't lay out output arrays like [thread A, B, C, D, A, B, C, D, ...].
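To illustrate the last point, here is a minimal sketch (the class and method names, and the 64-byte line size, are assumptions, not from the question) that gives each thread its own cache-line-sized slot instead of adjacent array elements:

```java
// Hypothetical sketch: each thread writes only to its own cache-line-padded
// slot, so no two threads ever share a cache line.
public class PaddedCounters {
    // 64-byte cache line / 4-byte int (a common but platform-dependent assumption)
    private static final int STRIDE = 16;
    private final int[] slots;

    public PaddedCounters(int numThreads) {
        slots = new int[numThreads * STRIDE];
    }

    // thread t only ever touches slots[t * STRIDE], 64 bytes from its neighbors
    public void add(int t, int delta) {
        slots[t * STRIDE] += delta;
    }

    public int total(int numThreads) {
        int sum = 0;
        for (int t = 0; t < numThreads; t++)
            sum += slots[t * STRIDE];
        return sum;
    }
}
```

Compare this with the interleaved layout [A, B, C, D, A, B, C, D, ...], where every write by one thread invalidates the cache line held by its neighbors.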

Since you haven't shown your actual code, I can't say more than that.

+10

You sleep nanoseconds instead of milliseconds.

I changed

 Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads 

to

 Thread.sleep(100000 / numberOfThreads); 

and got a speedup proportional to the number of threads, as expected.
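The distinction is easy to verify; here is a small timing check (not from the answer's code, names are my own) contrasting the two overloads:

```java
// Hypothetical timing check: Thread.sleep(long, int)'s second argument is
// nanoseconds, so sleep(0, 100000) sleeps about 0.1 ms, not 100 seconds.
public class SleepUnits {
    static long elapsedNanos(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return System.nanoTime() - t0;
    }

    static boolean nanoOverloadIsShorter() {
        long nanoSleep = elapsedNanos(() -> {
            try { Thread.sleep(0, 100000); }   // 100,000 ns = 0.1 ms
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        long milliSleep = elapsedNanos(() -> {
            try { Thread.sleep(100); }          // 100 ms
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        // the nanosecond-overload sleep returns far sooner
        return nanoSleep < milliSleep;
    }

    public static void main(String[] args) {
        System.out.println(nanoOverloadIsShorter());
    }
}
```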


I came up with a CPU-intensive test, "countPrimes". Full test code here.

I get the following acceleration on my quad machine:

    4 threads: 1625
    1 thread:  3747

(The CPU load monitor indeed shows that 4 cores are busy in the first case and 1 core in the second.)

Conclusion: you do relatively small pieces of work in each thread between synchronizations. Synchronization takes much longer than the actual CPU computation.

(Also, if your code is memory-intensive, e.g. lots of array accesses in the threads, then the CPU will not be the bottleneck anyway, and you will not see any speedup from splitting the work across cores.)
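The full test code is only linked, not reproduced; as a rough sketch of what a scalable CPU-bound "countPrimes" might look like (the names and the stride partitioning are my assumptions), note that each thread does a large chunk of work and synchronizes exactly once:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a CPU-bound prime-counting task split across threads;
// each thread does a big chunk of computation and synchronizes only at the end.
public class CountPrimes {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    public static long count(int limit, int numThreads) throws InterruptedException {
        AtomicLong total = new AtomicLong();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int offset = t;
            threads[t] = new Thread(() -> {
                long local = 0;
                // stride partitioning: thread t tests offset, offset + numThreads, ...
                for (int n = offset; n < limit; n += numThreads)
                    if (isPrime(n)) local++;
                total.addAndGet(local); // the only synchronization point
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return total.get();
    }
}
```

Because synchronization happens once per thread rather than once per iteration, this kind of workload actually benefits from extra cores.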

+4

The code inside the Runnable does virtually nothing.
In your specific example with 4 threads, each thread sleeps for 2.5 seconds and then waits for the others via the barrier.
So all that happens is that each thread gets the CPU just long enough to increment i, and then blocks in sleep, leaving the CPU available.
I don't see why the scheduler would put each thread on a separate core, since the threads are basically just waiting. It is fair and reasonable to expect it to just use the same core and switch between threads.
UPDATE
I just saw that you updated the question to say that some work happens in the loop. You don't say what that work is, though.

+2

Cross-core synchronization is much slower than synchronization on a single core,

because on a single-core machine the JVM does not have to flush the cache (a very slow operation) on every synchronization.

Check out this blog post.

+2

This SpinBarrier is not thoroughly tested, but it should work.

Check whether it improves your case. Since you run the code in a loop, the extra synchronization only degrades performance when you have spare cores. By the way, I still suspect your workload is memory-bound. Can you tell us which CPU + OS you use?

Edit: I forgot the version field.

    import java.util.concurrent.atomic.AtomicInteger;

    public class SpinBarrier {
        final int permits;
        final AtomicInteger count;
        final AtomicInteger version;

        public SpinBarrier(int count) {
            this.count = new AtomicInteger(count);
            this.permits = count;
            this.version = new AtomicInteger();
        }

        public void await() {
            // spin until the count reaches zero or another thread
            // has already reset the barrier (version changed)
            for (int c = count.decrementAndGet(), v = this.version.get();
                 c != 0 && v == version.get();
                 c = count.get()) {
                spinWait();
            }
            // only one thread succeeds here; the rest lose the CAS
            if (count.compareAndSet(0, permits)) {
                this.version.incrementAndGet();
            }
        }

        protected void spinWait() {
        }
    }
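A possible driver for the barrier above (the 4-thread, 3-round setup is my own example; a copy of the SpinBarrier is embedded so the sketch compiles standalone):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical driver: 4 threads each pass the spin barrier 3 times.
// The SpinBarrier from the answer is repeated here so this compiles alone.
public class SpinBarrierDemo {
    static class SpinBarrier {
        final int permits;
        final AtomicInteger count, version;

        SpinBarrier(int count) {
            this.count = new AtomicInteger(count);
            this.permits = count;
            this.version = new AtomicInteger();
        }

        void await() {
            for (int c = count.decrementAndGet(), v = version.get();
                 c != 0 && v == version.get();
                 c = count.get()) {
                // busy-spin
            }
            if (count.compareAndSet(0, permits)) // only one thread wins the CAS
                version.incrementAndGet();
        }
    }

    // runs numThreads threads through the barrier for the given number of rounds
    public static int run(int numThreads, int rounds) throws InterruptedException {
        final SpinBarrier barrier = new SpinBarrier(numThreads);
        final AtomicInteger work = new AtomicInteger();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int r = 0; r < rounds; r++) {
                    work.incrementAndGet(); // a unit of "work" per round
                    barrier.await();        // wait for the other threads
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        return work.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 3)); // 4 threads * 3 rounds = 12
    }
}
```

Note that spinning only makes sense when there are spare cores; on a loaded single core the spinners just burn CPU the blocked threads could have used.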
+1
