Multi-thread memory access

I am writing a multi-threaded java application that runs on a Nehalem processor. However, I have a problem: starting from 4 threads, I almost see no acceleration in my application.

I did some simple tests. I created a stream that just allocates a large array and makes access to random entries in the array. Therefore, when I start the number of threads, the operating time should not change (provided that I did not exceed the number of available processor cores). But I noticed that starting 1 or 2 threads takes almost the same time, but running 4 or 8 threads is much slower. Therefore, before trying to solve the problem of algorithmic and synchronization in my application, I want to find out that the maximum possible parallelization that I can achieve is possible.

I used the -XX:+UseNUMA JVM option, so arrays should be allocated in memory next to the corresponding threads.

PS If the threads performed a simple mathematical calculation, there was no time for 4 or even 8 threads, so I came to the conclusion that when the threads access memory, I have some problems.

Any help or ideas are appreciated, thanks.


EDIT

Thanks everyone for the answers. I see that I have not explained myself well enough.

Before trying to fix the synchronization problems in my application, I did a simple test that checks the best possible parallelization that can be achieved. The code is as follows:

 public class TestMultiThreadingArrayAccess { private final static int arrSize = 40000000; private class SimpleLoop extends Thread { public void run() { int array[] = new int[arrSize]; for (long i = 0; i < arrSize * 10; i++) { array[(int) ((i * i) % arrSize)]++; // randomize a bit the access to the array } long sum = 0; for (int i = 0; i < arrSize; i++) sum += array[i]; } } public static void main(String[] args) { TestMultiThreadingArrayAccess test = new TestMultiThreadingArrayAccess(); for (int threadsNumber : new int[] { 1, 2, 4, 8 }) { Statistics timer = new Statistics("Executing " + threadsNumber+ " threads"); // Statistics is a simple helper class that measures the times timer.start(); test.doTest(threadsNumber); timer.stop(); System.out.println(timer.toString()); } } public void doTest(int threadsNumber) { Thread threads[] = new Thread[threadsNumber]; for (int i = 0; i < threads.length; i++) { threads[i] = new SimpleLoop(); threads[i].start(); } for (int i = 0; i < threads.length; i++) try { threads[i].join(); } catch (InterruptedException e) { }; } } 

So, you see that in this ministry there is no synchronization, nor is the distribution of the array inside the stream, so it should be placed in a piece of memory that can be quickly accessed. There are also no memory statements in this code. Nevertheless, for 4 threads in the process, there is a decrease of 30%, and 8 threads work twice as slow. Since you are from code, I just wait until all threads finish their work, and since their work is independent, the number of threads should not affect the total execution time.

There are 2 quad-core Nehalem hyperprocessors installed on the machine (16 processors in total), so with 8 threads everyone can capture it exclusively by the CPU.

When I tried to run this test with a smaller array (20 thousand records), the drop in execution time from 4 threads was 7% and 8 threads - 14%, which is satisfactory. But when I try to use random access to a large array (40M records), the operating time increases dramatically, so I think that there is a problem that large chunks of memory (because they do not fit into the cache?) Are available in non- effective way.

Any ideas how to fix this?

Hope this clears up the question better, thanks again.

+4
source share
6 answers

The bottleneck in the test is the processor with memory. Even if local memory is available, it will be used by multiple threads. (The memory is local to the node, not the specific kernel.) As soon as the CPU can easily exceed the available bandwidth for a simple loop such as your test above, and therefore increasing the threads on such a test will not improve performance and may degrade performance due to degradation cache coherency.

Just a sensitivity test, are you also using a parallel collector? -XX:+UseParallelGC . UseNUMA takes effect only then.

+2
source

Not knowing what you are doing and what the problem is that you are trying to solve. It looks like you have strong synchronization around your code, as this may be the main reason that you are not scalable enough. Compared to synchronization, you can slow down any acceleration as soon as the application becomes almost serial. So my suggestion for you is to check your implementation and try to figure it out.

ADD.

After you have added your implementation of what you are doing. The decrease in performance can be explained by large and massive access to memory. After you start all your threads, and they need to access the memory controller for non-cached data, since they work on different processors, the memory controller does not allow the CPU to do this at the same time, that is, synchronization occurs at the hardware level at each hardware level . In this case, you are almost equal, as if you were running 10 different independent programs. I think if you run 10 (you can replace 10 with any large number) it copies your web browser, for example, you will see the same effect, but this does not mean that the browser implementation is inefficient, you just create a huge load on the computer's memory.

+1
source

As Artem notes, it is possible that you do not need synchronization. But I will start by establishing the facts. Is your application REALLY running slower as you describe?

Here's a insightful article on the topic: http://codeidol.com/java/java-concurrency/Testing-Concurrent-Programs/Avoiding-Performance-Testing-Pitfalls/

It is actually quite difficult to write useful micro-tests, especially when you are dealing with parallel code. For example, you might have β€œDead code elim,” in which the compiler optimizes the code that you think is executing. It's also hard to guess when garbage collection starts. Optimization of Hotspot runtimes also makes measurement difficult. In the case of threads, you need to consider the time it takes to create them. Thus, you may need to use "CyclicBarrier", etc. For accurate measurement. Such things..

Having said that, it is difficult for me that you will have problems with access to memory, if all you do is read. We could help you better if you can post the code ...

0
source

There are two obvious potential problems that spring is about.

  • Using more threads allocates more arrays that break the cache. Access to main memory or to lower cache levels is much slower.
  • If you use the same source of the random number generator instance, then the threads will fight for access to it. Perhaps this is not a complete synchronization, but rather, memory barriers with a blocking algorithm. Typically, blocking algorithms, although usually fast, become much slower with a high level of competition.
0
source

In addition to concurrency issues, the most likely cause of slowdown is a memory cache conflict.

If all threads access the same piece of memory, the likelihood that you want to access it may be in a different process memory cache.

If the storage is read-only, you can provide each thread with its own copy, which allows the JVM and processor to optimize the memory accessory.

0
source

I changed your test with the advice from the article I published. On my 2nd machine (which is all that I have now) the result seems reasonable (note that I did 2 tests for each thread number):

Maybe you can try this? (Please note that I had to slightly modify your test (see Comment), because it took a lot of time for my poor hardware.

Also note that I run this test using the -server .

 Test with threadNum 1 took 2095717473 ns Test with threadNum 1 took 2121744523 ns Test with threadNum 2 took 2489853040 ns Test with threadNum 2 took 2465152974 ns Test with threadNum 4 took 5044335803 ns Test with threadNum 4 took 5041235688 ns Test with threadNum 8 took 10279012556 ns Test with threadNum 8 took 10347970483 ns 

the code:

 import java.util.concurrent.*; public class Test{ private final static int arrSize = 20000000; public static void main(String[] args) throws Exception { int[] nums = {1,1,2,2,4,4,8,8};//allow hotspot optimization for (int threadNum : nums) { final CyclicBarrier gate = new CyclicBarrier(threadNum+1); final CountDownLatch latch = new CountDownLatch(threadNum); ExecutorService exec = Executors.newFixedThreadPool(threadNum); for(int i=0; i<threadNum; i++){ Runnable test = new Runnable(){ public void run() { try{ gate.await(); }catch(Exception e){ throw new RuntimeException(e); } int array[] = new int[arrSize]; //arrSize * 10 took very long to run so made it // just arrSize. for (long i = 0; i < arrSize; i++) { array[(int) ((i * i) % arrSize)]++; }//for long sum = 0; for (int i = 0; i < arrSize; i++){ sum += array[i]; } if(new Object().hashCode()==sum){ System.out.println("oh"); }//if latch.countDown(); }//run };//test exec.execute(test); }//for gate.await(); long start = System.nanoTime(); latch.await(); long finish = System.nanoTime(); System.out.println("Test with threadNum " + threadNum +" took " + (finish-start) + " ns "); exec.shutdown(); exec.awaitTermination(Long.MAX_VALUE,TimeUnit.SECONDS); }//for }//main }//Test 
0
source

Source: https://habr.com/ru/post/1315764/


All Articles