I am writing a multi-threaded Java application that runs on a Nehalem processor. However, I have a problem: starting from 4 threads, I see almost no speedup in my application.
I did some simple tests. I created a thread that just allocates a large array and accesses random entries in it. So when I increase the number of threads, the running time should not change (provided that I do not exceed the number of available CPU cores). But I noticed that running 1 or 2 threads takes almost the same time, whereas running 4 or 8 threads is much slower. So before trying to solve the algorithmic and synchronization problems in my application, I want to find out the maximum possible parallelization I can achieve.
I used the -XX:+UseNUMA JVM option, so arrays should be allocated in memory next to the corresponding threads.
PS: When the threads performed a simple mathematical calculation instead, there was no slowdown for 4 or even 8 threads, so I concluded that the problem arises when the threads access memory.
Any help or ideas are appreciated, thanks.
EDIT
Thanks everyone for the answers. I see that I have not explained myself well enough.
Before trying to fix the synchronization problems in my application, I did a simple test that checks the best possible parallelization that can be achieved. The code is as follows:
    public class TestMultiThreadingArrayAccess {
        private static final int arrSize = 40000000;

        private class SimpleLoop extends Thread {
            public void run() {
                // Each thread allocates its own array and touches pseudo-random entries.
                int[] array = new int[arrSize];
                for (long i = 0; i < arrSize * 10; i++) {
                    array[(int) ((i * i) % arrSize)]++;
                }
            }
        }
        // ...
    }
So, as you can see, there is no synchronization in this mini-test, and the array is allocated inside the thread, so it should be placed in a chunk of memory that the thread can access quickly. There is also no memory contention in this code. Nevertheless, with 4 threads the running time gets about 30% worse, and with 8 threads the test runs twice as slowly. As you can see from the code, I just wait until all threads finish their work, and since their work is independent, the number of threads should not affect the total execution time.
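For reference, the measurement described above could be sketched roughly as follows. This is a minimal sketch, not the exact harness from my application: the class name `ThreadScalingSketch` and the helpers `touchRandomEntries` and `timeThreads` are hypothetical names chosen for illustration, and it assumes Java 8 or later (for the lambda). The only deliberate difference from the loop above is that each thread sums its array at the end, so the JIT cannot discard the work as dead code.

```java
public class ThreadScalingSketch {

    // Same access pattern as the loop in the question: allocate a private
    // array, then increment pseudo-random entries. Returning the sum of the
    // array keeps the JIT from discarding the loop as dead code.
    static long touchRandomEntries(int arraySize, long accesses) {
        int[] array = new int[arraySize];
        for (long i = 0; i < accesses; i++) {
            array[(int) ((i * i) % arraySize)]++;
        }
        long sum = 0;
        for (int value : array) {
            sum += value;
        }
        return sum; // equals 'accesses', since every iteration adds exactly 1
    }

    // Starts threadCount independent threads, waits for all of them,
    // and returns the elapsed wall-clock time in nanoseconds.
    static long timeThreads(int threadCount, int arraySize, long accesses)
            throws InterruptedException {
        Thread[] threads = new Thread[threadCount];
        long[] sums = new long[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            threads[t] = new Thread(() -> sums[id] = touchRandomEntries(arraySize, accesses));
        }
        long start = System.nanoTime();
        for (Thread thread : threads) thread.start();
        for (Thread thread : threads) thread.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        int arraySize = 40_000_000;      // same size as arrSize in the question
        long accesses = 10L * arraySize; // same iteration count as the question's loop
        for (int threadCount : new int[] {1, 2, 4, 8}) {
            long nanos = timeThreads(threadCount, arraySize, accesses);
            System.out.printf("%d thread(s): %d ms%n", threadCount, nanos / 1_000_000);
        }
    }
}
```

With 8 threads this allocates about 8 x 160 MB of arrays, so the heap limit (`-Xmx`) may need to be raised. If the threads were truly independent, the printed times would stay roughly constant as the thread count grows.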
The machine has 2 quad-core hyperthreaded Nehalem processors (16 logical CPUs in total), so with 8 threads each thread can have a CPU to itself.
When I ran this test with a smaller array (20K entries), the slowdown was 7% with 4 threads and 14% with 8 threads, which is acceptable. But when I use random access to a large array (40M entries), the running time increases dramatically, so I think the problem is that large chunks of memory (because they do not fit into the cache?) are accessed inefficiently.
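As a rough back-of-the-envelope check of the cache hypothesis, the two working-set sizes can be compared with cache capacities. The sketch below is only illustrative: `WorkingSetSketch` and `arrayBytes` are hypothetical names, the array object header is ignored, and the cache sizes (256 KB of L2 per core, 8 MB of shared L3 per socket) are assumptions based on typical Nehalem-class parts, not values measured on my machine.

```java
public class WorkingSetSketch {

    // Approximate payload size of an int[] with the given number of entries
    // (the small array object header is ignored).
    static long arrayBytes(long entries) {
        return entries * Integer.BYTES;
    }

    public static void main(String[] args) {
        // Assumed cache sizes; check the actual CPU model before relying on them.
        long assumedL2Bytes = 256L * 1024;
        long assumedL3Bytes = 8L * 1024 * 1024;

        long smallArray = arrayBytes(20_000);      // the 20K-entry test
        long largeArray = arrayBytes(40_000_000);  // the 40M-entry test

        System.out.println("small array bytes: " + smallArray
                + ", fits in assumed L2: " + (smallArray <= assumedL2Bytes));
        System.out.println("large array bytes: " + largeArray
                + ", fits in assumed L3: " + (largeArray <= assumedL3Bytes));
    }
}
```

Under these assumptions the small array (80,000 bytes) fits in a per-core cache, while each large array (160,000,000 bytes) is far bigger than the shared cache, so nearly every random access in the large test would go to main memory.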
Any ideas how to fix this?
Hope this clears up the question better, thanks again.