I have a piece of Java code (JDK 1.6.0_22, if it's relevant)
Since then there have been quite significant performance improvements. I would try updating to Java 6 update 37 or Java 7 update 10.
However, it uses a lot of memory
This can mean that how you access your data matters. Accessing data in main memory can be 20+ times slower than accessing it in your primary cache. This means you have to access data sparingly and make the most of each piece of new data you access.
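As an illustration of "make the most of each piece of data" (my own example, not from the question): fusing several passes over a large array into one, so each cache line is fetched from main memory once instead of three times.

```java
public class OnePass {
    public static void main(String[] args) {
        double[] data = new double[10000000]; // ~80 MB, far larger than the caches
        java.util.Arrays.fill(data, 1.0);

        // Wasteful: three passes stream the whole array from main memory three times.
        double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double d : data) sum += d;
        for (double d : data) min = Math.min(min, d);
        for (double d : data) max = Math.max(max, d);

        // Better: one fused pass touches each cache line only once.
        sum = 0; min = Double.MAX_VALUE; max = -Double.MAX_VALUE;
        for (double d : data) {
            sum += d;
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        System.out.println(sum + " " + min + " " + max);
    }
}
```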
After 3 threads, I can add as many threads as I want and performance does not improve. Instead, I observed all the threads running very slowly.
This suggests you are using some resource to its maximum. Given the amount of memory in use, the most likely resource to be maxed out is the CPU-to-main-memory bridge. I suspect you have one bridge for 64 threads! This means you should consider approaches which might use more CPU but improve how you access memory (less randomly and more sequentially), and which reduce the volume of data you move (where possible, use more compact types). For example, I have a "short with two decimal places" type instead of float, which can halve the memory used.
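The answer doesn't show that type itself; a minimal sketch of such a fixed-point encoding might look like this (class and method names are my own):

```java
// Store 12.34 as the short 1234; the representable range is -327.68 .. 327.67.
public final class ShortDecimal {
    public static short encode(double value) {
        long scaled = Math.round(value * 100);
        if (scaled < Short.MIN_VALUE || scaled > Short.MAX_VALUE)
            throw new IllegalArgumentException("out of range: " + value);
        return (short) scaled;
    }

    public static double decode(short raw) {
        return raw / 100.0;
    }
}
```

A short[] of such values takes 2 bytes per element where a float[] takes 4, halving the memory footprint of a large array.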
As you noticed, when each thread updates its own private AtomicLong, you get linear scalability. That pattern doesn't use the CPU-to-main-memory bridge at all.
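A minimal sketch of that pattern (my own illustration, not the OP's code); note that neighbouring AtomicLongs can still land on the same cache line, so in practice padding between them may matter:

```java
import java.util.concurrent.atomic.AtomicLong;

// Each thread increments its own private AtomicLong, so no cache line
// is contended between threads and scaling stays near linear.
public class PrivateCounters {
    public static void main(String[] args) throws InterruptedException {
        int nThreads = Runtime.getRuntime().availableProcessors();
        final AtomicLong[] counters = new AtomicLong[nThreads];
        Thread[] threads = new Thread[nThreads];

        for (int i = 0; i < nThreads; i++) {
            final AtomicLong own = counters[i] = new AtomicLong();
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < 100000000; n++)
                        own.incrementAndGet(); // no sharing with other threads
                }
            });
            threads[i].start();
        }
        for (Thread t : threads)
            t.join();

        long total = 0;
        for (AtomicLong c : counters)
            total += c.get();
        System.out.println("total = " + total);
    }
}
```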
From @Marko:
Peter, do you have any insight into how these multi-level memory architectures work, anyway?
Not as much as we would like, because these details are not visible to Java.
Does each core have an independent channel to memory?
Each core has an independent channel to the primary cache. For the outer cache, there may be a channel for each core, or one per 2-6 cache regions, but under heavy load you will hit a high number of collisions.
For the bridge to main memory there is one very wide channel. This favours long sequential accesses but is very poor for random accesses. A single thread can max it out with random reads (reads random enough that they don't fit in the outer cache).
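To make that sequential-versus-random claim concrete, here is a hypothetical single-threaded micro-benchmark (my own sketch; sizes and ratios are illustrative, and it needs a large enough heap, e.g. -Xmx1g):

```java
import java.util.Random;

public class SequentialVsRandom {
    static final int SIZE = 1 << 25; // 32M longs = 256 MB, far beyond any cache

    public static void main(String[] args) {
        long[] data = new long[SIZE];
        int[] order = new int[SIZE];
        Random rand = new Random(1);
        for (int i = 0; i < SIZE; i++)
            order[i] = rand.nextInt(SIZE); // indices random enough to defeat the outer cache

        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < SIZE; i++)
            sum += data[i];                // sequential: the prefetcher keeps the wide channel busy
        long seq = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < SIZE; i++)
            sum += data[order[i]];         // random: roughly one cache miss per read
        long rnd = System.nanoTime() - t0;

        System.out.printf("sequential %d ms, random %d ms (sum=%d)%n",
                seq / 1000000, rnd / 1000000, sum);
    }
}
```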
Or at least independent, in the absence of collisions?
Once you go beyond the primary cache (L1, typically 32 KB), it is collisions all the way.
Because otherwise scaling would be a big problem.
As the OP demonstrates. Most applications either a) spend a significant portion of their time waiting for I/O, or b) do bursts of computation over small batches of data. Doing heavy computation across large volumes of data is the worst-case scenario.
The way I deal with this is to arrange my data structures in memory for sequential access. I use off-heap memory, which is a pain but gives you full control over the layout. (My source data is memory-mapped for persistence.) I stream the data in with sequential accesses and try to make the most of that data (i.e. minimise repeated access). Even then, with 16 cores it is hard to assume all of them will be used efficiently, since I have 40 GB of source data I am working on at any one time and about 80 GB of derived data.
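The actual layout code isn't shown in the answer; a small sketch of the memory-mapped, sequential-streaming part might look like this (the file name and the record format of plain longs are invented for illustration):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStream {
    public static void main(String[] args) throws Exception {
        // "source.dat" is a placeholder; the real data layout is application-specific.
        RandomAccessFile file = new RandomAccessFile("source.dat", "r");
        FileChannel channel = file.getChannel();
        long size = channel.size();
        final long CHUNK = 1L << 30; // map 1 GB at a time; a single mapping is capped at 2 GB

        long total = 0;
        for (long pos = 0; pos < size; pos += CHUNK) {
            long len = Math.min(size - pos, CHUNK);
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
            while (buf.remaining() >= 8)
                total += buf.getLong(); // one sequential pass; each page is touched once
        }
        channel.close();
        file.close();
        System.out.println("total = " + total);
    }
}
```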
Note: high-performance GPUs address this problem with incredibly high memory bandwidth. A top-end GPU can achieve 250 GB/s, whereas a typical CPU manages about 4-6 GB/s. Even so, they are better suited to vectorised processing, and their quoted peak performance is likely to involve little memory access, e.g. Mandelbrot sets.
http://www.nvidia.com/object/tesla-servers.html