What mechanisms other than mutexes or garbage collection can slow down my multi-threaded Java program?

Problem

I have a piece of Java code (JDK 1.6.0_22, if that matters) that implements a stateless, side-effect-free function with no mutexes. However, it uses a lot of memory (I don't know how important that is).

In the past, at Sun Laboratories, I put together a standard performance-versus-thread-count curve. Since this function has no mutexes, it produced a good graph, although garbage-collection time rose sharply as the thread count increased. After some garbage-collection tuning, I was able to make that curve almost flat.

Now I am doing the same experiment on Intel hardware. The machine has 4 processors, each with 8 cores and hyperthreading, which gives 64 logical CPUs (4 × 8 × 2). Unfortunately, the performance-versus-thread-count curve scales well for 1, 2, and 3 threads, then plateaus at 3. Beyond 3 threads, I can add as many threads as I want and performance does not improve.

Attempts to fix the problem

My first thought was that I had been careless and introduced some synchronized code somewhere. Usually, to investigate this, I start JConsole or JVisualVM and look at the thread stacks. If I have 64 threads running at the throughput of 3, I would expect 61 of them to be sitting blocked, waiting to enter a mutex. I did not find that. Instead, I found all the threads RUNNABLE: just running very slowly.
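The state census described above (61 blocked threads versus all threads RUNNABLE) can be taken programmatically as well as through JConsole. A minimal sketch using the standard `ThreadMXBean` API; the class and method names here are my own:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Sketch: programmatic equivalent of eyeballing stacks in JConsole --
// count how many threads are BLOCKED (waiting on a monitor) vs RUNNABLE.
public class ThreadStateCensus {
    static Map<Thread.State, Integer> census() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // dumpAllThreads(false, false): no locked monitors/synchronizers needed
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // If a mutex were the bottleneck, most threads would show BLOCKED here.
        census().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If a lock were the culprit, the BLOCKED count would dominate; in the situation described above it would not.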

My second thought was that perhaps the timing framework was introducing the problem. I replaced my function with a dummy function that counts to a billion using an AtomicLong. It scaled perfectly with the number of threads: the dummy workload ran 64 times faster with 64 threads than with 1.
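A sketch of what such a dummy scaling test might look like; the class name, thread count, and limit are illustrative, not the original harness:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the dummy workload described above: each thread counts on its
// own AtomicLong, so there is no shared state and no lock contention.
public class CountBenchmark {
    // Counts up to `limit` on a fresh AtomicLong and returns the final value.
    static long count(long limit) {
        AtomicLong counter = new AtomicLong();
        while (counter.incrementAndGet() < limit) {
            // busy counting; nothing shared, so threads scale independently
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[nThreads];
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> count(100_000_000L));
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        long elapsed = System.nanoTime() - start;
        System.out.println(nThreads + " threads took "
                + elapsed / 1_000_000 + " ms");
    }
}
```

Because each thread owns its counter, wall-clock time should stay roughly constant as threads are added, which is the linear scaling the question reports.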

In desperation, I thought that maybe garbage collection really was taking a lot of time, so I changed the garbage-collection options. Although this improved my latency variation, it did not affect throughput: I still had 64 threads running at the speed I would expect 3 to run at.

I downloaded Intel's VTune tool, but my skill with it is weak: it is a complex tool, and I do not understand it yet. I have the instruction manual on order: it will make an interesting Christmas present, but it is too late to help with my current problem.

Question

  • What tools (mental or software) can I use to improve my understanding of what is happening?
  • What mechanisms other than mutexes or garbage collection can slow down my code?
+17
java performance garbage-collection multithreading
Dec 20
3 answers

A lot of experimentation later, I discovered that the choice of JVM made no difference, but I also discovered the power of thread dumps (jstack). 50 of the 64 threads were on the following line.

    java.lang.Thread.State: RUNNABLE
        at java.util.Random.next(Random.java:189)
        at java.util.Random.nextInt(Random.java:239)
        at sun.misc.Hashing.randomHashSeed(Hashing.java:254)
        at java.util.HashMap.<init>(HashMap.java:255)
        at java.util.HashMap.<init>(HashMap.java:297)

Random.next is as follows

    protected int next(int bits) {
        long oldseed, nextseed;
        AtomicLong seed = this.seed;
        do {
            oldseed = seed.get();
            nextseed = (oldseed * multiplier + addend) & mask;
        } while (!seed.compareAndSet(oldseed, nextseed));
        return (int)(nextseed >>> (48 - bits));
    }

The most interesting thing is that this is not an obvious lock (it is a compare-and-set retry loop), so the tools I use to detect mutex contention did not see it.

So it seems that any Java HashMap creation can stop an application from scaling (I exaggerate, but only slightly). My application uses hash maps heavily, so it looks like I will either rewrite HashMap or rewrite the application.
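For anyone wanting to reproduce the mechanism: on that JDK, the HashMap constructor draws its hash seed from a single shared java.util.Random (via sun.misc.Hashing.randomHashSeed in the trace above), and Random.next is a compare-and-set retry loop, so many threads end up spinning on one AtomicLong. This hypothetical standalone demo triggers the same contention by hitting a shared Random directly, and contrasts it with ThreadLocalRandom (available from Java 7); thread counts and iteration counts are illustrative:

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the contention mechanism: a shared Random makes every thread
// CAS on one seed; ThreadLocalRandom gives each thread its own state.
public class SeedContentionDemo {
    static final Random SHARED = new Random();

    // Draws n values from the shared Random -- all threads CAS on one seed.
    static long drawShared(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += SHARED.nextInt(100);
        return sum;
    }

    // Draws n values from ThreadLocalRandom -- no shared state, so it scales.
    static long drawLocal(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += ThreadLocalRandom.current().nextInt(100);
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        for (boolean shared : new boolean[] {true, false}) {
            long start = System.nanoTime();
            Thread[] ts = new Thread[8];
            for (int i = 0; i < ts.length; i++) {
                ts[i] = new Thread(() -> {
                    if (shared) drawShared(2_000_000);
                    else drawLocal(2_000_000);
                });
                ts[i].start();
            }
            for (Thread t : ts) t.join();
            System.out.println((shared ? "shared Random: " : "ThreadLocalRandom: ")
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}
```

On a multi-core machine the shared-Random pass is typically far slower, and a thread dump taken during it shows RUNNABLE threads stuck in Random.next, just as the question describes.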

I have raised a separate question on how to handle this.

Thanks for the help.

+9
Dec 22

I have a piece of Java code (JDK 1.6.0_22, if appropriate)

There have been quite significant performance improvements in Java since then. I would try updating to Java 6 update 37 or Java 7 update 10.

However, it uses a lot of memory

This may mean that how you access your data matters. Accessing data in main memory can be 20+ times slower than accessing your L1 cache. This means you need to access your data conservatively and make the most of every piece of new data you touch.
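A sketch of the kind of experiment that makes that gap visible: summing a large array sequentially (cache- and prefetch-friendly) versus in a random order (mostly cache misses). The array size, seed, and class name are arbitrary choices of mine:

```java
import java.util.Random;

// Sketch: sequential vs random traversal of an array much larger than cache.
public class AccessPatternDemo {
    static long sequentialSum(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    static long randomSum(int[] data, int[] order) {
        long sum = 0;
        for (int i = 0; i < order.length; i++) sum += data[order[i]];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 24; // 16M ints = 64 MB, far larger than any cache level
        int[] data = new int[n];
        int[] order = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = i;
            order[i] = rnd.nextInt(n); // random visiting order
        }

        long t0 = System.nanoTime();
        sequentialSum(data);
        long seq = System.nanoTime() - t0;

        t0 = System.nanoTime();
        randomSum(data, order);
        long rand = System.nanoTime() - t0;

        System.out.println("sequential: " + seq / 1_000_000
                + " ms, random: " + rand / 1_000_000 + " ms");
    }
}
```

The random pass usually runs many times slower even though both loops do the same number of additions; the difference is entirely memory access pattern.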

After 3 threads, I can add as many threads as I want, and performance will not improve. ... Instead, I found all the threads running: very slowly.

This suggests you are saturating some resource. Given the amount of memory used, the most likely resource to be maxed out is the bridge between the processors and main memory. I suspect you have a single memory bridge serving all 64 threads! This means you should look for ways to use the processors while improving the memory access pattern (less random, more sequential) and reducing the volume of data (use more compact types where you can). For example, I have a "short with two decimal places" type instead of float, which can halve the memory used.
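The "short with two decimal places" idea can be sketched as a simple fixed-point encoding; the class and method names here are invented for illustration:

```java
// Sketch of the "short with two decimal places" idea mentioned above: store
// a value like 12.34 as the short 1234, halving the footprint of a float.
public class FixedPoint2 {
    // Encodes a value in [-327.68, 327.67] as hundredths; throws on overflow.
    static short encode(double value) {
        long scaled = Math.round(value * 100);
        if (scaled < Short.MIN_VALUE || scaled > Short.MAX_VALUE)
            throw new IllegalArgumentException("out of range: " + value);
        return (short) scaled;
    }

    static double decode(short stored) {
        return stored / 100.0;
    }

    public static void main(String[] args) {
        short s = encode(12.34);
        System.out.println(s + " -> " + decode(s));
    }
}
```

A `short[]` of such values occupies half the memory of a `float[]`, which directly reduces traffic over the memory bridge at the cost of range and precision.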

As you noticed, when each thread updates its own AtomicLong you get linear scalability. That puts no load at all on the processor-to-main-memory bridge.




From @Marko

Peter, do you have any idea how these many-core architectures work with memory?

Not as much as I would like, because this problem is not visible to Java.

Does each core have an independent channel?

Each core has an independent channel to its primary (L1) cache. For the outer cache there may be a channel for each core, or one per 2-6 cache regions, but you will run into a large number of collisions under heavy load.

There is one very wide channel over the bridge to main memory. This favors long sequential accesses but is very bad for random access. A single thread can saturate it with random reads (random enough that they do not fit in the outer cache).

Or at least independent, in the absence of collisions?

Once you are outside the primary cache (L1, usually 32 KB per core), it is collisions all the way.

Because otherwise, scaling is a big problem.

As the OP is finding. Most applications either a) spend a significant portion of their time waiting for I/O, or b) do bursts of computation on small batches of data. Doing heavy computation across large amounts of data is the worst-case scenario.

The way I deal with this is to arrange my data structures in memory for sequential access. I use off-heap memory, which is a pain, but gives you full control over the layout. (My raw data is memory-mapped.) I stream the data in with sequential access and try to make the most of that data once loaded (i.e. minimize re-access). Even then, with 16 cores it is hard to assume all of them will be used efficiently, since I have 40 GB of raw data in flight at any given time and about 80 GB of derived data.
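A minimal sketch of the sequential, off-heap layout described above, using a direct ByteBuffer; the record format (a long id plus a double value) and all names are invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: keep records off the Java heap in one contiguous direct buffer
// and read them back in a single sequential pass.
public class OffHeapScan {
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    static ByteBuffer writeRecords(int count) {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * RECORD_BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            buf.putLong(i);         // record id
            buf.putDouble(i * 0.5); // record value
        }
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // One sequential pass over all records, summing the values.
    static double scan(ByteBuffer buf) {
        double total = 0;
        while (buf.hasRemaining()) {
            buf.getLong();          // skip the id
            total += buf.getDouble();
        }
        buf.rewind();
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer buf = writeRecords(1_000_000);
        System.out.println("sum = " + scan(buf));
    }
}
```

The same access pattern works over a memory-mapped file via `FileChannel.map`, which is closer to the setup the answer describes; a direct buffer keeps the sketch self-contained.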

Note: high-performance GPUs attack this problem with incredibly high memory bandwidth. A top-end GPU can manage 250 GB/s, while a typical CPU is around 4-6 GB/s. However, they are better suited to vectorized processing, and their quoted peak performance likely involves little memory access, e.g. Mandelbrot sets.

http://www.nvidia.com/object/tesla-servers.html

+11
Dec 20

Perhaps you are hitting the allocation wall: your program can run no faster than objects can be allocated, and allocation is limited by memory bandwidth.
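One way to see the allocation wall in miniature is to compare the same loop with and without a fresh allocation per iteration. The class name and sizes are mine; note that the JIT's escape analysis may eliminate the allocation entirely, so treat this as a sketch of the idea rather than a reliable benchmark:

```java
// Sketch of the "allocation wall": identical work, with and without a fresh
// allocation per iteration. Escape analysis may optimize the first variant.
public class AllocationWallDemo {
    static long withAllocation(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            int[] scratch = new int[256]; // fresh allocation every iteration
            scratch[0] = i;
            sum += scratch[0];
        }
        return sum;
    }

    static long withoutAllocation(int iterations) {
        long sum = 0;
        int[] scratch = new int[256];     // one reused buffer
        for (int i = 0; i < iterations; i++) {
            scratch[0] = i;
            sum += scratch[0];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 50_000_000;
        long t0 = System.nanoTime();
        withAllocation(n);
        long alloc = System.nanoTime() - t0;
        t0 = System.nanoTime();
        withoutAllocation(n);
        long reuse = System.nanoTime() - t0;
        System.out.println("with allocation: " + alloc / 1_000_000
                + " ms, reused buffer: " + reuse / 1_000_000 + " ms");
    }
}
```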

0
Dec 23 '12 at 13:29
