I am investigating which approach gives the best multi-threaded increment performance. I compared implementations based on synchronization, on AtomicInteger, and a custom implementation that works like AtomicInteger but calls LockSupport.parkNanos(1) after a failed CAS.
    private int customAtomic() {
        int ret;
        for (;;) {
            ret = intValue;
            if (unsafe.compareAndSwapInt(this, offsetIntValue, ret, ++ret)) {
                break;
            }
            LockSupport.parkNanos(1);
        }
        return ret;
    }
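For readers who want to try the three variants without sun.misc.Unsafe, here is a minimal sketch. It assumes AtomicInteger.compareAndSet as a stand-in for unsafe.compareAndSwapInt; the class and method names (IncrementStrategies, incrementLocked, incrementAtomic, incrementWithBackoff, run) are hypothetical and not taken from the linked test code. It only checks correctness of the counters, it is not a benchmark.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical demo class; names are illustrative, not from the linked benchmark.
class IncrementStrategies {
    private int lockedValue; // guarded by the monitor of "this"
    private final AtomicInteger atomicValue = new AtomicInteger();
    private final AtomicInteger backoffValue = new AtomicInteger();

    // Variant 1: monitor-based increment.
    synchronized int incrementLocked() {
        return ++lockedValue;
    }

    synchronized int lockedValue() {
        return lockedValue;
    }

    // Variant 2: plain AtomicInteger increment.
    int incrementAtomic() {
        return atomicValue.incrementAndGet();
    }

    // Variant 3: CAS loop that parks briefly after every failed attempt.
    // compareAndSet is used here instead of Unsafe.compareAndSwapInt (assumption).
    int incrementWithBackoff() {
        for (;;) {
            int current = backoffValue.get();
            int next = current + 1;
            if (backoffValue.compareAndSet(current, next)) {
                return next;
            }
            LockSupport.parkNanos(1); // back off instead of retrying immediately
        }
    }

    // Starts the given number of threads; each performs all three increments
    // perThread times. Returns the final values {locked, atomic, backoff}.
    static int[] run(int threads, final int perThread) {
        final IncrementStrategies s = new IncrementStrategies();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        s.incrementLocked();
                        s.incrementAtomic();
                        s.incrementWithBackoff();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return new int[] { s.lockedValue(), s.atomicValue.get(), s.backoffValue.get() };
    }

    public static void main(String[] args) {
        int[] totals = run(4, 2000);
        System.out.println("locked=" + totals[0] + " atomic=" + totals[1] + " backoff=" + totals[2]);
    }
}
```

All three counters should end at threads * perThread; the variants differ only in how a thread behaves when it loses the race.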
I wrote a test based on JMH. It runs each method on its own, each method combined with CPU-consuming work (Blackhole.consumeCPU with 1, 2, 4, 8 and 16 tokens), and the CPU-consuming work alone. Each benchmark method was run on an Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz, 8 cores + 8 HT, 64 GB RAM, with 1 to 17 threads. The results surprised me:
- CAS is the most efficient with 1 thread. With 2 threads the result is similar to the monitor. With 3 or more threads it is about 2 times worse than the monitor.
- In most cases, the custom implementation is 2-3 times better than the monitor.
- But in the custom implementation, a bad run sometimes happens at random. A good case is 50 ops/us. A bad case is 0.5 ops/us.
Questions:
- Why is AtomicInteger not based on synchronization, if synchronization performs better than the current implementation?
- Why doesn't AtomicInteger call LockSupport.parkNanos(1) after a failed CAS?
- Why do these spikes happen in the custom implementation?

I ran this test several times, and the spike always occurs at a different number of threads. I also ran the test on other machines, with the same result. Maybe the problem is in the test itself. In the "bad case" of the custom implementation, StackProfiler shows:
    ....[Thread state distributions]....................................................................
     50.0%         RUNNABLE
     49.9%         TIMED_WAITING

    ....[Thread state: RUNNABLE]........................................................................
     43.3%  86.6% sun.misc.Unsafe.park
      5.8%  11.6% com.jad.generated.IncrementBench_incrementCustomAtomicWithWork_jmhTest.incrementCustomAtomicWithWork_thrpt_jmhStub
      0.8%   1.7% org.openjdk.jmh.infra.Blackhole.consumeCPU
      0.1%   0.1% com.jad.IncrementBench$Worker.work
      0.0%   0.0% java.lang.Thread.currentThread
      0.0%   0.0% com.jad.generated.IncrementBench_incrementCustomAtomicWithWork_jmhTest._jmh_tryInit_f_benchmarkparams1_0
      0.0%   0.0% org.openjdk.jmh.infra.generated.BenchmarkParams_jmhType_B1.<init>

    ....[Thread state: TIMED_WAITING]...................................................................
     49.9% 100.0% sun.misc.Unsafe.park
In the "good case":
    ....[Thread state distributions]....................................................................
     88.2%         TIMED_WAITING
     11.8%         RUNNABLE

    ....[Thread state: TIMED_WAITING]...................................................................
     88.2% 100.0% sun.misc.Unsafe.park

    ....[Thread state: RUNNABLE]........................................................................
      5.6%  47.9% sun.misc.Unsafe.park
      3.1%  26.3% org.openjdk.jmh.infra.Blackhole.consumeCPU
      2.4%  20.3% com.jad.generated.IncrementBench_incrementCustomAtomicWithWork_jmhTest.incrementCustomAtomicWithWork_thrpt_jmhStub
      0.6%   5.5% com.jad.IncrementBench$Worker.work
      0.0%   0.0% com.jad.generated.IncrementBench_incrementCustomAtomicWithWork_jmhTest.incrementCustomAtomicWithWork_Throughput
      0.0%   0.0% java.lang.Thread.currentThread
      0.0%   0.0% org.openjdk.jmh.infra.generated.BenchmarkParams_jmhType_B1.<init>
      0.0%   0.0% sun.misc.Unsafe.putObject
      0.0%   0.0% org.openjdk.jmh.runner.InfraControlL2.announceWarmdownReady
      0.0%   0.0% sun.misc.Unsafe.compareAndSwapInt
Link to test code
Link to graphical results. X - number of threads, Y - throughput, ops/us
Link to raw log
UPD
I understand that when I use parkNanos, a single thread can effectively hold the "lock" (the CAS) for long periods of time: threads whose CAS fails go to sleep, and only one thread does the work and increments the value. I see that at a high level of concurrency, when the work is this small, AtomicInteger is not the best approach. But if we increase workSize, for example to CASThrpt / threadNum, it should work fine. On my local machine I set workSize = 300; the results of my test:
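One way to check the "only one thread does the work" effect outside JMH is to count how many increments each thread wins. The following is a rough sketch under stated assumptions: BackoffFairnessProbe and measure are hypothetical names, AtomicInteger.compareAndSet replaces the Unsafe call, and there is no workSize delay between increments, so it exercises only the most contended case.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical probe: shows how successful increments are split across threads
// when every failed CAS is followed by parkNanos(1).
class BackoffFairnessProbe {

    // All threads increment one shared counter until it reaches target.
    // Returns the number of successful increments per thread.
    static int[] measure(int threads, final int target) {
        final AtomicInteger counter = new AtomicInteger();
        final int[] wins = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    int mine = 0;
                    for (;;) {
                        int current = counter.get();
                        if (current >= target) {
                            break;
                        }
                        if (counter.compareAndSet(current, current + 1)) {
                            mine++; // this thread won the race
                        } else {
                            LockSupport.parkNanos(1); // loser sleeps, winner keeps going
                        }
                    }
                    wins[id] = mine; // made visible to the caller by Thread.join()
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return wins;
    }

    public static void main(String[] args) {
        int[] wins = measure(4, 100000);
        for (int i = 0; i < wins.length; i++) {
            System.out.println("thread " + i + ": " + wins[i] + " increments");
        }
    }
}
```

If the distribution is heavily skewed toward one thread, that would be consistent with the explanation above; the actual split depends on the OS scheduler and timer resolution, so it can differ between runs.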
    Benchmark                                     (workSize)   Mode  Cnt  Score   Error   Units
    IncrementBench.incrementAtomicWithWork               300  thrpt    3  4.133 ± 0.516  ops/us
    IncrementBench.incrementCustomAtomicWithWork         300  thrpt    3  1.883 ± 0.234  ops/us
    IncrementBench.lockIntWithWork                       300  thrpt    3  3.831 ± 0.501  ops/us
    IncrementBench.onlyWithWork                          300  thrpt    3  4.339 ± 0.243  ops/us
AtomicInteger wins, the lock takes second place, and the custom implementation is third. But the cause of the spikes is still unclear. I also forgot to mention the Java version: Java(TM) SE Runtime Environment (build 1.7.0_79-b15), Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)