CAS and synchronized performance

I have had this question for quite some time and have tried reading a lot of resources to understand what is going on, but I still cannot explain it.

Simply put, I'm trying to check how CAS performs vs. synchronized in contended and uncontended environments. I put together this JMH test:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
    @State(Scope.Benchmark)
    public class SandBox {

        Object lock = new Object();

        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder()
                    .include(SandBox.class.getSimpleName())
                    .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                    .shouldFailOnError(true)
                    .build();
            new Runner(opt).run();
        }

        @State(Scope.Thread)
        public static class Holder {
            private long number;
            private AtomicLong atomicLong;

            @Setup
            public void setUp() {
                number = ThreadLocalRandom.current().nextLong();
                atomicLong = new AtomicLong(number);
            }
        }

        @Fork(1)
        @Benchmark
        public long sync(Holder holder) {
            long n = holder.number;
            synchronized (lock) {
                n = n * 123;
            }
            return n;
        }

        @Fork(1)
        @Benchmark
        public AtomicLong cas(Holder holder) {
            AtomicLong al = holder.atomicLong;
            al.updateAndGet(x -> x * 123);
            return al;
        }

        private Object anotherLock = new Object();
        private long anotherNumber = ThreadLocalRandom.current().nextLong();
        private AtomicLong anotherAl = new AtomicLong(anotherNumber);

        @Fork(1)
        @Benchmark
        public long syncShared() {
            synchronized (anotherLock) {
                anotherNumber = anotherNumber * 123;
            }
            return anotherNumber;
        }

        @Fork(1)
        @Benchmark
        public AtomicLong casShared() {
            anotherAl.updateAndGet(x -> x * 123);
            return anotherAl;
        }

        @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
        @Benchmark
        public long syncSharedNonBiased() {
            synchronized (anotherLock) {
                anotherNumber = anotherNumber * 123;
            }
            return anotherNumber;
        }
    }

And the results:

    Benchmark                                           Mode  Cnt     Score      Error  Units
    spinLockVsSynchronized.SandBox.cas                  avgt    5   212.922 ±   18.011  ns/op
    spinLockVsSynchronized.SandBox.casShared            avgt    5  4106.764 ± 1233.108  ns/op
    spinLockVsSynchronized.SandBox.sync                 avgt    5  2869.664 ±  231.482  ns/op
    spinLockVsSynchronized.SandBox.syncShared           avgt    5  2414.177 ±   85.022  ns/op
    spinLockVsSynchronized.SandBox.syncSharedNonBiased  avgt    5  2696.102 ±  279.734  ns/op

In the uncontended case, CAS is much faster, which I would expect. But in the contended case, the opposite is true, and I cannot understand why. I don't think this is due to biased locking, since that only kicks in after a thread has held the lock for some time (5 seconds, AFAIK), which does not happen here, and the test is proof of that.

I honestly hope it's just my tests that are wrong, and that someone with JMH experience can come along and point out the incorrect setup here.

+7
java-8 atomic compare-and-swap jmh atomic-long
4 answers

The main misconception is the assumption that you are comparing "CAS vs. synchronized". Given how sophisticated JVM implementations of synchronized are, you are actually comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.

Similar to Lock, the internal information for an object's monitor basically consists of an int status telling whether it is owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take significant time.

However, in the uncontended case, the waiting queue is of course not involved. Acquiring the monitor consists of a single CAS to change the status from "unowned" (usually zero) to "owned, acquired once" (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release, which simply means writing the "unowned" state with the necessary memory visibility and waking up another blocked thread, if there is one.

Since the waiting queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts, before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is "unfair", allowing a spinning thread to proceed immediately, even if already enqueued threads have been waiting longer.
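
As a rough illustration of that strategy, here is a minimal, non-reentrant sketch of such a "spin before enqueuing" lock. The class name, spin limit, and queue choice are illustrative assumptions on my part; real monitor implementations are far more elaborate:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.LockSupport;

    // Simplified, non-reentrant sketch of "spin before enqueuing".
    final class SpinThenParkLock {
        private static final int SPIN_LIMIT = 64;                 // illustrative value
        private final AtomicInteger state = new AtomicInteger(0); // 0 = unowned, 1 = owned
        private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

        void lock() {
            // Spinning phase: a few CAS retries, hoping the owner releases soon.
            for (int i = 0; i < SPIN_LIMIT; i++) {
                if (state.compareAndSet(0, 1)) {
                    return;                                       // fast path succeeded
                }
            }
            // Slow path: enqueue and park until woken by the releasing thread.
            Thread current = Thread.currentThread();
            waiters.add(current);
            while (!state.compareAndSet(0, 1)) {
                LockSupport.park(this);
            }
            waiters.remove(current);
        }

        void unlock() {
            state.set(0);                                         // publish "unowned"
            Thread next = waiters.peek();
            if (next != null) {
                LockSupport.unpark(next);                         // wake one blocked thread
            }
        }
    }

Note that this sketch is just as "unfair" as described above: a freshly arriving spinner can grab the lock ahead of threads already parked in the queue.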

If you compare the fundamental operations performed by synchronized(lock){ n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123);, you'll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention, while the synchronized approach bears the risk of being put into the queue if no progress was made during spinning.

But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which may be relevant for a benchmark loop calling the syncShared method. Unless there is also a way to fuse multiple AtomicLong CAS updates, this can give synchronized a significant advantage. (See also this article covering several of the aspects discussed above.)
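
For illustration, this is roughly what lock coarsening amounts to conceptually; whether and when the JIT applies it depends on the JVM, and the second method shows the effective result of the transformation, not code you would write yourself. Field names are illustrative:

    // Conceptual sketch of lock coarsening (the JIT's effective transformation).
    class CoarseningSketch {
        private final Object lock = new Object();
        private long counter = 1L;

        void beforeCoarsening() {
            for (int i = 0; i < 1000; i++) {
                synchronized (lock) {     // acquire and release on every iteration
                    counter *= 123;
                }
            }
        }

        void afterCoarsening() {          // what the JIT may conceptually produce
            synchronized (lock) {         // one acquire/release around the whole loop
                for (int i = 0; i < 1000; i++) {
                    counter *= 123;
                }
            }
        }
    }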

Note that due to the "unfair" nature of synchronized, having far more threads than CPU cores doesn't have to be a problem. In the best case, "number of threads minus number of cores" threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core cannot slow down the AtomicLong update either, as they can neither invalidate the current value for other threads nor make a failed CAS attempt.

In either case, when CAS-ing on a member variable of an unshared object, or when synchronizing on an unshared object, the JVM may detect the thread-local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.
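
A minimal sketch of such a case, assuming a HotSpot-style JVM with escape analysis enabled (-XX:+DoEscapeAnalysis, the default in modern builds); whether the lock is actually elided is up to the JIT:

    long localLocking() {
        Object localLock = new Object();   // provably never escapes this method
        long n = 42;
        synchronized (localLock) {         // candidate for lock elision
            n *= 123;
        }
        return n;
    }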


The bottom line is that there is no easy answer to whether atomic updates or synchronized blocks are better. Things get much more interesting with more expensive operations, which raise the likelihood of threads getting enqueued in the contended case of synchronized, and which may make it acceptable that the operation has to be repeated in the contended case of an atomic update.

+13

You should read, re-read, and accept @Holger's excellent answer, as the insights it provides are far more valuable than one set of benchmark numbers from a single developer's workstation.

I tweaked your tests to make them a bit more apples-to-apples, but if you read @Holger's answer, you'll see why this is not a terribly useful test. I'm going to include my changes and my results simply to show how the results can vary from one machine (or one JRE version) to another.

Firstly, my version of the tests:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.TimeValue;

    @State(Scope.Benchmark)
    public class SandBox {

        public static void main(String[] args) throws RunnerException {
            new Runner(
                new OptionsBuilder()
                    .include(SandBox.class.getSimpleName())
                    .shouldFailOnError(true)
                    .mode(Mode.AverageTime)
                    .timeUnit(TimeUnit.NANOSECONDS)
                    .warmupIterations(5)
                    .warmupTime(TimeValue.seconds(5))
                    .measurementIterations(5)
                    .measurementTime(TimeValue.seconds(5))
                    .threads(-1)
                    .build()
            ).run();
        }

        private long number = 0xCAFEBABECAFED00DL;
        private final Object lock = new Object();
        private final AtomicLong atomicNumber = new AtomicLong(number);

        @Setup(Level.Iteration)
        public void setUp() {
            number = 0xCAFEBABECAFED00DL;
            atomicNumber.set(number);
        }

        @Fork(1)
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public long casShared() {
            return atomicNumber.updateAndGet(x -> x * 123L);
        }

        @Fork(1)
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public long syncShared() {
            synchronized (lock) {
                return number *= 123L;
            }
        }

        @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public long syncSharedNonBiased() {
            synchronized (lock) {
                return number *= 123L;
            }
        }
    }

And then my first batch of results:

    # VM version: JDK 1.8.0_60, VM 25.60-b23
    Benchmark                    Mode  Cnt     Score     Error  Units
    SandBox.casShared            avgt    5   976.215 ± 167.865  ns/op
    SandBox.syncShared           avgt    5  1820.554 ±  91.883  ns/op
    SandBox.syncSharedNonBiased  avgt    5  1996.305 ± 124.681  ns/op

Recall that you saw synchronized pulling ahead under heavy contention. On my workstation, the atomic version fares better. If you use my version of your tests, what results do you see? It won't surprise me at all if they differ significantly.

Here is another set of runs, this time on a recent Java 9 EA build:

    # VM version: JDK 9-ea, VM 9-ea+170
    Benchmark                    Mode  Cnt     Score    Error  Units
    SandBox.casShared            avgt    5   979.615 ± 135.495  ns/op
    SandBox.syncShared           avgt    5  1426.042 ±  52.971  ns/op
    SandBox.syncSharedNonBiased  avgt    5  1649.868 ±  48.410  ns/op

The difference is less dramatic here. It's not terribly surprising to see a difference across major JRE versions, but who's to say you won't see them across minor releases too?

At the end of the day, the results are close. Very close. The performance of synchronized has come a long way since early Java versions. If you are not writing HFT algorithms or something else incredibly latency sensitive, you should prefer the solution that is most easily proven correct. It is generally easier to reason about synchronized than about lock-free algorithms and data structures. If you cannot demonstrate a measurable difference in your application, then synchronized is what you should use.

+4

Note that CAS may give you more fine-grained ordering (non-)guarantees than a synchronized block, especially with Java 9 VarHandles, which provide ordering options aligned with the C++11 memory model.

If all you want to do is keep some statistics from several threads, then a read-compute-update loop with the most relaxed memory orderings available (plain read; plain and weak CAS) may perform better on weakly ordered platforms, since it doesn't need any barriers and the CAS won't have to do a wasteful inner loop if it is implemented on top of LL/SC. In addition, it gives the JIT more freedom to reorder instructions around those atomics. compareAndExchange can eliminate an additional read on loop repetition.
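
A sketch of such a read-compute-update loop, assuming Java 9+ VarHandles; the class and field names are illustrative:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Hypothetical statistics holder updated from several threads.
    class RelaxedCounter {
        private long value;

        private static final VarHandle VALUE;
        static {
            try {
                VALUE = MethodHandles.lookup()
                        .findVarHandle(RelaxedCounter.class, "value", long.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void add(long delta) {
            long prev = (long) VALUE.get(this);   // plain read: no barriers implied
            for (;;) {
                long next = prev + delta;
                // compareAndExchange returns the witnessed value, so a failed
                // attempt needs no extra re-read at the top of the loop. (The
                // fully relaxed CAS would be weakCompareAndSetPlain, which may
                // fail spuriously and returns only a boolean.)
                long witnessed = (long) VALUE.compareAndExchange(this, prev, next);
                if (witnessed == prev) {
                    return;
                }
                prev = witnessed;                 // retry with the value we saw
            }
        }
    }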

Another complication is how you measure performance. All of the implementations have progress guarantees, i.e. even under contention at least one thread can finish at a time. So in principle you could be wasting CPU cycles on several threads attempting to update your variable concurrently, yet still be better on the measure of 99th-percentile latency, because atomic operations won't resort to descheduling the thread, and worse on worst-case latency, because they are not fair. So just measuring averages might not tell the whole story here.
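
If you want percentiles instead of averages, JMH can report a latency distribution via Mode.SampleTime. A minimal sketch, reusing the atomicNumber field from the benchmark in the answer above (an assumption about where you'd put it):

    @BenchmarkMode(Mode.SampleTime)   // reports p0.50, p0.99, p0.9999, ... per iteration
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(1)
    @Benchmark
    public long casSharedSampled() {
        return atomicNumber.updateAndGet(x -> x * 123L);
    }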

+3

First of all, the code you write is Java, which produces Java bytecode that is translated into different atomic operations on different instruction sets (ARM vs. PowerPC vs. x86 ...), which in turn can behave differently between implementations of different vendors and even between architectures of the same vendor (e.g. Intel Core 2 Duo vs. Skylake). So it's really hard to answer your question in general!

This article states that for the tested x86 architectures, a single execution of any atomic operation performs similarly (very little difference between CAS, fetch-and-add, and swap), while CAS may fail and then has to be executed multiple times. With a single thread, however, it will never fail.

fooobar.com/questions/685488 / ... :

Each object has a monitor associated with it. The thread that executes monitorenter gains ownership of the monitor associated with objectref. If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked, then tries again to gain ownership. If the current thread already owns the monitor associated with objectref, it increments a counter in the monitor indicating the number of times this thread has entered the monitor. If the monitor associated with objectref is not owned by any thread, the current thread becomes the owner of the monitor, setting the entry count of this monitor to 1.
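
To make the quoted bookkeeping concrete, here is a toy model of the owner/entry-count scheme; this is an assumption-laden sketch (real monitors add wait queues, inflation, and thin/biased locking):

    import java.util.concurrent.atomic.AtomicReference;

    // Toy model of monitorenter/monitorexit bookkeeping.
    final class ToyMonitor {
        private final AtomicReference<Thread> owner = new AtomicReference<>();
        private int entryCount;   // only touched while owning the monitor

        void enter() {
            Thread current = Thread.currentThread();
            if (owner.get() == current) {
                entryCount++;                         // reentrant acquisition
                return;
            }
            while (!owner.compareAndSet(null, current)) {
                Thread.yield();                       // "waits until the object is unlocked"
            }
            entryCount = 1;                           // first acquisition
        }

        void exit() {
            if (--entryCount == 0) {
                owner.set(null);                      // release ownership
            }
        }
    }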

Let's look at the necessary operations in the case of CAS:

    public final int updateAndGet(IntUnaryOperator updateFunction) {
        int prev, next;
        do {
            prev = get();
            next = updateFunction.applyAsInt(prev);
        } while (!compareAndSet(prev, next));
        return next;
    }

Fetch x, multiply x, CAS x, check whether the CAS succeeded.

Now this is efficient in the uncontended case, since it needs a minimal number of operations. But if the cache line is contended, it is not very performant, because all threads spin actively while most of them fail. Furthermore, I remember that spinning on a contended cache line with an atomic operation is very expensive.

Now the important part of synchronized is:

If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked

Everything depends on how this waiting is implemented.

A synchronized implementation could put the thread to sleep for a random time after it failed to acquire the monitor. Also, instead of using an atomic operation to check whether the monitor is free, it can do so with a simple read (that's faster, but I couldn't find a link to prove it).
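
A sketch combining both ideas, a cheap plain read before the expensive atomic attempt (test-and-test-and-set) plus randomized, capped backoff; the constants are illustrative assumptions:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    // Test-and-test-and-set spin lock with randomized exponential backoff.
    final class BackoffSpinLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        void lock() {
            long backoffNanos = 1_000;                   // illustrative start value
            for (;;) {
                // Cheap read first, so failed waiters don't keep bouncing
                // the cache line with atomic writes.
                while (locked.get()) {
                    LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(backoffNanos));
                    backoffNanos = Math.min(backoffNanos * 2, 1_000_000); // capped backoff
                }
                if (locked.compareAndSet(false, true)) { // expensive atomic attempt
                    return;
                }
            }
        }

        void unlock() {
            locked.set(false);
        }
    }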

My bet is that the waiting in synchronized is implemented in a smart way and optimized for contended situations with one of the methods above or something similar, and that is why it is faster in the contended scenario.

The trade-off is that it is slower in uncontended situations.

Admittedly, I have no proof of this.

0
