The main misconception is the assumption that you are comparing "CAS vs. synchronized". Given how sophisticated JVMs implement synchronized, you are actually comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.
Similar to Lock, the internal information for an object's monitor basically consists of an int state telling whether it is owned and how often it has been acquired reentrantly, a reference to the current owner thread, and a queue of threads waiting to acquire it. The expensive aspect is the wait queue. Putting a thread into the queue, descheduling it, and eventually waking it up when the current owner releases the monitor are operations that can take significant time.
However, in the uncontended case, the wait queue is, of course, not involved. Acquiring the monitor consists of a single CAS to change the state from "unowned" (usually zero) to "owned, acquired once" (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release, which simply means writing back the "unowned" state with the necessary memory visibility and waking up a blocked thread, if there is one.
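The uncontended fast path can be sketched with a toy lock. This is a deliberately simplified illustration, not the real HotSpot monitor (which lives in the object header and also handles the owner thread, reentrancy, and inflation):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of the fast path: state 0 = "unowned",
// state 1 = "owned, acquired once". Real monitors additionally
// record the owner thread and the nesting depth; this sketch omits both.
class ToyMonitor {
    private final AtomicInteger state = new AtomicInteger(0);

    boolean tryAcquire() {
        // a single CAS from "unowned" to "owned, acquired once"
        return state.compareAndSet(0, 1);
    }

    void release() {
        // volatile write publishes "unowned" with the required visibility;
        // a real release would also wake up a queued thread, if any
        state.set(0);
    }
}
```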
Since the wait queue is the far more expensive thing, implementations usually try to avoid it even in the contended case by spinning for a while, making several CAS retries, before falling back to enqueueing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released while the waiter is still in the spinning phase. Note that synchronized is "unfair", allowing a spinning thread to proceed immediately, even if already enqueued threads have been waiting longer.
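The spin-before-enqueue idea can be sketched as follows. SPIN_LIMIT is a made-up tuning value for illustration; real JVMs use adaptive heuristics:

```java
import java.util.concurrent.atomic.AtomicInteger;

class SpinAcquire {
    // made-up bound; real implementations adapt their spin count at runtime
    static final int SPIN_LIMIT = 64;

    // Try to grab the lock word (0 = unowned, 1 = owned) by spinning.
    // Only if all retries fail would the caller pay for the wait queue.
    static boolean acquireWithSpin(AtomicInteger state) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (state.compareAndSet(0, 1)) {
                return true;           // acquired during the spin phase
            }
            Thread.onSpinWait();       // busy-wait hint to the CPU
        }
        return false; // caller would now enqueue and park the thread
    }
}
```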
If you compare the fundamental operations performed by synchronized(lock){ n = n * 123; } when no queuing is involved, and by al.updateAndGet(x -> x * 123);, you will notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication under contention, while for the synchronized approach there is a risk of being put into the queue if no progress was made during spinning.
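The reason the multiplication gets repeated is that al.updateAndGet(x -> x * 123) boils down to a read-compute-CAS retry loop, roughly like the following (class and method names here are just for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

class AtomicMultiply {
    // Roughly what al.updateAndGet(x -> x * 123) does internally: on a
    // failed CAS, the whole read + multiply is redone with the fresh value.
    static long mulUpdate(AtomicLong al) {
        long prev, next;
        do {
            prev = al.get();
            next = prev * 123;
        } while (!al.compareAndSet(prev, next)); // retry repeats the multiply
        return next;
    }
}
```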
But synchronized allows lock coarsening for repeated synchronization on the same object, which may apply to the benchmark loop calling the syncShared method. Unless there is likewise a way to merge several AtomicLong CAS updates, this can give synchronized a significant advantage. (See also this article, covering several aspects discussed above.)
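As an illustration of the kind of code that benefits (names made up): in a loop like the one below, the JIT may coarsen the repeated acquisitions of the same monitor into a single lock/unlock pair around many iterations, whereas a corresponding sequence of independent AtomicLong CAS updates cannot be merged that way:

```java
class CoarseningDemo {
    // Repeatedly synchronizes on the same object; the JIT may merge
    // ("coarsen") neighboring lock acquisitions into a single one
    // spanning several iterations. Semantics are unchanged either way.
    static long multiplyManyTimes(Object lock, long n, int times) {
        for (int i = 0; i < times; i++) {
            synchronized (lock) { // candidate for lock coarsening
                n = n * 123;
            }
        }
        return n;
    }
}
```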
Note that due to the "unfair" nature of synchronized, creating far more threads than CPU cores does not have to be a problem. In the best case, "number of threads minus number of cores" threads end up in the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread per core. But likewise, threads not running on a CPU core cannot slow down the AtomicLong update either, as they cannot invalidate the current value for other threads nor make a failed CAS attempt.
In either case, when CAS-ing on a member variable of an object that does not escape the thread, or when synchronizing on such a non-escaping object, the JVM can detect the purely local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.
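For example (hypothetical method; the small iteration count is chosen to keep the product within a long), an AtomicLong that never leaves the method is a candidate for such elision:

```java
import java.util.concurrent.atomic.AtomicLong;

class LocalAtomic {
    // The AtomicLong never escapes this method, so escape analysis can
    // prove the updates are thread-local and may elide the atomic
    // machinery, reducing each step to a plain multiplication.
    static long localOnly() {
        AtomicLong local = new AtomicLong(1);
        for (int i = 0; i < 5; i++) {
            local.updateAndGet(x -> x * 123);
        }
        return local.get(); // 123^5
    }
}
```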
The bottom line is that there is no simple answer when choosing between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which may raise the likelihood of threads getting enqueued in the contended synchronized case, which in turn may make it acceptable that the operation has to be repeated in the contended atomic-update case.