Ah, the joys of weakly-ordered memory ...
The first snippet is your basic atomic read-modify-write: if someone else touches whatever address `x1` points to, the store-exclusive will fail and it will retry until it succeeds. So far, so good. However, this only applies to the address (or, more correctly, the region) covered by the exclusive monitor, so whilst it is good for atomicity, it is ineffective for synchronising anything other than that one value.
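As a rough C-level sketch of the same idea (this is my own illustration in C11 atomics, not the original assembly), the retry-until-success shape looks like a compare-exchange loop, and, crucially, it protects only the counter itself:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Rough C11 analogue of an LDXR/STXR retry loop: keep re-reading the
 * value and attempting the store until nobody else has touched it in
 * between. Only this one value is protected - nothing else. */
static int32_t atomic_add32(_Atomic int32_t *p, int32_t amount)
{
    int32_t old = atomic_load_explicit(p, memory_order_relaxed);
    int32_t new_val;
    do {
        new_val = old + amount;
        /* compare_exchange_weak may fail spuriously, much like a
         * store-exclusive, hence the loop; on failure it reloads the
         * current value into 'old'. */
    } while (!atomic_compare_exchange_weak_explicit(
                 p, &old, new_val,
                 memory_order_relaxed, memory_order_relaxed));
    return new_val;
}
```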
Consider the case where CPU1 is waiting for CPU0 to write some data to a buffer. CPU1 sits there waiting on some sort of synchronisation object (say, a semaphore), for CPU0 to update it and signal that new data is ready.
1. CPU0 writes to the data address.
2. CPU0 increments the semaphore (atomically or otherwise), which happens to live elsewhere in memory.
3. ???
4. CPU1 sees the new semaphore value.
5. CPU1 reads the data, which may be the old data, the new data, or some mixture of the two.
Now, what happened in step 3? Maybe it all happened in order. Quite possibly the hardware decided that, since there was no address dependency, it could let the semaphore store go ahead of the data store. Maybe the semaphore store hit in the cache whilst the data did not. Maybe it happened for complicated reasons only the hardware folks understand. Either way, it is perfectly possible for CPU1 to see the semaphore update before the new data has reached memory, and thus read back incorrect data.
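Here is a minimal sketch of that broken scenario in C11 terms (the variable and function names are mine, chosen to match the numbered steps above); with only relaxed ordering, nothing stops either side from being reordered:

```c
#include <stdatomic.h>

int data;                   /* plain buffer written by CPU0 */
_Atomic int semaphore = 0;  /* signal that new data is ready */

/* CPU0: no barrier between the data store and the semaphore store, so
 * the hardware is free to make the semaphore update visible first
 * (the mystery of step 3 above). */
void cpu0_produce(void)
{
    data = 42;                                                       /* step 1 */
    atomic_fetch_add_explicit(&semaphore, 1, memory_order_relaxed);  /* step 2 */
}

/* CPU1: likewise, the data load may be speculated ahead of the
 * semaphore load, so even a well-behaved producer doesn't save us. */
int cpu1_consume(void)
{
    while (atomic_load_explicit(&semaphore, memory_order_relaxed) == 0)
        ;                                                            /* step 4 */
    return data;                  /* step 5: old data, new data, or a mixture */
}
```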
To sort this out, CPU0 needs a barrier between steps 1 and 2, to ensure the data has definitely been written before the semaphore is written. Having the atomic write act as a barrier is a nice, easy way to do this. However, since barriers are fairly performance-degrading, you also want a lightweight, barrier-free version for situations where you don't need such full synchronisation.
Now, the even less intuitive part is that CPU1 can also reorder its loads. Again, since there is no address dependency, it is free to speculate the data load ahead of the semaphore load regardless of CPU0's barrier. Thus CPU1 also needs its own barrier between steps 4 and 5.
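Continuing the sketch above (again my own illustration, not anyone's real implementation), the fix is a fence on each side: a release fence between steps 1 and 2 on the producer, and an acquire fence between steps 4 and 5 on the consumer, which on ARM compile down to barrier instructions of one flavour or another:

```c
#include <stdatomic.h>

int data;
_Atomic int semaphore = 0;

/* CPU0: the release fence guarantees the data store is visible before
 * the semaphore store. */
void cpu0_produce_fixed(void)
{
    data = 42;                                                       /* step 1 */
    atomic_thread_fence(memory_order_release);                       /* barrier */
    atomic_fetch_add_explicit(&semaphore, 1, memory_order_relaxed);  /* step 2 */
}

/* CPU1: the acquire fence stops the data load from being speculated
 * ahead of the semaphore load. */
int cpu1_consume_fixed(void)
{
    while (atomic_load_explicit(&semaphore, memory_order_relaxed) == 0)
        ;                                                            /* step 4 */
    atomic_thread_fence(memory_order_acquire);                       /* barrier */
    return data;                                                     /* step 5 */
}
```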
For the more authoritative, but rather heavy-going, version, have a read of ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be confusing ;)
As an aside, in this case the architectural semantics of acquire/release complicate things even further. Since they are only one-way barriers, whilst OSAtomicAdd32Barrier adds up to a full barrier with respect to the code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from the Linux folks for a more detailed explanation. Of course, that's from the theoretical architecture point of view; in reality, it's not inconceivable that the A7 hardware has taken the 'easy' option of implementing LDAXR as simply DMB+LDXR, and so on, which means they can get away with it, since they're at liberty to code to their own implementation rather than the spec.
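To make the one-way nature a bit more concrete, here is a rough C11 analogy (my own sketch, not how OSAtomicAdd32Barrier is actually implemented): acquire/release attached to the atomic op itself versus explicit full fences around a relaxed op, which is roughly the DMB + plain exclusives + DMB shape:

```c
#include <stdatomic.h>

/* Acquire/release on the RMW itself: one-way barriers. Later accesses
 * can't be hoisted above the acquire (load) half, and earlier accesses
 * can't sink below the release (store) half - but earlier accesses may
 * still drift past the load, and later ones before the store, so the
 * op gives no full ordering relative to itself. */
int add_acq_rel(_Atomic int *p)
{
    return atomic_fetch_add_explicit(p, 1, memory_order_acq_rel);
}

/* Full fences on both sides of a relaxed RMW: full ordering for the
 * surrounding code, and the atomic op itself cannot escape the pair
 * of barriers. */
int add_full_barriers(_Atomic int *p)
{
    atomic_thread_fence(memory_order_seq_cst);
    int old = atomic_fetch_add_explicit(p, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return old;
}
```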