ARM64: LDXR / STXR vs LDAXR / STLXR

iOS has two similar functions: OSAtomicAdd32 and OSAtomicAdd32Barrier . I am wondering when you actually need the Barrier variant.

Disassembled, they are:

    _OSAtomicAdd32:
        ldxr    w8, [x1]
        add     w8, w8, w0
        stxr    w9, w8, [x1]
        cbnz    w9, _OSAtomicAdd32
        mov     x0, x8
        ret     lr

    _OSAtomicAdd32Barrier:
        ldaxr   w8, [x1]
        add     w8, w8, w0
        stlxr   w9, w8, [x1]
        cbnz    w9, _OSAtomicAdd32Barrier
        mov     x0, x8
        ret     lr

In what scenarios do you need the load-acquire / store-release semantics of the latter? Can LDXR / STXR instructions be reordered? If they can, is it possible to "lose" an atomic update in the absence of a barrier? From what I've read, it seems this can't happen, and if that's true, why do you need the Barrier variant? Perhaps only if you would otherwise also need a DMB for other purposes?

Thanks!

+6
3 answers

OSAtomicAdd32Barrier() exists for people who use OSAtomicAdd32() for something other than a simple atomic increment. Specifically, for people implementing their own synchronization primitives (for example, their own mutex library) on top of OSAtomicAdd32(). OSAtomicAdd32Barrier() uses heavyweight barrier instructions to guarantee memory ordering on either side of the atomic operation. This is undesirable in normal usage.

Summarizing:

1) If you just want to increment an integer in a thread-safe way, use OSAtomicAdd32()

2) If you are stuck with a bunch of old code that foolishly assumes OSAtomicAdd32() can double as a memory-ordering and speculation barrier, replace it with OSAtomicAdd32Barrier()

+7

Ah, the mind-bending joys of weakly ordered memory...

The first snippet is your basic atomic read-modify-write: if someone else touches the address in x1 , the store-exclusive will fail, and it will retry until it succeeds. So far, so good. However, this only applies to the address (or, more correctly, the region) covered by the exclusive monitor, so while it is good for atomicity, it is useless for synchronizing anything other than that one value.

Consider the case where CPU1 expects CPU0 to write some data into a buffer. CPU1 sits there polling some synchronization object (say, a semaphore), waiting for CPU0 to update it to signal that new data is ready.

  • CPU0 writes to the data address.
  • CPU0 increments the semaphore (atomically, as above), which resides somewhere else in memory.
  • ???
  • CPU1 sees the new semaphore value.
  • CPU1 goes to read the data, which may be the old data, the new data, or some mix of the two.

Now, what happened in step 3? Perhaps it all happened in order. Perhaps the hardware decided that, since there was no address dependency, the semaphore store could be issued ahead of the data store. Perhaps the semaphore store hit in the cache while the data store did not. Perhaps it happened for complicated reasons only the hardware folks understand. Either way, it is entirely possible for CPU1 to see the semaphore update before the new data has reached memory, and thus to read back incorrect data.

To fix this, CPU0 needs a barrier between steps 1 and 2, to ensure that the data has definitely been written before the semaphore is written. Having the atomic write itself act as a barrier is a nice, easy way to do that. However, since barriers hurt performance, you also want the lightweight barrier-free version for situations where you don't need such full synchronization.

Now, the even less intuitive part is that CPU1 can also reorder its loads. Again, since there is no address dependency, it is free to speculatively load the data before loading the semaphore, regardless of CPU0's barrier. Thus CPU1 also needs a barrier of its own, between steps 4 and 5.

For a more authoritative but rather heavyweight treatment, read ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be brain-twisting ;)

As an aside, the architectural details of acquire/release semantics complicate matters even further. Since they are only one-way barriers, while OSAtomicAdd32Barrier implies a full barrier relative to the code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from the Linux kernel folks for a more detailed explanation. That is the theoretical, architectural view, of course; in practice it's not inconceivable that the A7 hardware took the "simple" option of implementing LDAXR as effectively DMB+LDXR , and so on, meaning it could get away with this, since it is free to code against its own implementation rather than the specification.

+11

I would surmise that this is just a way of reproducing architecture-independent semantics for this operation.

Using ldaxr / stlxr , the sequence above guarantees proper ordering if AtomicAdd32 is used as a synchronization mechanism (mutex/semaphore), regardless of whether the surrounding operation is a higher-level acquire or a release.

So this is not about ensuring consistency of the atomic add itself, but about enforcing ordering between acquiring/releasing a mutex and any operations performed on the resource protected by that mutex.

It is less efficient than the ldaxr / stxr or ldxr / stlxr pairings you would use in a purpose-built synchronization mechanism, but if you have existing platform-independent code that expects an atomic add with barrier semantics, this is probably the best way to implement it.

+3
