The Cortex-M3 was designed for low-latency, low-jitter multitasking: its interrupt controller cooperates with the core to guarantee the number of cycles between an interrupt being triggered and its handler running. ldrex/strex was implemented as a way to cooperate with all of that (by "all of that" I mean interrupt masking and other details, such as atomic bit setting via bit-band aliases), since a single-core, non-MMU, non-cached µC would otherwise have little use for it. Without it, a low-priority task would have to hold a lock, which could cause small priority inversions and therefore latency and jitter that a hard real-time system (which is what the part is designed for, although the concept is too broad) cannot cope with, at least not within the order of magnitude allowed by the "retry" semantics of a failed ldrex/strex.
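To make the bit-band aliases mentioned above a little more concrete, here is a minimal sketch of the address arithmetic for the SRAM bit-band region, assuming a part that implements bit-banding. The helper name `bitband_sram_alias` and the macro names are mine, not from any vendor header; the base addresses are the ones in the ARMv7-M memory map.

```c
#include <stdint.h>

/* SRAM bit-band region and its alias region in the ARMv7-M memory map. */
#define BITBAND_SRAM_BASE   0x20000000u
#define BITBAND_SRAM_ALIAS  0x22000000u

/*
 * Hypothetical helper: returns the alias word address that maps to bit `bit`
 * of the byte at `addr`. Each bit of the bit-band region gets one 32-bit
 * word in the alias region, hence the factors 32 (per byte) and 4 (per bit).
 */
static inline uint32_t bitband_sram_alias(uint32_t addr, unsigned bit)
{
    return BITBAND_SRAM_ALIAS + ((addr - BITBAND_SRAM_BASE) * 32u) + (bit * 4u);
}
```

On the target, writing 1 or 0 to the returned address (cast to `volatile uint32_t *`) sets or clears that single bit without a read-modify-write sequence in software, which is why it counts as one of the low-latency cooperative mechanisms.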
On a side note, and speaking strictly in terms of timing and jitter, the Cortex-M0 has a more traditional interrupt timing profile (that is, it will not abort an instruction in the core when an interrupt arrives), so it is subject to far more jitter and latency. In this respect (again, strictly timing), it is more comparable to older models (for example, the ARM7TDMI), which also lack atomic load/modify/store, atomic increments and decrements, and other low-latency cooperative instructions, and so need to disable and re-enable interrupts more often.
I use something like this on the Cortex-M3:
    #include <stdint.h>

    #define unlikely(x) __builtin_expect((long)(x), 0)

    /* Load-Linked: load *addr and mark the address for exclusive access. */
    static inline int atomic_LL(volatile void *addr)
    {
        int dest;

        __asm__ __volatile__("ldrex %0, [%1]" : "=r" (dest) : "r" (addr) : "memory");
        return dest;
    }

    /* Store-Conditional: returns 0 if the store happened, 1 if exclusivity was lost. */
    static inline int atomic_SC(volatile void *addr, int32_t value)
    {
        int dest;

        __asm__ __volatile__("strex %0, %2, [%1]" : "=&r" (dest) : "r" (addr), "r" (value) : "memory");
        return dest;
    }

    /**
     * atomic Compare And Swap
     * @param addr Address
     * @param expected Expected value in *addr
     * @param store Value to be stored, if (*addr == expected).
     * @return 0 ok, 1 failure.
     */
    static inline int atomic_CAS(volatile void *addr, int32_t expected, int32_t store)
    {
        int ret;

        do {
            /* Give up as soon as the current value is not the expected one. */
            if (unlikely(atomic_LL(addr) != expected))
                return 1;
            /* Retry only if the exclusive store was interrupted. */
        } while (unlikely((ret = atomic_SC(addr, store))));
        return ret;
    }
In other words, it turns ldrex/strex into the well-known Load-Linked and Store-Conditional, and on top of those it also implements compare-and-swap semantics.
If your code does fine with only compare-and-swap, you can implement it for the Cortex-M0 like this:
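    /* __interrupt_disable()/__interrupt_enable() stand for your toolchain's
     * interrupt masking intrinsics (with CMSIS: __disable_irq()/__enable_irq()). */
    static inline int atomic_CAS(volatile void *addr, int32_t expected, int32_t store)
    {
        volatile int32_t *p = addr;
        int ret = 1;

        __interrupt_disable();
        if (*p == expected) {
            *p = store;
            ret = 0;
        }
        __interrupt_enable();
        return ret;
    }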
This is the most used pattern because it is the only one some architectures have (x86 comes to mind). Emulating the CAS pattern with LL/SC seems ugly from where I stand, especially when the SC is more than a few instructions away from the LL. Although the practice is very common, ARM does not recommend it, particularly on the Cortex-M3: since any interrupt will make strex fail, if you take too long between ldrex and strex your code will spend a lot of time in a loop retrying strex. That is abusing the pattern, not using it.
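For anyone who wants to see what a CAS retry loop looks like from the caller's side, here is a small sketch that can be tried off-target. It assumes a C11 compiler with `<stdatomic.h>`; `cas_standin` is a hypothetical stand-in with the same 0 ok / 1 failure contract as `atomic_CAS`, and `atomic_add_via_cas` is likewise a name I made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for atomic_CAS (0 = stored, 1 = value did not match),
 * built on C11 atomics so the sketch compiles on a desktop toolchain.
 */
static inline int cas_standin(_Atomic int32_t *addr, int32_t expected, int32_t store)
{
    return atomic_compare_exchange_strong(addr, &expected, store) ? 0 : 1;
}

/* Hypothetical helper: atomically add `delta` to *addr, return the new value. */
static inline int32_t atomic_add_via_cas(_Atomic int32_t *addr, int32_t delta)
{
    int32_t old;

    do {
        old = atomic_load(addr);                    /* take a snapshot */
    } while (cas_standin(addr, old, old + delta));  /* retry if it changed meanwhile */
    return old + delta;
}
```

The work between the snapshot and the CAS is a single addition, which is the point: the shorter that window, the less often the loop has to go around again.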
As for your side question: on the Cortex-M3, strex returns its status in a register because the semantics had already been defined by the higher-end architectures (ldrex/strex exists in the multi-core ARMs implemented both before and after ARMv7-M was defined, where the cache controllers actually check ldrex/strex addresses, i.e. strex fails only when the cache controller cannot prove that the data line the address touches was unmodified).
If I were to speculate, I would say the original design has these semantics because in the early days this kind of atomic was implemented in libraries: you returned success/failure from functions written in assembler, which had to respect the ABI, and most ABIs (all the ones I know) use a register or the stack, not the flags, to return values. It may also be because compilers are better at register coloring than at coping with clobbered flags when some other instruction needs them. Consider a complex operation that generates flags, with an ldrex/strex sequence in the middle and a following operation that needs those flags: the compiler would have to move the flags to a register anyway.