Emulating LDREX/STREX (load/store exclusive) on Cortex-M0

Question


The Cortex-M3 instruction set includes a family of LDREX/STREX instructions: if a location is read using an LDREX instruction, a following STREX instruction can write to that address only if the address is known to have been untouched. Typically, the effect is that a STREX will succeed if no interrupts ("exceptions" in ARM parlance) have occurred since the LDREX, but fail otherwise.

What is the most practical way to simulate this behavior on the Cortex-M0? I would like to write C code for the M3 and have it portable to the M0. On the M3, you can say something like:

 __inline void do_inc(unsigned int *dat)
 {
   while (__strex(__ldrex(dat) + 1, dat)) {}
 }

to perform an atomic increment. The only ways I can think of to achieve similar functionality on the Cortex-M0 would be to either:

1. Have "ldrex" disable exceptions, and have "strex" and "clrex" re-enable them, with the requirement that every "ldrex" be followed soon afterwards by either a "strex" or a "clrex".
2. Have "ldrex", "strex" and "clrex" be very small routines in RAM, with one instruction being patched to either "str r1, [r2]" or "mov r0, #1". Have the "ldrex" routine plug the "str" instruction into the "strex" routine, and have the "clrex" routine plug "mov r0, #1" there. Have every exception that might invalidate an "ldrex" sequence call "clrex".

Depending on how the ldrex/strex functions are used, disabling interrupts might work reasonably well, but it seems icky to change the semantics of "load-exclusive" so that it causes bad side effects if the sequence is abandoned. The code-patching idea seems like it would achieve the desired semantics, but it seems clunky.
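To make the intended semantics concrete, here is a minimal sketch that swaps the self-modifying code for a plain flag. It is not the patching scheme described above, only an approximation of the same behavior: the load arms a software "monitor", every exception handler is assumed to call the clear routine on entry, and the store succeeds only if the monitor is still armed. The names `emu_ldrex`, `emu_strex`, `emu_clrex`, `irq_disable` and `irq_enable` are hypothetical, and the two interrupt helpers are host-side stand-ins for what would be `cpsid i` / `cpsie i` on a real M0.

```c
#include <stdint.h>

/* Host-side stand-ins (assumed names). On a Cortex-M0 these would be the
   "cpsid i" and "cpsie i" instructions. */
static volatile int irq_masked;
static void irq_disable(void) { irq_masked = 1; }
static void irq_enable(void)  { irq_masked = 0; }

/* Set by emu_ldrex; cleared by emu_clrex and by a successful emu_strex. */
static volatile int monitor_armed;

static uint32_t emu_ldrex(volatile uint32_t *addr)
{
    /* Arm before reading, so an exception that lands between the two
       statements still disarms the monitor and makes the store fail. */
    monitor_armed = 1;
    return *addr;
}

/* Assumption: every exception handler calls this on entry. */
static void emu_clrex(void)
{
    monitor_armed = 0;
}

/* Returns 0 if the store happened, 1 if the reservation was lost. */
static int emu_strex(uint32_t value, volatile uint32_t *addr)
{
    int failed = 1;

    irq_disable();              /* make the check and the store indivisible */
    if (monitor_armed) {
        *addr = value;
        monitor_armed = 0;
        failed = 0;
    }
    irq_enable();
    return failed;
}
```

With this shape, interrupts are masked only for the few instructions inside `emu_strex`, and an abandoned `emu_ldrex` has no side effect beyond leaving the flag set. Note that `emu_strex` re-enables interrupts unconditionally, so it would need a save/restore variant if it could be called with interrupts already masked.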

(BTW, side question: I wonder why STREX on the M3 stores the success/failure indication in a register rather than simply setting a flag? As it stands, it requires four extra bits in the opcode, requires that a register be available to hold the success/failure indication, and requires a "cmp r0, #0" to determine whether it succeeded. Was it expected that compilers would not be able to handle a STREX intrinsic sensibly if they did not get the result in a register? Getting a flag into a register takes two short instructions.)


assembly arm atomic cortex-m

supercat Apr 21 '11 at 2:59 p.m.


3 answers

domen · Answer 1 · 2011-04-25T10:22:24+0000

~~Well ... you still have SWP left, but this is a less powerful atomic instruction.~~

Disabling interrupts must work. :-)

Edit:

No SWP on the -m0, sorry supercat.

OK, it looks like you are left with disabling interrupts. You can use this gcc-compilable inline asm as a guide for disabling and restoring them correctly: http://repo.or.cz/w/cbaos.git/blob/HEAD:/arch/arm-cortex-m0/include/lock.h
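The "restoring it correctly" part is the subtle bit: a critical section should put the interrupt mask back the way it found it, not blindly re-enable. The sketch below is not taken from the linked file; it is a host-testable illustration in which a plain variable stands in for the PRIMASK register, and `irq_save`, `irq_restore` and `atomic_add` are hypothetical names. On a Cortex-M0 the two helpers would roughly correspond to `mrs r0, primask` followed by `cpsid i`, and `msr primask, r0`.

```c
#include <stdint.h>

/* Simulated PRIMASK bit (assumption, for host testing only):
   1 = interrupts masked, 0 = interrupts enabled. */
static volatile uint32_t primask;

/* Mask interrupts and return the previous mask state. */
static uint32_t irq_save(void)
{
    uint32_t old = primask;
    primask = 1;
    return old;
}

/* Put the mask back exactly as the caller found it. */
static void irq_restore(uint32_t old)
{
    primask = old;
}

/* Read-modify-write that cannot be interrupted on a single core. */
static void atomic_add(volatile uint32_t *p, uint32_t n)
{
    uint32_t flags = irq_save();
    *p += n;
    irq_restore(flags);
}
```

Because the old state is restored rather than forced to "enabled", `atomic_add` stays safe when called from code that already has interrupts masked.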

Alexandre Pereira Nunes · Answer 2 · 2015-02-05T22:53:54+0000

The Cortex-M3 was designed for heavy multitasking with low latency and low jitter, i.e. its interrupt controller cooperates with the core to keep guarantees on the number of cycles from the moment an interrupt is triggered to the moment it is handled. ldrex/strex was implemented as a way to cooperate with all of that (by "all of that" I mean interrupt masking and other details, such as atomic bit setting via bit-band aliases), since a single-core, non-MMU, non-cached μC would otherwise have little use for it. If it had not been implemented, a low-priority task would have to hold a lock, and that could generate small priority inversions, producing latency and jitter that a hard real-time system (which is what it is designed for, although the concept is too broad) cannot cope with, at least not within the order of magnitude allowed by the "retry" semantics of a failed ldrex/strex.

On a side note, and speaking strictly in terms of timing and jitter, the Cortex-M0 has a more traditional interrupt timing profile (that is, it will not abort core instructions when an interrupt arrives), so it is subject to far more jitter and latency. In this respect (again, strictly timing), it is more comparable to older models (for example, the ARM7TDMI), which also lack atomic load/modify/store, atomic increments and decrements, and other low-latency cooperative instructions, and so require disabling and enabling interrupts more often.

I use something like this on the Cortex-M3:

 #include <stdint.h>

 #define unlikely(x) __builtin_expect((long)(x), 0)

 /* Load-Linked: exclusive load of the word at addr. */
 static inline int atomic_LL(volatile void *addr)
 {
     int dest;

     __asm__ __volatile__("ldrex %0, [%1]" : "=r" (dest) : "r" (addr));
     return dest;
 }

 /* Store-Conditional: returns 0 if the store happened, 1 if it failed. */
 static inline int atomic_SC(volatile void *addr, int32_t value)
 {
     int dest;

     __asm__ __volatile__("strex %0, %2, [%1]"
                          : "=&r" (dest) : "r" (addr), "r" (value) : "memory");
     return dest;
 }

 /**
  * atomic Compare And Swap
  * @param addr Address
  * @param expected Expected value in *addr
  * @param store Value to be stored, if (*addr == expected).
  * @return 0 ok, 1 failure.
  */
 static inline int atomic_CAS(volatile void *addr, int32_t expected, int32_t store)
 {
     int ret;

     do {
         if (unlikely(atomic_LL(addr) != expected))
             return 1;
     } while (unlikely((ret = atomic_SC(addr, store))));
     return ret;
 }

In other words, it turns ldrex/strex into the well-known Load-Linked and Store-Conditional, and on top of that it implements Compare-and-Swap semantics.

If your code works just fine with compare-and-swap, you can implement it for the Cortex-M0 as follows:

 static inline int atomic_CAS(volatile void *addr, int32_t expected, int32_t store)
 {
     volatile int32_t *p = addr;
     int ret = 1;

     __interrupt_disable();
     if (*p == expected) {
         *p = store;
         ret = 0;
     }
     __interrupt_enable();
     return ret;
 }
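As a usage illustration, the atomic increment from the question can be built on top of a CAS with this signature. The sketch below is self-contained so it can be tried on a host: `interrupt_disable` and `interrupt_enable` are hypothetical stand-ins for the interrupt-masking helpers above (on a real M0 they would be `cpsid i` / `cpsie i`), and `do_inc` is an assumed name borrowed from the question.

```c
#include <stdint.h>

/* Host-side stand-ins (assumed names) for the real interrupt masking. */
static volatile int irq_masked;
static void interrupt_disable(void) { irq_masked = 1; }
static void interrupt_enable(void)  { irq_masked = 0; }

/* Compare-and-swap by masking interrupts: returns 0 on success, 1 on failure. */
static int atomic_CAS(volatile void *addr, int32_t expected, int32_t store)
{
    volatile int32_t *p = addr;
    int ret = 1;

    interrupt_disable();
    if (*p == expected) {
        *p = store;
        ret = 0;
    }
    interrupt_enable();
    return ret;
}

/* Atomic increment: re-read and retry until no one changed *dat in between. */
static void do_inc(volatile int32_t *dat)
{
    int32_t old;

    do {
        old = *dat;
    } while (atomic_CAS(dat, old, old + 1));
}
```

The same `do_inc` body would work unchanged against the ldrex/strex-based `atomic_CAS` on the M3, which is the portability the question asks for.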

This is the most used pattern because it is the only one some architectures had (x86 comes to mind). Implementing an emulation of the LL/SC pattern on top of CAS seems ugly from where I stand, especially when the SC is more than a few instructions away from the LL. Although that is very common, ARM does not recommend it, particularly in the Cortex-M3 case: since any interrupt will make strex fail, if you start taking too long between ldrex and strex, your code will spend a lot of time in a loop retrying the strex. That is abusing the pattern, not using it.

As for your side question, in the Cortex-M3 case strex returns its result in a register because the semantics had already been defined by higher-level architectures (ldrex/strex exist in multi-core ARMs that were implemented before and after ARMv7-M, where cache controllers actually check the ldrex/strex addresses, i.e. strex fails only when the cache controller cannot prove that the data line the address touches is unmodified).

If I were to speculate, I would say the original design has these semantics because in the early days this kind of atomic was implemented in libraries: you returned success/failure from functions written in assembler, and these had to respect the ABI, and most ABIs (all that I know of) use a register or the stack, not the flags, to return values. It may also be because compilers are better at register coloring than at avoiding flag clobbering when some other instruction uses the flags. Consider a complex operation that generates flags, with an ldrex/strex sequence in the middle and a following operation that needs those flags: the compiler would have to move the flags to a register anyway.

old_timer · Answer 3 · 2014-03-14T17:12:38+0000

STREX/LDREX are designed for multi-core processors accessing items in memory that are shared across the cores. ARM did an unusually poor job of documenting this; you have to read between the lines in the AMBA/AXI, ARM, and TRM docs to figure it out.

How it works: IF you have a core that supports STREX/LDREX, and IF you have a memory controller that supports exclusive access, then IF the memory controller sees the pair of exclusive operations with no other core accessing that memory in between, it returns EX_OKAY rather than OKAY. The ARM docs tell chip designers that for a uniprocessor (one not implementing the multi-core feature) you do not need to support EXOKAY and can just return OKAY. From a software point of view, that breaks the LDREX/STREX pair for accesses that reach that logic (the software spins in an infinite loop, since it will never see success). The L1 cache does support it, though, so it feels like it is working.

For uniprocessor devices, and for cases where you are not accessing memory shared across cores, use SWP.

The -m0 supports neither ldrex/strex nor swp, but what are those basically getting you? They simply get you an access that is not affected by another access of your own. To keep from stomping on yourself, just disable interrupts for the duration, the way we have done atomic accesses since the dark ages. If you want protection from both yourself and a peripheral, and you have a peripheral that can interfere, well, there is no way around that, and even a swp might not have helped.

So just disable interrupts around the critical section.
