The __sync_xxx() family is modelled on some Intel primitives, and on x86 atomic load / store is pretty trivial, which I guess is why the set looks incomplete.
For an atomic store, I think you are stuck with __sync_val_compare_and_swap() , though, like using __sync_fetch_and_add() for a load, it is obviously overkill :-(
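To make the "overkill" load concrete, here is a minimal sketch, assuming a plain int shared between threads. The name a_load_sync is made up for illustration and is not part of gcc:

```c
/* Sketch only: a_load_sync is a hypothetical helper name. */

/* Adding 0 leaves the stored value unchanged, but the read-modify-write
   is atomic and acts as a full barrier, so the value returned is an
   atomic snapshot of *p_ai. On x86 this is typically a LOCK XADD,
   which is far heavier than a plain MOV. */
static int a_load_sync(int *p_ai)
{
    return __sync_fetch_and_add(p_ai, 0);
}
```

It works, but every load becomes a locked read-modify-write, so the cache line is taken exclusive even though nothing changes.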
There is a "full memory barrier", __sync_synchronize() , but I could not find out what it actually does (except by experiment, on x86_64)! If you know exactly which machines you are compiling for, you might want to try it and see ... starting with load and store wrapped in __sync_synchronize() .
I can tell you that for x86 and x86_64, atomic loads require nothing extraordinary. Atomic stores require mfence if you want memory_order_seq_cst , but not otherwise. HOWEVER ... the other thing missing from the __sync_xxx family is a compiler barrier ... unless that is what __sync_synchronize() does!
Added later ...
I recommend "C / C++11 Mappings to Processors" for a good description of how atomics can / should be implemented on x86 / x86_64, ARM, and PowerPC.
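To use __sync_val_compare_and_swap() as an atomic int store:
    void a_store(int* p_ai, int val)
    {
      int ai_was, ai_now ;

      ai_now = *p_ai ;              // initial guess at the current value
      do {
        ai_was = ai_now ;
        ai_now = __sync_val_compare_and_swap(p_ai, ai_was, val) ;
      } while (ai_now != ai_was) ;  // retry until the swap succeeds
    }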
On x86 / x86_64, for memory_order_seq_cst (SC) you need either LOCK XCHG or MOV followed by mfence ... so using LOCK CMPXCHG in a loop is a bit painful. For ARM it is painful too, only more so :-(
Hand-rolling atomic load / store is strictly for the brave (or foolhardy) ... and, depending on what __sync_synchronize() actually does on a given machine, it may or may not work!
So the trivial approach is:
    __sync_synchronize() ;
    v = v_atomic ;            // atomic load !
    __sync_synchronize() ;

    __sync_synchronize() ;
    v_atomic = v ;            // atomic store !
    __sync_synchronize() ;
Which compiles, for x86 / x86_64 (for me, with gcc 4.8 on x86_64), to:
    mfence
    mov xxx, xxx
    mfence
for both load and store. That is definitely safe (and SC) ... for load, it may or may not be better than LOCK XADD ... for store, it may be better than LOCK CMPXCHG and the loop around it!
If (and only if), for ARM, this compiles to:
    dmb
    ldr/str
    dmb
then it is safe (and SC).
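For what it's worth, the trivial approach can be wrapped up as a pair of helpers. This is only a sketch: load_sc and store_sc are names I made up, and the volatile cast is my own addition (an assumption, not something __sync_xxx requires) to make the compiler emit exactly one access:

```c
/* Sketch only: load_sc / store_sc are hypothetical helper names.
   Whether this is really sequentially consistent depends on what
   __sync_synchronize() emits on the target (mfence on x86, dmb on ARM). */

static int load_sc(int *p)
{
    int v;
    __sync_synchronize();        /* full barrier before the load */
    v = *(volatile int *)p;      /* assumption: volatile forces a single load */
    __sync_synchronize();        /* full barrier after the load */
    return v;
}

static void store_sc(int *p, int v)
{
    __sync_synchronize();        /* full barrier before the store */
    *(volatile int *)p = v;      /* assumption: volatile forces a single store */
    __sync_synchronize();        /* full barrier after the store */
}
```

Compiling with gcc -O2 -S and reading the assembly is the only way I know to confirm what the barriers turn into on a given machine.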
Now ... for x86 / x86_64 the processor does not need an mfence for a load at all, even for SC. But you do need to stop the compiler from reordering things. __sync_synchronize() does that and also plants an mfence . For gcc, you can construct a __sync_compiler() with the following voodoo:
#define __sync_compiler() __asm__ __volatile__("":::"memory")
I believe that __sync_synchronize() (for x86 / x86_64) is effectively:
#define __sync_mfence() __asm__ __volatile__("mfence":::"memory")
Since x86 / x86_64 is so well-behaved, you can:
    __sync_compiler() ;
    v = v_atomic ;            // atomic load -- memory_order_seq_cst
    __sync_compiler() ;

    __sync_compiler() ;
    v_atomic = v ;            // atomic store -- memory_order_seq_cst
    __sync_synchronize() ;
And ... if you can live with memory_order_release, then you can replace the one remaining __sync_synchronize() with __sync_compiler() !
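Pulling the x86 / x86_64 pieces together, here is a sketch of what the hand-rolled helpers might look like. The names (compiler_barrier, x86_load_sc and so on) are hypothetical, I have avoided the double-underscore prefix because such identifiers are reserved, and the volatile casts are again my own assumption. The code compiles with gcc on any target, but the ordering argument only holds for x86 / x86_64:

```c
/* Sketch only, valid for x86 / x86_64 with gcc. All names are hypothetical. */

/* Compiler-only barrier: emits no instruction, but stops gcc moving
   memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* SC load: the processor needs no fence, only the compiler must behave. */
static int x86_load_sc(int *p)
{
    int v;
    compiler_barrier();
    v = *(volatile int *)p;
    compiler_barrier();
    return v;
}

/* SC store: plain MOV followed by a full fence. */
static void x86_store_sc(int *p, int v)
{
    compiler_barrier();
    *(volatile int *)p = v;
    __sync_synchronize();        /* mfence on x86 / x86_64, as far as I can tell */
}

/* Release store: no fence instruction at all. */
static void x86_store_release(int *p, int v)
{
    compiler_barrier();
    *(volatile int *)p = v;
    compiler_barrier();
}
```

The only instruction cost left is the single fence in the SC store, which is the point of the exercise.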
Now, for ARMv7 ... if (and only if: I don't have an ARM, so I can't check it) __sync_synchronize() compiles to a dmb , then we can do a little better for load:
    __sync_compiler() ;
    v = v_atomic ;            // atomic load
    __sync_synchronize() ;
for all memory orders: memory_order_seq_cst and _acquire (and _consume).
And for memory_order_release we can:
    __sync_synchronize() ;
    v_atomic = v ;            // atomic store -- memory_order_release
For ARMv8, it seems that there are special instructions LDA and STL ... but I'm somewhat out of my depth.
NB: all of this follows the "C / C++11 Mappings to Processors", which I believe, but cannot put my hand on my heart and swear to.
In any case ... if you are prepared to hand-roll atomic load / store, then you can do better.
So ... if the speed of these things really matters, I would be tempted to roll my own, assuming a limited number of target architectures, and noting that:
you are already using gcc-specific stuff, so the __sync_compiler() trick does not introduce an additional portability problem.
the __sync_xxx family has been superseded by the more complete __atomic_xxx in gcc, so if you need to add another target architecture in future, you can upgrade to __atomic_xxx .
and in the not-too-distant future standard C11 atomics will be widely available, so portability problems can be dealt with then.
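For comparison, here is a sketch of the same load / store done with the __atomic_xxx builtins and with C11 <stdatomic.h>. The helper names are made up for illustration. I am assuming a gcc recent enough to have both (the __atomic builtins appeared in 4.7, and <stdatomic.h> in 4.9 if I remember rightly):

```c
#include <stdatomic.h>

/* Sketch only: helper names are hypothetical. */

/* gcc __atomic builtins: work on a plain int, memory order given explicitly. */
static int load_builtin(int *p)
{
    return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static void store_builtin(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
}

static void store_release_builtin(int *p, int v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

/* Standard C11: the object itself must be declared atomic_int. */
static int load_c11(atomic_int *p)
{
    return atomic_load_explicit(p, memory_order_seq_cst);
}

static void store_c11(atomic_int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_seq_cst);
}
```

With either of these the compiler picks the fences for the target, so the per-architecture hand-rolling goes away.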