Overhead of a memory barrier / fence

I am currently writing C++ code and use many memory barriers / fences in my code. I know that an MB tells the compiler and the hardware not to reorder writes/reads across it, but I do not know how costly this operation is for the processor at runtime.

My question is: what is the overhead of such a barrier? I didn't find a useful answer with Google... Is the overhead negligible? Or can heavy use of MBs lead to serious performance issues?

Sincerely.

2 answers

Try to think about what the instruction actually does. It doesn't make the CPU do anything complicated in terms of logic, but it forces it to wait until all outstanding reads and writes have been committed to main memory. So the cost really depends on the cost of accessing main memory (and on the number of outstanding reads/writes).

Accessing main memory is generally pretty expensive (10-200 clock cycles), but that work would, in a sense, have to be done even without the barrier; it could just be hidden by executing other instructions in the meantime, so it wouldn't cost you as much.

It also limits the CPU's (and the compiler's) ability to reschedule instructions, so there may be an indirect cost as well: nearby instructions can no longer be interleaved, which might otherwise have yielded a more efficient execution schedule.
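
To make that concrete, here is a minimal C++11 sketch (my own illustration, not code from the question) of the usual message-passing pattern. The two fences are the only "barrier cost" here; what they buy is that the plain accesses around them cannot be reordered across them:

  #include <atomic>

  int payload = 0;                       // plain, non-atomic data
  std::atomic<bool> ready{false};

  void producer()
  {
      payload = 42;                                        // ordinary write
      std::atomic_thread_fence(std::memory_order_release); // writes above can't sink below
      ready.store(true, std::memory_order_relaxed);
  }

  void consumer()
  {
      while (!ready.load(std::memory_order_relaxed)) {}    // spin until flagged
      std::atomic_thread_fence(std::memory_order_acquire); // reads below can't rise above
      int v = payload;                                     // guaranteed to observe 42
      (void)v;
  }

As a rough data point, on x86 the acquire and release fences are free at runtime (they only constrain the compiler), while a full sequentially consistent fence compiles to an mfence or a locked instruction.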

Compared with arithmetic and "normal" instructions, I understand these to be very costly, but I don't have numbers to back up that statement. I like jalf's answer, which describes the effects of the instructions, and would like to add a bit.

In general, there are a few different types of barriers, so understanding the differences is helpful. A barrier like the one above is required, for example, in a mutex implementation before clearing the lock word (lwsync on ppc, or st4.rel on ia64, for example). All reads and writes must be complete, and only instructions already in the pipeline that have no memory accesses and no dependencies on in-progress memory operations can be executed.
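
For instance, a bare-bones spinlock along these lines (a sketch I'm making up here, not anyone's production mutex) puts that release barrier on the store that clears the lock word, and the acquire barrier of the next paragraph on the exchange that takes it:

  #include <atomic>

  struct SpinLock
  {
      std::atomic<int> word{0};

      void lock()
      {
          // acquire barrier on the winning exchange (see the next paragraph)
          while (word.exchange(1, std::memory_order_acquire) != 0) {}
      }

      void unlock()
      {
          // release barrier + store: roughly lwsync; st on ppc, st4.rel on ia64
          word.store(0, std::memory_order_release);
      }
  };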

Another type of barrier is the one you would use in a mutex implementation when acquiring a lock (examples: isync on ppc, or instr.acq on ia64). This has an effect on future instructions, so if a non-dependent load has been prefetched, it must be discarded. Example:

  if (pSharedMem->atomic.bit_is_set()) // use a bit to flag that somethingElse is "ready"
  {
      foo(pSharedMem->somethingElse);
  }

Without an acquire barrier (borrowing ia64 lingo), your program may have unexpected results if somethingElse made it into a register before the check of the flag bit completed.
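
In C++11 terms, the same pattern could be written with an acquire load supplying that barrier (SharedMem, foo and the member names are just the example's hypothetical names):

  #include <atomic>

  struct SharedMem
  {
      std::atomic<bool> flag{false};   // the "ready" bit
      int somethingElse = 0;
  };

  void foo(int) { /* consume the value */ }

  void reader(SharedMem* pSharedMem)
  {
      // the acquire load plays the isync / .acq role: the read of
      // somethingElse cannot be satisfied from a value loaded before
      // the flag check
      if (pSharedMem->flag.load(std::memory_order_acquire))
      {
          foo(pSharedMem->somethingElse);
      }
  }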

There is a third type of barrier, generally used less often, that is required to enforce store-load ordering. Examples of instructions that enforce such ordering are: sync on ppc (heavyweight sync), mf on ia64, membar #StoreLoad on sparc (needed even for TSO).

Using ia64-like pseudocode to illustrate, suppose one had

  st4.rel
  ld4.acq

without an mf in between them, there is no guarantee that the load follows the store. You know that loads and stores preceding the st4.rel complete before that store and the "subsequent" load, but that load or other future loads (and perhaps stores, if they are non-dependent?) could sneak in earlier, since nothing prevents it otherwise.
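
The standard way to see that hazard (again a sketch with made-up names) is the Dekker-style pattern below: with only acquire/release, each thread's load may be reordered before its own store and both threads can read zero; the full fence, playing the mf / sync / membar #StoreLoad role, rules that out:

  #include <atomic>

  std::atomic<int> x{0}, y{0};
  int r1, r2;

  void thread1()
  {
      x.store(1, std::memory_order_release);
      std::atomic_thread_fence(std::memory_order_seq_cst); // the "mf"
      r1 = y.load(std::memory_order_acquire);
  }

  void thread2()
  {
      y.store(1, std::memory_order_release);
      std::atomic_thread_fence(std::memory_order_seq_cst); // the "mf"
      r2 = x.load(std::memory_order_acquire);
  }

  // without the full fences, r1 == 0 && r2 == 0 is a possible outcome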

Since mutex implementations most likely use only acquire and release barriers internally, I'd expect an observable effect of this to be that memory accesses following a lock release can sometimes actually occur while "still in the critical section".
