Reordering of dependent loads in the CPU

I read Memory Barriers: a Hardware View for Software Hackers, a very popular article by Paul E. McKenney.

One of the things the article emphasizes is that very weakly ordered processors such as the Alpha can reorder dependent loads, which seems to be a side effect of its partitioned cache.

Paper fragment:

 1 struct el *insert(long key, long data)
 2 {
 3     struct el *p;
 4     p = kmalloc(sizeof(*p), GFP_ATOMIC);
 5     spin_lock(&mutex);
 6     p->next = head.next;
 7     p->key = key;
 8     p->data = data;
 9     smp_wmb();
10     head.next = p;
11     spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }
  • There are two processors, CPU0 and CPU1.
  • Each processor has two cache banks: CB0 (odd addresses) and CB1 (even addresses).
  • head is in CB0 and p is in CB1.
  • insert() has a write barrier that ensures the invalidations for lines 6-8 go out on the bus first, followed by the invalidation for line 10.
  • However, the processor executing search() may have CB0 lightly loaded and CB1 heavily loaded.
  • This means the processor sees the latest value of head, but the old values of p's fields (since the invalidation request for p has not yet been processed by CB1). A sketch of the software fix for this case follows below.
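
For context, the Linux kernel's historical answer to exactly this case was smp_read_barrier_depends(), a barrier that is a no-op on every architecture except Alpha (newer kernels fold it into READ_ONCE()). A minimal sketch of the corrected reader, assuming the same list layout as the fragment above:

 struct el *search(long key)
 {
     struct el *p;

     p = head.next;
     while (p != &head) {
         /* Order the load of p before the loads through p
          * (p->key, p->next); expands to a barrier only on Alpha. */
         smp_read_barrier_depends();
         if (p->key == key)
             return p;
         p = p->next;
     }
     return NULL;
 }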

Question: It seems that all architectures except Alpha respect the ordering of dependent loads. For example, IA64 can reorder the following, with the exception of dependent loads:

  • Loads reordered after loads
  • Loads reordered after stores
  • Stores reordered after stores
  • Stores reordered after loads
  • Atomic instructions reordered with loads
  • Atomic instructions reordered with stores

This makes me wonder what kind of hardware support is required to prevent reordering of dependent loads.

One possible answer is that all the other architectures (e.g., IA64) do not have a partitioned cache, therefore never run into this problem, and so need no explicit hardware support.

Any ideas?

1 answer

Short answer:

In an out-of-order processor, the load-store queue is used to track and enforce memory-ordering constraints. Processors such as the Alpha 21264 had the hardware necessary to prevent dependent-load reordering, but enforcing that ordering could add overhead to inter-processor communication.

Long answer:

Dependency Tracking Background

This is probably best explained with an example. Imagine you had the following sequence of instructions (pseudo-code instructions used for simplicity):

 ST  R1, A      // store value in register R1 to memory at address A
 LD  B, R2      // load value from memory at address B to register R2
 ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2

In this example, there is a dependency between the LD and ADD instructions. ADD reads the value of R2 and therefore cannot execute until LD makes that value available. This dependency is through a register, and it is something the processor's issue logic can track.

However, there could also be a dependency between ST and LD if addresses A and B were the same. But unlike the dependency between LD and ADD, the possible dependency between ST and LD is not known at the time the instructions are issued (begin execution).

So memory dependencies have to be detected at runtime, and the processor does this using a structure called the load-store queue. It keeps track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If a memory-ordering violation occurs, this can be detected and execution can be restarted from the point where the violation occurred.

So, returning to the pseudo-code example, imagine a situation where LD executes before ST (perhaps the value needed for R1 was not ready for some reason). When ST executes, it sees that addresses A and B are the same. So the LD should really have read the value produced by ST, rather than the stale value that was already in the cache. As a result, LD needs to be re-executed, along with any instructions that came after it. There are various optimizations to reduce some of this overhead, but the basic idea holds.
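
To make that concrete, here is a toy model of the check (my own sketch, not real hardware; the names lq_entry and store_resolved and the fixed-size queue are invented for illustration). When a store resolves its address, the load queue is scanned for younger loads to the same address that have already executed; any hit forces a replay:

 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>

 #define LQ_SIZE 32

 /* One entry per in-flight load (issued but not yet retired). */
 struct lq_entry {
     uint64_t addr;     /* resolved load address */
     unsigned age;      /* program order; smaller = older; ages start at 1 */
     bool     executed; /* has the load already read its value? */
     bool     valid;
 };

 static struct lq_entry load_queue[LQ_SIZE];

 /*
  * Called when a store resolves its address. Returns the age of the
  * oldest younger load that already executed against the same address,
  * i.e. the point from which the pipeline must replay, or 0 if no
  * memory-ordering violation occurred.
  */
 unsigned store_resolved(uint64_t store_addr, unsigned store_age)
 {
     unsigned replay_from = 0;

     for (size_t i = 0; i < LQ_SIZE; i++) {
         struct lq_entry *e = &load_queue[i];

         if (!e->valid || !e->executed)
             continue;
         if (e->addr != store_addr)
             continue;
         if (e->age <= store_age)   /* load is older: no violation */
             continue;
         /* A younger load read stale data: schedule a replay. */
         if (replay_from == 0 || e->age < replay_from)
             replay_from = e->age;
     }
     return replay_from;
 }

In the pseudo-code example above, if LD executed early and ST later resolved to the same address, this check would fire and LD (along with the dependent ADD) would be replayed with the value from ST.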

As I mentioned earlier, the detection logic for this dependency exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).

Memory Ordering Rules

However, memory ordering rules do not merely constrain the order in which a processor sees the results of its own memory operations. Instead, they constrain the relative order in which memory operations performed on one processor become visible to the other processors.

Alpha Example

In the case of dependent-load ordering, the processor has to track this information for its own use, but the Alpha ISA does not require other processors to see this ordering. One example of how this can happen is the following (which I quoted from this link):

 Initially: p = &x, x = 1, y = 0

     Thread 1         Thread 2
 --------------------------------
 y = 1            |
 memoryBarrier    |    i = *p
 p = &y           |
 --------------------------------
 Can result in: i = 0

The anomalous behavior is currently possible only on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y = 1, which causes an "invalidate y" to be sent to P2. This invalidate goes into P2's incoming probe queue; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without P2 doing an MB. The invalidate is acknowledged right away at this point (i.e., you do not wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for the read of p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly, without needing to wait for previous incoming probes to be serviced). Now P2 can dereference p and read the old value of y that is sitting in its cache (the inval of y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in it) at every MB. Hence, after the read of p, you do an MB, which pulls in the inval of y for sure. And you can no longer see the old cached value of y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you set up the caching properly, P2 will likely have ample opportunity to service the messages (i.e., the inval) in its probe queue before it receives the data reply for the read of p. Nonetheless, if you get into a situation where many things have been placed in P2's probe queue ahead of the inval of y, then it is possible that the reply for p comes back and bypasses this inval. It would be difficult for you to set up the scenario, though, and actually observe the anomaly.

The above addresses how current Alphas can violate what you have shown. Future Alphas could violate it because of other optimizations. One interesting optimization is value prediction.
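
For readers who prefer code to prose, the quoted example can be sketched with C11 atomics (my own rendering, not a verified litmus harness; formally the relaxed version below even has a data race on y in the C11 model, so it is meant only to mirror the hardware behavior). With a relaxed load of p, the i == 0 outcome is exactly what Alpha permits; upgrading that load to memory_order_acquire (the moral equivalent of the MB on P2) forbids it:

 #include <stdatomic.h>
 #include <threads.h>

 static int x = 1, y = 0;
 static _Atomic(int *) p = &x;

 /* Plays the role of Thread 1: y = 1; memoryBarrier; p = &y */
 static int thread1(void *arg)
 {
     (void)arg;
     y = 1;
     atomic_store_explicit(&p, &y, memory_order_release);
     return 0;
 }

 /* Plays the role of Thread 2: i = *p */
 static int thread2(void *arg)
 {
     (void)arg;
     /* relaxed permits i == 0 on Alpha-like hardware;
      * memory_order_acquire here would rule it out */
     int *q = atomic_load_explicit(&p, memory_order_relaxed);
     return *q; /* may observe 0 even when q == &y */
 }

 int main(void)
 {
     thrd_t t1, t2;
     int i;

     thrd_create(&t1, thread1, NULL);
     thrd_create(&t2, thread2, NULL);
     thrd_join(t1, NULL);
     thrd_join(t2, &i); /* a single run; a real litmus test loops */
     return 0;
 }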

Summary

The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is respected by all processors adds extra constraints on the handling of cache-line invalidations, and it can add extra constraints in other scenarios as well. In practice, however, it seems likely that the potential hardware-design advantages of Alpha's weak memory model were not worth the cost in software complexity and the added overhead of requiring more memory barriers.

