Short answer:
In an out-of-order processor, the load-store queue is used to track and enforce memory ordering restrictions. Processors such as the Alpha 21264 have the necessary hardware to prevent reordering of dependent loads, but making that ordering visible to other processors can add overhead to inter-processor communication.
Long answer:
Dependency Tracking Background
This is probably best explained with an example. Imagine you had the following sequence of instructions (pseudo-code instructions used for simplicity):
    ST  R1, A      // store value in register R1 to memory at address A
    LD  B, R2      // load value from memory at address B into register R2
    ADD R2, 1, R2  // add immediate value 1 to R2 and put the result in R2
In this example, there is a dependency between the LD and ADD instructions. ADD reads the value of R2 and therefore cannot execute until LD makes that value available. This dependency is carried through a register, and it is the sort of thing the processor's issue logic can track.
However, there may also be a dependency between ST and LD if addresses A and B are the same. But unlike the dependency between LD and ADD, the possible dependency between ST and LD is not known at the time the instruction is issued (begins executing).
Instead of trying to detect memory dependencies at issue time, the processor tracks them using a structure called the load queue. What this structure does is keep track of the addresses of pending loads and stores for instructions that have been issued but not yet retired. If a memory ordering violation occurs, it can be detected and execution restarted from the point where the violation happened.
So, returning to the pseudo-code example, imagine a situation where LD executes before ST (perhaps the value needed for R1 wasn't ready for some reason). But when ST executes, it sees that addresses A and B are the same. So LD should really have read the value produced by ST, not the stale value that was already in the cache. As a result, LD needs to be re-executed, along with any instructions that came after LD. There are various optimizations to reduce some of this overhead, but the basic idea holds.
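As a rough illustration of the idea (a toy sketch with invented names, not any real processor's design), a load queue that detects this kind of violation could be modeled like this:

```python
# Toy sketch of memory-order violation detection via a load queue.
# Only the idea from the text above; not Alpha's actual implementation.

class LoadQueue:
    def __init__(self):
        self.executed_loads = []  # (seq, addr) of loads that already executed

    def execute_load(self, seq, addr, memory):
        self.executed_loads.append((seq, addr))
        return memory.get(addr, 0)

    def execute_store(self, seq, addr, value, memory):
        memory[addr] = value
        # Any younger load (higher seq) to the same address that already
        # executed has read a stale value and must be replayed.
        return [s for (s, a) in self.executed_loads if s > seq and a == addr]

memory = {"A": 0}
lq = LoadQueue()

# Program order: ST (seq 1) to address A, then LD (seq 2) from address B.
# Suppose the load executes first, out of order, and A == B.
stale = lq.execute_load(2, "A", memory)          # reads 0: the stale value
replays = lq.execute_store(1, "A", 42, memory)   # detects the violation
replayed = lq.execute_load(2, "A", memory)       # replayed load reads 42

print(stale, replays, replayed)  # 0 [2] 42
```

The key point is that the check can only happen when the store executes and its address finally becomes known.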
As I mentioned earlier, the logic to detect this dependency exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).
Memory Ordering Rules
However, memory ordering rules constrain more than just the order in which a processor sees the results of its own memory operations. Instead, they constrain the relative order in which memory operations performed on one processor become visible to the other processors.
Alpha example
In the case of dependent load reordering, the processor must track this information for its own use, but the Alpha ISA does not require other processors to see this ordering. One example of how this can happen is the following (which I've quoted from this link):
    Initially: p = &x, x = 1, y = 0

    Thread 1         Thread 2
    --------------------------------
    y = 1            |
    memoryBarrier    |    i = *p
    p = &y           |
    --------------------------------
    Can result in: i = 0
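The mechanism behind this outcome (described in detail below) can be modeled with a small event sketch. This is a toy model with invented names, not the real 21264 hardware: the invalidate for y sits in P2's incoming probe queue while the read reply for p bypasses it.

```python
# Toy model of the Alpha 21264 anomaly: the invalidate for y is queued
# (and acknowledged) at P2, but the read reply for p bypasses the queue.

class CPU:
    def __init__(self, cache):
        self.cache = dict(cache)  # locally cached lines
        self.probe_queue = []     # pending incoming invalidates

    def receive_invalidate(self, addr):
        # Acknowledged immediately; only queued, not yet applied.
        self.probe_queue.append(addr)

    def read(self, addr, memory):
        # The read reply may bypass the probe queue, so a queued
        # invalidate does not prevent a stale cache hit.
        return self.cache[addr] if addr in self.cache else memory[addr]

memory = {"x": 1, "y": 0, "p": "x"}   # initially p = &x, x = 1, y = 0
p2 = CPU(cache={"y": 0})              # P2 caches y with value 0

# P1: y = 1 (invalidate sent to P2, acked at once); MB; p = &y.
memory["y"] = 1
p2.receive_invalidate("y")
memory["p"] = "y"

# P2: i = *p, with no MB between reading p and dereferencing it.
target = p2.read("p", memory)   # "y" -- bypasses the queued invalidate
i = p2.read(target, memory)     # stale cached 0, not the new 1
print(i)  # 0
```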
The anomalous behavior is currently possible only on a 21264-based system. And, obviously, you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.
Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y = 1, which causes an "invalidate y" to be sent to P2. This invalidate goes into P2's incoming probe queue; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without P2 doing an MB. The invalidate is acknowledged right away at this point (i.e., you do not wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for the read of p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now P2 can dereference p and read the old value of y that is sitting in its cache (the invalidate for y is still sitting in P2's probe queue).
How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in it) at every MB. Hence, after the read of p, you do an MB which pulls in the invalidate to y for sure. And you can no longer see the old cached value of y.
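Continuing the toy model from above (again with invented names, only the idea from the text): an MB on P2 between reading p and dereferencing it drains the probe queue, so the stale cached copy of y is discarded before the dependent load.

```python
# Toy model: an MB on P2 drains the incoming probe queue before the
# dependent load, so the stale cached y is discarded first.

cache = {"y": 0}          # P2's stale cached copy of y
probe_queue = []          # P2's pending incoming invalidates
memory = {"x": 1, "y": 0, "p": "x"}

# P1: y = 1 (invalidate queued and acknowledged at P2); MB; p = &y.
memory["y"] = 1
probe_queue.append("y")
memory["p"] = "y"

# P2: read p (the reply may bypass the probe queue).
target = memory["p"]      # "y"

# P2: MB -- drain the probe queue, applying pending invalidates.
for addr in probe_queue:
    cache.pop(addr, None)
probe_queue.clear()

# P2: the dependent load now misses the cache and fetches the new value.
i = cache[target] if target in cache else memory[target]
print(i)  # 1
```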
Even though this scenario is theoretically possible, the chances of actually hitting the problem because of it are extremely small. The reason is that even if you set up the caching properly, P2 will likely have ample opportunity to service the messages (i.e., the invalidate) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the invalidate to y, then it is possible that the reply for p comes back and bypasses that invalidate. It would be difficult for you to set up the scenario, though, let alone actually observe the anomaly.
The above shows how current Alphas can violate the ordering shown above. Future Alphas could violate it through other optimizations. One interesting optimization is value prediction.
Summary
The basic hardware needed to enforce the ordering of dependent loads is already present in all out-of-order processors. But ensuring that this memory ordering is seen by all processors adds extra constraints to the handling of cache-line invalidations, and it may add extra constraints in other scenarios as well. In practice, it seems likely that the potential advantages of the weak Alpha memory model to hardware designers were not worth the cost in software complexity and the added overhead of requiring more memory barriers.