AMD Opteron Cache Coherence Protocol (MOESI?)

Let me start with an example.

Let's say we have a 4-socket system where each socket has 4 cores and 2 GB of RAM, organized as ccNUMA (cache-coherent non-uniform memory access) memory.

Let's say that 4 processes run on each socket, and all of them have a shared memory region allocated in P2's RAM, denoted SHM. This means that any load from or store to this region will cause a lookup in P2's directory, right? If so, then: when that lookup happens, is it equivalent to a RAM access in terms of latency? Where is this directory physically located? (See below.)

With a more specific example: let's say P2 executes a LOAD on SHM, and the data is brought into P2's L3 cache with the tag '(O)wner'. Also, let's say P4 does a LOAD on the same SHM. This will force P4 to do a lookup in P2's directory, and since the data is tagged as Owned by P2, my question is:

Does P4 get SHM from P2's RAM, or does it ALWAYS get the data from P2's L3 cache?

If it always gets the data from the L3 cache, wouldn't it be faster to get the data directly from P2's RAM, given that it already has to do a lookup in P2's directory? My understanding is that the directory literally sits on top of the RAM.

Sorry if I grossly misunderstand what is going on here, but I hope someone can help clarify this.

Also, is there any data on how fast such a directory lookup is? In terms of data retrieval, is there documentation on the average latencies of such lookups? How many cycles does an L3 read hit, a read miss, or a directory lookup take, and so on?

1 answer

It depends on whether the Opteron processor implements the HT Assist mechanism.

If it does not, then there is no directory. In your example, when P4 issues the load, the memory request goes to P2's memory controller. P2 responds with the cache line and also sends a probe message to the other two sockets. Finally, those two sockets respond to P4 with an ACK stating that they do not have a copy of the cache line.
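The message sequence in the no-directory case can be sketched as a small toy trace. This only mirrors the simplified steps in the paragraph above; it is not a model of real HyperTransport packets, and the function name `broadcast_load` and the message labels are made up for illustration.

```python
def broadcast_load(requester, home, sockets):
    """Toy trace of a load when there is no directory (no HT Assist).

    All names and message labels are illustrative assumptions,
    not actual HyperTransport message types.
    """
    messages = []
    # 1. The requester sends the read request to the home memory controller.
    messages.append((requester, home, "read request"))
    # 2. The home node answers with the cache line.
    messages.append((home, requester, "cache line"))
    # 3. With no directory, the home node probes every other socket.
    others = [s for s in sockets if s not in (requester, home)]
    for s in others:
        messages.append((home, s, "probe"))
    # 4. Each probed socket tells the requester that it holds no copy.
    for s in others:
        messages.append((s, requester, "ACK (no copy)"))
    return messages


trace = broadcast_load("P4", "P2", ["P1", "P2", "P3", "P4"])
for src, dst, kind in trace:
    print(f"{src} -> {dst}: {kind}")
```

In this sketch the number of probes and ACKs grows with the number of sockets, even when none of them caches the line.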

If HT Assist is enabled (typically on sockets with 6 or more cores), then each L3 cache contains a snoop filter (directory) that records which cores hold a line. Thus, in your example, no probe messages are sent to the other two sockets, because the lookup in the HT Assist directory shows that no one else has a copy of the line (this is a simplification, since the state of the line would be Exclusive instead of Owned, and no directory lookup would be needed).
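The idea of the snoop filter can be illustrated with a toy directory that remembers which sockets cache each line. This is a hypothetical sketch: the class `ProbeFilter`, its methods, and the address `0x1000` are assumptions for illustration and say nothing about how HT Assist is actually laid out in the L3.

```python
class ProbeFilter:
    """Toy directory: maps a cache-line address to the sockets caching it.

    Hypothetical model for illustration only; it does not reflect
    the real HT Assist data structure.
    """

    def __init__(self):
        self.sharers = {}

    def sockets_to_probe(self, address, requester):
        # Only sockets recorded as holding the line need a probe.
        return sorted(self.sharers.get(address, set()) - {requester})

    def record(self, address, socket):
        # Remember that this socket now caches the line.
        self.sharers.setdefault(address, set()).add(socket)


directory = ProbeFilter()
first = directory.sockets_to_probe(0x1000, "P4")   # nobody caches the line yet
directory.record(0x1000, "P4")
second = directory.sockets_to_probe(0x1000, "P1")  # now only P4 needs a probe
print(first, second)
```

The first lookup returns an empty list, which corresponds to the case in the answer where no probes have to be sent at all.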
