VIPT cache: connection between TLB and cache?

I just want to clarify the concept, and would like a sufficiently detailed answer that sheds light on how this actually works in hardware. Please provide any relevant details.

In the case of VIPT caches, the memory request is sent to both the TLB and the cache in parallel.

From the TLB, we get the translated physical address. From the cache indexing, we get a list of tags (i.e., from all the cache lines belonging to the set).

The translated address from the TLB is then matched against the list of tags to find a candidate.

  • My question is: where is this check done?
    • In the cache?
    • If not in the cache, where else?
  • If the check is performed in the cache, then
    • Is there a sideband connection from the TLB to the Cache module to get the translated physical address needed to compare with tag addresses?

Could someone please shed light on how this is "really" implemented in general, and on the connection between the cache module and the TLB (MMU) module?

I know this depends on the specific architecture and implementation. But what is an implementation you know of that uses a VIPT cache?

Thanks.


At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They are very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e., taking advantage of all the index bits being below the page offset, so they are translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)
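To make the "for free" part concrete, here is a minimal sketch in C (my own numbers, chosen to match a typical 32 KiB / 8-way / 64-byte-line L1d, not any one specific CPU): with 64 sets, the set index occupies bits [11:6] of the address, entirely inside the 12-bit offset of a 4k page, so the virtual and physical address agree on those bits.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Assumed (typical) L1d geometry: 32 KiB, 8-way, 64-byte lines. */
    enum {
        LINE_SIZE   = 64,
        WAYS        = 8,
        CACHE_SIZE  = 32 * 1024,
        SETS        = CACHE_SIZE / (WAYS * LINE_SIZE),  /* 64 */
        OFFSET_BITS = 6,    /* log2(LINE_SIZE): bits [5:0]  */
        INDEX_BITS  = 6,    /* log2(SETS):      bits [11:6] */
        PAGE_BITS   = 12,   /* 4 KiB pages                  */
    };

    int main(void) {
        /* The VIPT trick only works when index+offset fit inside the
           page offset, so the set index needs no translation. */
        assert(OFFSET_BITS + INDEX_BITS <= PAGE_BITS);

        uint64_t vaddr    = 0x00007f0012345678ULL;           /* example */
        unsigned line_off = vaddr & (LINE_SIZE - 1);         /* bits [5:0]  */
        unsigned set_idx  = (vaddr >> OFFSET_BITS) & (SETS - 1); /* [11:6] */

        /* Both values come from bits below the page offset, so the cache
           can start indexing before the TLB produces the translation. */
        printf("set %u, offset-in-line %u\n", set_idx, line_off);
        return 0;
    }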

The L1dTLB itself is a small/fast content-addressable memory with (for example) 64 entries, 4-way set associative (Intel Skylake). Hugepages are often handled with a second (and third) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages a 4-entry fully (4-way) associative array.

But for now, simplify your mental model and forget about hugepages. The L1dTLB is a single CAM, and checking it is a single lookup operation.
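As a software analogy for that single lookup, here is a sketch using the Skylake-like 64-entry 4-way figures from above (field names are mine; real hardware checks all the ways of the CAM simultaneously rather than in a loop):

    #include <stdbool.h>
    #include <stdint.h>

    enum {
        TLB_ENTRIES = 64,                     /* e.g. Skylake L1dTLB */
        TLB_WAYS    = 4,
        TLB_SETS    = TLB_ENTRIES / TLB_WAYS, /* 16 */
        PAGE_BITS   = 12,                     /* 4 KiB pages only    */
    };

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;   /* virtual page number   */
        uint64_t pfn;   /* physical frame number */
    };

    static struct tlb_entry l1dtlb[TLB_SETS][TLB_WAYS];

    /* One lookup: on a hit, write the physical frame number to *pfn.
       On a miss the load can't continue; the L2TLB / page walker
       takes over (not shown). */
    bool l1dtlb_lookup(uint64_t vaddr, uint64_t *pfn)
    {
        uint64_t vpn = vaddr >> PAGE_BITS;
        unsigned set = vpn & (TLB_SETS - 1);  /* low VPN bits pick the set */

        for (unsigned way = 0; way < TLB_WAYS; way++) { /* parallel in HW */
            if (l1dtlb[set][way].valid && l1dtlb[set][way].vpn == vpn) {
                *pfn = l1dtlb[set][way].pfn;
                return true;                  /* hit */
            }
        }
        return false;                         /* L1dTLB miss */
    }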

A "cache" consists of at least the following parts:

  • the SRAM array that stores the tags + data in sets
  • control logic to fetch a set of data + tags based on the index bits. (High-performance L1d caches typically fetch data for all the ways of the set in parallel with the tags, to reduce hit latency versus waiting until the right tag is selected, as you would with larger, more highly associative caches.)
  • comparators to check the tags against the translated address, and select the right data if one of them matches, or trigger miss handling. (And on a hit, update the LRU bits to mark that way as most recently used. See the sketch right after this list.)
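Here is a sketch of that comparator stage, continuing the 8-way toy geometry from above (again my own illustration with made-up field names; hardware runs one comparator per way in parallel and muxes out the matching way's data):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    enum { WAYS = 8, LINE_SIZE = 64 };

    struct cache_line {
        bool     valid;
        uint64_t tag;               /* high bits of the physical address */
        uint8_t  lru;               /* LRU state, grossly simplified     */
        uint8_t  data[LINE_SIZE];
    };

    /* 'set' holds the tags+data already fetched with the index bits;
       'ptag' is the tag portion of the translated physical address
       that just came out of the L1dTLB. */
    bool tag_check(struct cache_line set[WAYS], uint64_t ptag,
                   unsigned line_off, unsigned size, void *out)
    {
        for (unsigned way = 0; way < WAYS; way++) { /* one comparator per way */
            if (set[way].valid && set[way].tag == ptag) {
                /* Hit: way mux + byte select, then mark most recently used. */
                memcpy(out, &set[way].data[line_off], size);
                set[way].lru = 0;
                return true;
            }
        }
        return false;   /* miss: trigger miss handling */
    }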

The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:

  • The AGU generates an address from register(s) + offset.

    (Fun fact: the Sandybridge family shortcuts this process for simple addressing modes: [reg + 0-2047] has 1 cycle lower load-use latency than other addressing modes, apparently when reg and reg+disp land in the same 4k page. Perhaps it probes based on the base register in parallel with the add, and replays if the result turns out to cross into the next page, x vs. x + 4096, or something like that. A sketch of the same-page condition appears after this list.)

  • The index bits are used to fetch tags + data for all the ways of the set (they come entirely from the offset-within-page part of the address, so they need no translation).
  • The high bits of the address are looked up in the L1dTLB CAM array.
  • The tag comparator receives the translated physical-address tag and the tags fetched from that set.
  • If there's a match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line bits of the address).
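A schematic of the whole sequence (my own sketch; the two helpers stand in for the TLB-lookup and tag-check sketches above, stubbed out here so the fragment is self-contained). Note that with 6 index + 6 offset bits equal to the 12 page-offset bits, the physical tag is exactly the page frame number; the comment in step 1 shows the same-page condition behind the Sandybridge fast path mentioned above:

    #include <stdbool.h>
    #include <stdint.h>

    enum { OFFSET_BITS = 6, INDEX_BITS = 6,
           SETS = 1 << INDEX_BITS, PAGE_BITS = 12 };

    /* Trivial stand-ins (identity map, always hit) for the sketches above. */
    static bool l1dtlb_lookup(uint64_t vaddr, uint64_t *pfn)
    { *pfn = vaddr >> PAGE_BITS; return true; }
    static bool tag_check_set(unsigned set_idx, uint64_t ptag,
                              unsigned line_off, unsigned size, void *out)
    { (void)set_idx; (void)ptag; (void)line_off; (void)size; (void)out;
      return true; }

    /* One load, following the bullet list above. */
    bool vipt_load(uint64_t base, uint64_t disp, unsigned size, void *out)
    {
        /* 1. AGU: base + disp. (The Sandybridge fast path can speculate
              with 'base' alone when disp is small; that only holds if
              (base >> PAGE_BITS) == ((base + disp) >> PAGE_BITS),
              i.e. the add doesn't carry into the page number.) */
        uint64_t vaddr = base + disp;

        /* 2 + 3: independent of each other, so HW does them in parallel. */
        unsigned set_idx = (vaddr >> OFFSET_BITS) & (SETS - 1); /* no translation */
        uint64_t pfn;
        if (!l1dtlb_lookup(vaddr, &pfn))     /* CAM probe of the high bits */
            return false;  /* TLB miss: L2TLB, maybe page walk, then replay */

        /* 4 + 5: compare the translated tag; on a hit, select the way and
              extract the addressed bytes. Here tag == PFN because
              OFFSET_BITS + INDEX_BITS == PAGE_BITS. */
        unsigned line_off = vaddr & ((1u << OFFSET_BITS) - 1);
        return tag_check_set(set_idx, pfn, line_off, size, out);
    }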

If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure if or how CPUs manage to pipeline this so that other loads can keep executing while the TLB miss is resolved. That process involves checking the L2TLB (Skylake: unified, 1536-entry 12-way for 4k and 2M pages, 16-entry for 1G), and if that fails, a page walk.

I assume that a TLB miss results in the tag + data fetch being thrown away. They will be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.

At the simplest, it could just re-run the whole operation (including fetching the translation from the L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly, instead of putting it into the L1dTLB and getting it back out again.
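Putting that miss path into the same kind of sketch (helper names are mine and hypothetical; the "shortcut" is handing the L2TLB result straight to the replayed load rather than a round trip through the L1dTLB):

    #include <stdbool.h>
    #include <stdint.h>

    enum { PAGE_BITS = 12 };

    /* Trivial stand-ins (identity map) so the sketch is self-contained;
       the real things are hardware state machines, not functions. */
    static bool l1dtlb_lookup(uint64_t v, uint64_t *pfn)
    { (void)v; (void)pfn; return false; }          /* pretend L1dTLB miss */
    static bool l2tlb_lookup(uint64_t v, uint64_t *pfn)
    { *pfn = v >> PAGE_BITS; return true; }        /* pretend L2TLB hit   */
    static uint64_t page_walk(uint64_t v)
    { return v >> PAGE_BITS; }                     /* PTE loads go via L1D */
    static void l1dtlb_fill(uint64_t v, uint64_t pfn) { (void)v; (void)pfn; }

    uint64_t translate(uint64_t vaddr)
    {
        uint64_t pfn;
        if (l1dtlb_lookup(vaddr, &pfn))
            return pfn;                 /* common case: L1dTLB hit          */

        if (!l2tlb_lookup(vaddr, &pfn)) /* unified second-level TLB         */
            pfn = page_walk(vaddr);     /* last resort: walk the page tables */

        l1dtlb_fill(vaddr, pfn);        /* fill for next time, but hand the
                                           translation straight to the
                                           replayed load rather than reading
                                           it back out of the L1dTLB */
        return pfn;
    }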

Obviously this requires that the dTLB and L1D are designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses, to avoid a catch-22 / chicken-and-egg problem.)

Is there a sideband connection from the TLB to the cache?

I wouldn't call it a sideband connection. The L1D cache is the only thing that uses the L1dTLB. Similarly, the L1iTLB is used only by the L1I cache.

If there is a second-level TLB, it is usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.

Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 lookup, so physical addresses can be sent to the other caches.
