Does accessing one member of a struct pull the entire struct into the cache?

I read Ulrich Drepper's "What Every Programmer Should Know About Memory", and in section 3.3.2 Measuring Cache Effects (about halfway down the page) it seems to say that accessing any member of a structure causes the entire structure to be pulled into the CPU cache.

Is that right? If so, how does the hardware know the layout of these structures? Or does the code generated by the compiler somehow load the whole structure?

Or are the slowdowns with larger structures mainly due to TLB misses caused by the structures being spread across more pages of memory?

The example structure Drepper uses:

struct l { struct l *n; long int pad[NPAD]; }; 

Here sizeof(struct l) is controlled by NPAD: values of 0, 7, 15, and 31 give structures with 0, 56, 120, and 248 bytes of padding (on a 64-bit machine), and the experiment assumes 64-byte cache lines and 4 KiB pages.
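
A quick way to check those sizes (a minimal sketch; it assumes a 64-bit machine where pointers and long are 8 bytes, so the padding figures match Drepper's):

 #include <stdio.h>

 #define NPAD 7 /* try 7, 15 or 31; for NPAD = 0 drop the pad member,
                   since a zero-length array is a GCC extension */

 struct l {
     struct l *n;
     long int pad[NPAD];
 };

 int main(void) {
     /* With 8-byte pointers and longs, NPAD = 0, 7, 15, 31 gives
        sizeof(struct l) = 8, 64, 128, 256 bytes, i.e. 0, 56, 120
        and 248 bytes of padding after the pointer. */
     printf("sizeof(struct l) = %zu\n", sizeof(struct l));
     return 0;
 }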

Simply iterating through the linked list gets much slower as the structure grows, even though nothing other than the pointer is ever touched.
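
The loop in question is pure pointer chasing; something like this minimal sketch:

 /* Walk the list: only the 'n' pointer of each node is ever read,
    yet each visited node still drags at least one cache line in. */
 static void walk(struct l *head) {
     for (struct l *p = head; p != NULL; p = p->n)
         ; /* nothing else is touched */
 }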

+7
Tags: c, linux, caching, memory
6 answers

The hardware is not aware of the structure at all. But it is true that the hardware loads into the cache a few bytes around the bytes you actually access. This is because cache lines have a fixed size: the cache does not operate on individual bytes, but on chunks of, for example, 16 bytes at a time.

You should order the members of a structure carefully so that frequently used members sit close to each other. For example, given the following structure:

 struct S { int foo; char name[64]; int bar; }; 

If the foo and bar members are used very often, the hardware caches the bytes around foo, and when you access bar it has to load the bytes around bar as well, even if the bytes around foo and bar are never otherwise used. Now rewrite the structure as follows:

 struct S { int foo; int bar; char name[64]; }; 

When you use foo, the hardware caches the bytes around foo. When you then use bar, it is already in the cache, because bar lies within the bytes around foo, and the CPU does not have to wait for it to be fetched.
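
You can check the two layouts with offsetof (a minimal sketch; the names S_cold and S_hot are just for illustration, and the exact offsets assume a typical ABI with 4-byte int and no extra padding):

 #include <stdio.h>
 #include <stddef.h>

 struct S_cold { int foo; char name[64]; int bar; }; /* bar at offset 68 */
 struct S_hot  { int foo; int bar; char name[64]; }; /* bar at offset 4  */

 int main(void) {
     printf("cold: foo at %zu, bar at %zu\n",
            offsetof(struct S_cold, foo), offsetof(struct S_cold, bar));
     printf("hot:  foo at %zu, bar at %zu\n",
            offsetof(struct S_hot, foo), offsetof(struct S_hot, bar));
     /* With 64-byte cache lines, foo and bar in S_cold can land on
        different lines; in S_hot they always share one. */
     return 0;
 }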

Answer: accessing one member of a structure does not pull the entire structure into the cache, but it may pull other members of the structure into the cache.

+8

The hardware does not know the layout of the structure; it just loads a few bytes around the accessed element into the cache. And yes, the slowdown with larger structures is due to the fact that they are spread over more cache lines.

+8

Accessing a member of a structure incurs no more of a performance penalty than accessing any other area of memory. In fact, there may even be a performance improvement if you access several members in the same area, since the first access can bring the other members into the cache.

+3

Typically the L1 cache is virtually indexed. When you access a struct member, a certain number of bytes are pulled into the cache (one cache line, usually 8 to 512 bytes in size). Since the members of a struct are laid out side by side in memory, the chance that the entire structure ends up in the cache is somewhat higher (it depends on sizeof(struct your_struct))...

+1

While the CPU can happily handle loads and stores as small as one byte, caches only ever deal with cache-line-sized data. In computer architecture textbooks this is also known as the "block size".

On most systems, this is 32 or 64 bytes. It can differ from one processor to another, and sometimes from one cache level to the next.
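
On Linux with glibc you don't have to guess; the line size can be queried at run time (a minimal sketch; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may report 0 if the value is unknown):

 #include <stdio.h>
 #include <unistd.h>

 int main(void) {
     /* glibc extension: L1 data cache line size in bytes */
     long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
     printf("L1 dcache line size: %ld bytes\n", line);
     return 0;
 }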

In addition, some processors do speculative prefetching: if you access cache lines 5 and 6 sequentially, the processor will try to load cache line 7 before you ask for it.

+1

"Simple repetition through a linked list becomes much slower as the structure grows, even if you actually pay attention to some other pointer."

With NPAD = 0, each 64-byte cache line holds 8 list nodes, so you can see why that is the fastest case.

With NPAD = 7, 15, 31, only one cache line needs to be loaded per list node, and you might expect them all to run at the same speed: one cache miss per node. But a modern memory subsystem does speculative prefetching. If it has spare capacity (which it probably does, since modern main memory can serve several reads in parallel), it will start loading memory near the memory you are using. Although this is a linked list, if you built it in any obvious way, there is a good chance you are accessing memory sequentially. So the closer together your list nodes are in memory, the more likely the cache is to already hold what you need; see the sketch below.
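
For example, a list whose nodes sit in one contiguous block, linked in address order, is traversed in exactly the order the prefetcher likes (a minimal sketch, reusing the struct l from the question; build_list is a hypothetical helper):

 #include <stdlib.h>

 /* Allocate all nodes in one contiguous block and link them in
    address order, so walking the list reads memory sequentially
    and the hardware prefetcher can run ahead of the loads. */
 static struct l *build_list(size_t count) {
     if (count == 0)
         return NULL;
     struct l *nodes = malloc(count * sizeof *nodes);
     if (nodes == NULL)
         return NULL;
     for (size_t i = 0; i + 1 < count; i++)
         nodes[i].n = &nodes[i + 1];
     nodes[count - 1].n = NULL;
     return nodes;
 }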

In the worst case, when your memory is being paged in from swap as you use it, your program is limited by disk I/O. Your rate of progress through the list might then be determined entirely by how many nodes fit on each page, and you could see running time directly proportional to node size all the way up to 4 KiB. I haven't tried this, though, and the OS will be as clever about swapping as the MMU is about main memory, so it is not necessarily that simple.

+1
