Understanding how the CPU decides what is loaded into the cache

Suppose a computer has a 64 KB L1 cache and a 512 KB L2 cache.

The programmer has created and populated an array of 10 million elements in main memory (for example, the vertex/index data of a 3D model).

The array might contain a number of structures, such as:

struct x { vec3 pos; vec3 normal; vec2 texcoord; }; 

The programmer must then perform some operation on all of this data, for example a one-time normal calculation before transferring the data to the GPU.

How does the processor decide what data is loaded into the L2 cache?

How can a programmer find out the cache line size for a given architecture?

How can a programmer ensure that data is organized so that it fits into cache lines?

Is aligning data to byte boundaries the only thing that can be done to help this process?

What can a programmer do to minimize cache misses?

What profiling tools are available to help visualize the optimization process on Windows and Linux platforms?

c++ optimization caching

1 answer

There are many questions here, so I will keep the answers short.

How does the processor decide what data is loaded into the L2 cache?

Everything you use gets loaded. L2 behaves the same way as L1, except that there is more of it, and aliasing (which can cause premature eviction) behaves differently because its line size and associativity differ. Some CPUs fill L2 only with data that has been evicted from L1, but this distinction matters little to the programmer.

Most MMUs have a facility for uncached memory, but that is intended for device drivers. I don't recall ever seeing a facility to disable L2 without also disabling L1. Without caching, you get no performance.

How can a programmer find out the cache line size for a given architecture?

Consult the hardware's reference manual. Some operating systems also provide a query facility, such as sysctl.

How can a programmer ensure that data is organized so that it fits into cache lines?

The basic idea is spatial locality: data that is accessed together, by the same inner loop, should live in the same data structure. The optimal layout fits that structure within one cache line and aligns it to the cache-line size.

Don't worry about this unless you are carefully using your profiler as a guide.

Is aligning data to byte boundaries the only thing that can be done to help this process?

No; the other part is avoiding filling the cache with extraneous data. If some fields will be used only by some other algorithm, they waste cache space while the main algorithm is running. But you cannot optimize everything all the time, and reorganizing data structures takes engineering effort.

What can a programmer do to minimize cache misses?

Profile using real-world data, and treat excessive misses as bugs.

What profiling tools are available to help visualize the optimization process on Windows and Linux platforms?

Cachegrind is very good, but it runs your program in a simulated machine. Intel VTune uses your real hardware, for better or worse. I have not used the latter.
