C++ cache-aware programming

Is there a way in C++ to determine the CPU cache size? I have an algorithm that processes a lot of data, and I would like to split that data into chunks so that each chunk fits into the cache. Is this possible? Can you give me any other programming guidelines related to cache size (especially regarding multi-threaded / multi-core data processing)?

Thanks!

+55
c++ optimization caching cpu-cache
Dec 17 '09 at 14:46
12 answers

This is a copy of my answer to another question, but here goes:

Here is a link to a really good cache/memory optimization paper by Christer Ericson (of God of War I/II/III fame).

It's a couple of years old, but it's still very relevant.

+37
Dec 17 '09 at 14:51

According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux:

Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:

/sys/devices/system/cpu/cpu*/cache 

This is covered in Section 6: What Programmers Can Do.

It also describes a short test, right there in Figure 6.5, that can be used to determine the L1d cache size if you cannot get it from the OS.

One more thing I came across in the paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a Linux call that should return the L2 cache size, although it does not seem to be well documented.

+15
Dec 24 '09 at 9:36

C++ itself doesn't "care" about CPU caches, so there is no support for querying cache sizes built into the language. If you are developing for Windows, there is the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.

+11
Dec 17 '09 at 14:54

Preallocate a large array. Then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when a cache miss occurs. From that you can work out your L1 cache size. It may not work, but it's worth a try.

+8
Aug 29 '10 at 20:13

Read the processor's cpuid (x86) and then determine the cache size from a lookup table. The table has to be filled with the cache sizes that the processor manufacturers publish in their programming manuals.

+4
Dec 17 '09 at 14:49

Depending on what you're trying to do, you might also leave it to some library. Since you mention multi-core processing, you may want to take a look at Intel Threading Building Blocks.

TBB includes cache-aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation; I could not find a direct link).

+4
Dec 17 '09 at 15:48

Interestingly enough, I wrote a program to do this a while ago (in C though, but I'm sure it would be easy to incorporate into C++ code).

http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c

The interesting function is get_cache_line, which returns the location of the biggest spike in data access timings over the array. It guessed correctly on my machine! If nothing else, it can help you build your own.

It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/

+4
Aug 29 '10 at 20:40

You can see this topic: http://software.intel.com/en-us/forums/topic/296674

The short answer from that other thread:

On modern IA-32 hardware, the cache line size is 64 bytes. The value 128 is a legacy of the Intel NetBurst microarchitecture (e.g. Intel Pentium D), where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false-sharing point of view, the effective line size is 128 bytes on NetBurst processors. ( http://software.intel.com/en-us/forums/topic/292721 )

+4
Feb 24 '13 at 2:25

IIRC, GCC has a __builtin_prefetch hint.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

has a good section on it. Basically:

__builtin_prefetch(&array[i + LookAhead], rw, locality); 

where rw is either 0 (prepare for a read) or 1 (prepare for a write), and locality is a number from 0 to 3, where zero means no temporal locality and 3 means very strong temporal locality.

Both are optional arguments. LookAhead would be the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.

+1
Nov 03 '13 at 3:59

In C++17, you can use std::hardware_destructive_interference_size to get a compile-time estimate of the L1 cache line size. See https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size. As far as I know, only Microsoft Visual Studio 2019 supports it at the moment.

0
Jun 04 '19 at 20:13

There are two cases to distinguish: do you need to know the cache sizes at compile time or at runtime?

Determining cache size at compile time

For some applications you know the exact architecture your code will run on, for example if you can compile the code directly on the host machine. In that case, looking up the size and hard-coding it is an option (this can be automated in the build system). On most machines today, the L1 cache line is 64 bytes.

If you want to avoid that complexity, or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It provides a compile-time estimate of the cache line size, but be aware of its limitations: the compiler can only guess a value when it creates the binary, since the actual cache line size generally depends on the architecture.

Determining Runtime Cache Size

At runtime you have the advantage of knowing the exact machine, but you need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). Similarly, you can also read the L2 cache size at runtime.

I would advise against trying to guess the cache line size by running benchmarks at startup and measuring which variant performs best. It may well work, but it is also error-prone if the CPU is being used by other processes.

Combination of both approaches

If you have to ship one binary, and the machines it will later run on have different architectures with varying cache sizes, you could create specialized code parts for each cache size and then dynamically (at application startup) choose the best-fitting one.
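A sketch of that startup-time selection (detect_l2_bytes here is a hypothetical stand-in for any of the query methods in the other answers, hard-wired to 256 KiB purely for illustration):

```cpp
#include <cstddef>

// Placeholder for a real runtime query (sysfs, sysconf, cpuid, ...).
std::size_t detect_l2_bytes() { return 256 * 1024; }

// Pick a tile size at startup: target roughly half of L2 so that an input
// tile and an output tile can reside in cache at the same time.
std::size_t choose_tile_elems(std::size_t elem_size) {
    return (detect_l2_bytes() / 2) / elem_size;
}
```

The same idea extends to selecting among several precompiled kernel variants via a function pointer chosen once at startup.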

0
Jun 04 '19 at 23:53

The cache usually does the right thing. The only real worry for the average programmer is false sharing, and you cannot take care of that at runtime because it requires compiler directives.

-1
Dec 18 '09 at 5:06
