Using valgrind to measure cache misses

I have a critical path that runs on a single thread attached to a single core.

I am interested in determining where cache misses occur. Looking around, it seems that the valgrind cachegrind tool will help me. However, I have some questions regarding the capabilities of the tool in this scenario:

  • How specific are the reported locations of cache misses? Does the tool show a variable name?
  • Can I profile only one thread?
  • Is it possible to profile only certain parts of the code?
  • Do the options for measuring cache misses apply equally to TLB misses?
  • Is it possible to use cachegrind with my release / optimized code?
  • I understand that valgrind runs the program on a virtual machine of sorts and simulates the caches rather than reading hardware counters. How accurate is this approach?

Question 1 is the most important.

Any help with command-line arguments would be greatly appreciated.

+5
2 answers

cachegrind is not the only way to measure this sort of thing. Linux perf and Intel VTune, which read the hardware performance counters directly, are widely used, and various other interfaces exist.

I have not used cachegrind, just perf, but as I understand it its output is similar: it records cache misses together with the instruction that caused them.
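
For comparison, a minimal perf session for this kind of measurement might look like the sketch below. It assumes a Linux box with perf installed and reuses the ./a test binary built in the other answer; exact event names vary between CPUs and kernel versions.

 # Whole-run counts from the hardware counters (negligible slowdown,
 # unlike cachegrind's cache simulation):
 perf stat -e cycles,instructions,cache-references,cache-misses ./a 10000000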

How specific are the reported locations of cache misses? Does the tool show a variable name?

Down to the machine-code address, i.e. a specific instruction. The various interfaces do a better or worse job of helping you see what a given load or store actually accessed, without your having to read the asm and work out which pointer the CPU had in which register. (They can also record the data address that triggered the event, which identifies the cache line that was touched.)
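
As a sketch of how that instruction-level drill-down looks with perf (same assumptions as above; perf mem additionally needs hardware support for precise memory sampling, e.g. PEBS on Intel):

 # Sample the instructions that trigger cache misses, then browse them:
 perf record -e cache-misses ./a 10000000
 perf report       # per-function breakdown
 perf annotate     # per-instruction counts, interleaved with source if built with -g

 # Record the data address of each sampled load/store as well, which
 # identifies the cache line that was touched:
 perf mem record ./a 10000000
 perf mem report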

The mapping from an instruction address back to a C++ source line is not always obvious, in cases where the compiler transformed a loop heavily or hoisted a loop-invariant expression out of a loop after inlining the function that computed it. In general, I would still recommend profiling optimized code. If you are only looking for cache misses, the extra stack traffic from locals that optimized code would have kept in registers probably does not evict many lines from the cache, and storing / reloading variables after every use mostly touches cache lines that are already hot. However, if you look at the profile output as a whole, judging which parts of the program are the CPU-time hotspots is only meaningful for optimized code. This is especially true for C++, where a large amount of template code is expected to be inlined and optimized away.
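
For instance, one way to profile an optimized build while keeping a usable mapping back to source is to add debug info and keep frame pointers. The gcc and perf flags below are ordinary, generic options, and a.c / ./a refer to the test program from the other answer:

 # -g only adds metadata; it does not change the generated code:
 gcc -O2 -g -fno-omit-frame-pointer a.c -o a

 # Frame pointers (or DWARF unwinding) let perf attribute events through
 # the call chain of the optimized binary:
 perf record -e cache-misses --call-graph fp ./a 10000000
 perf report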

Is it possible to profile only certain parts of the code?

Yes, in principle: a profiler can inspect the call chain when an event fires and count it only if the current function was ultimately called from one of the functions you are interested in, or it can enable counting on entry to a function of interest. IDK whether cachegrind has a good way to do this.
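
One concrete way to get this within the valgrind family is callgrind rather than cachegrind: with --cache-sim=yes it reports the same D1/LL miss events, and collection can be toggled around a function of interest. This is only a sketch; hot_function is a placeholder for your own symbol name.

 # Start with collection off, toggle it on entry/exit of hot_function:
 valgrind --tool=callgrind --cache-sim=yes \
          --collect-atstart=no --toggle-collect=hot_function ./a 10000000
 callgrind_annotate callgrind.out.<pid>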

Do the options for measuring cache misses apply equally to TLB misses?

Yup, there are hardware performance counters for TLB misses and page walks, as well as for clock cycles, instructions, and so on.

Intel Sandybridge has 11 PMU counters, so you can measure 11 different events in one run with full accuracy (i.e. without time-multiplexing the counters).
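
With perf, the TLB events sit right next to the cache events; the event names below are the generic ones and may differ on your CPU and kernel, so treat this as illustrative:

 perf list | grep -i tlb        # see which TLB events your system exposes
 perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses ./a 10000000

 # Requesting more events than there are hardware counters still works:
 # perf time-multiplexes them, scales the counts, and prints the
 # percentage of time each event was actually being measured.
 perf stat -e cycles,instructions,cache-misses,LLC-load-misses,dTLB-load-misses ./a 10000000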

0

cachegrind can report cache misses both as program-wide totals and, if the program was compiled with debugging information, annotated at the line level. For example, take the following code:

 #include <stdlib.h>
 #include <stdio.h>
 #include <stdint.h>

 int main(int argc, char **argv)
 {
     size_t n = (argc == 2) ? atoi(argv[1]) : 100;
     double *v = malloc(sizeof(double) * n);

     /* first pass: initialize the array (sequential writes) */
     for (size_t i = 0; i < n; i++)
         v[i] = i;

     /* second pass: read the array from both ends */
     double s = 0;
     for (size_t i = 0; i < n; ++i)
         s += v[i] * v[n - 1 - i];

     printf("%f\n", s);
     free(v);
     return 0;
 }

compiled with gcc a.c -O2 -g -o a and run with valgrind --tool=cachegrind ./a 10000000 , which outputs:

 ==11551== Cachegrind, a cache and branch-prediction profiler
 ==11551== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
 ==11551== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
 ==11551== Command: ./a 10000000
 ==11551==
 --11551-- warning: L3 cache found, using its data for the LL simulation.
 80003072
 ==11551==
 ==11551== I   refs:      150,166,282
 ==11551== I1  misses:            876
 ==11551== LLi misses:            870
 ==11551== I1  miss rate:        0.00%
 ==11551== LLi miss rate:        0.00%
 ==11551==
 ==11551== D   refs:       30,055,919  (20,041,763 rd   + 10,014,156 wr)
 ==11551== D1  misses:      3,752,224  ( 2,501,671 rd   +  1,250,553 wr)
 ==11551== LLd misses:      3,654,291  ( 2,403,770 rd   +  1,250,521 wr)
 ==11551== D1  miss rate:        12.4% (      12.4%     +       12.4%  )
 ==11551== LLd miss rate:        12.1% (      11.9%     +       12.4%  )
 ==11551==
 ==11551== LL refs:         3,753,100  ( 2,502,547 rd   +  1,250,553 wr)
 ==11551== LL misses:       3,655,161  ( 2,404,640 rd   +  1,250,521 wr)
 ==11551== LL miss rate:          2.0% (       1.4%     +       12.4%  )

The I1 miss rate shows that there were essentially no misses in the instruction cache.

The D1 miss rate shows that there were many misses in the L1 data cache: about 12.4%, i.e. roughly one miss per eight accesses, which is what you expect when streaming sequentially through doubles with 64-byte cache lines (eight doubles per line).

The LLd figures show that most of those misses also missed in the last-level cache.

To see more precisely where the misses occur, we can run kcachegrind cachegrind.out.11549 , select the "L1 Data Read Miss" event and navigate to the corresponding lines of the application code.
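
If you prefer to stay on the command line, the cg_annotate script that ships with valgrind gives a similar per-line breakdown (assuming the same output file as above; --auto=yes asks it to annotate every source file it has debug info for):

 cg_annotate --auto=yes cachegrind.out.11549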

This should answer 1). I believe the answer is no for 2), 3) and 4). It is yes for 5), provided you compile with debugging information (without it you still get the global statistics, just not the per-line information). As for 6), I would say that valgrind usually gives a very decent first approximation; going with perf is obviously more accurate, though!

0
