cachegrind can report both global and per-function information about cache misses, and it can annotate them down to the source line (if the program was compiled with debugging information). For example, the following code:
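    #include <stdlib.h>
    #include <stdio.h>
    #include <stdint.h>

    int main(int argc, char**argv) {
      size_t n = (argc == 2 ) ? atoi(argv[1]) : 100;
      double* v = malloc(sizeof(double) * n);
      /* sequential write pass over the array */
      for(size_t i = 0; i < n ; i++)
        v[i] = i;
      double s = 0;
      /* two read passes at once: forward (v[i]) and backward (v[n - 1 - i]) */
      for(size_t i = 0; i < n ; ++i)
        s += v[i] * v[n - 1 - i];
      /* note: s is a double, so "%f" would be the correct format here;
         "%ld" is kept because it produced the value in the output below */
      printf("%ld\n", s);
      free(v);
      return 0;
    }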
compiled with gcc a.c -O2 -g -o a and run with valgrind --tool=cachegrind ./a 10000000 outputs:
    ==11551== Cachegrind, a cache and branch-prediction profiler
    ==11551== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==11551== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
    ==11551== Command: ./a 10000000
    ==11551==
    --11551-- warning: L3 cache found, using its data for the LL simulation.
    80003072
    ==11551==
    ==11551== I   refs:      150,166,282
    ==11551== I1  misses:            876
    ==11551== LLi misses:            870
    ==11551== I1  miss rate:        0.00%
    ==11551== LLi miss rate:        0.00%
    ==11551==
    ==11551== D   refs:       30,055,919  (20,041,763 rd   + 10,014,156 wr)
    ==11551== D1  misses:      3,752,224  ( 2,501,671 rd   +  1,250,553 wr)
    ==11551== LLd misses:      3,654,291  ( 2,403,770 rd   +  1,250,521 wr)
    ==11551== D1  miss rate:        12.4% (      12.4%     +       12.4%  )
    ==11551== LLd miss rate:        12.1% (      11.9%     +       12.4%  )
    ==11551==
    ==11551== LL refs:         3,753,100  ( 2,502,547 rd   +  1,250,553 wr)
    ==11551== LL misses:       3,655,161  ( 2,404,640 rd   +  1,250,521 wr)
    ==11551== LL miss rate:          2.0% (       1.4%     +       12.4%  )
The I1 miss rate tells us that there were practically no misses in the instruction cache.
The D1 miss rate tells us that there were a lot of misses in the L1 data cache.
The LL misses tell us that nearly all of those misses also missed in the last-level cache.
To get a more precise picture of where the misses occur, we can run kcachegrind cachegrind.out.11549, select L1 Data Read Miss, and navigate to the annotated source code of the application.
This should answer 1). I think the answer is no for 2), 3) and 4). It is yes for 5) if you compile with debugging information (without it you still get the global information, but not the per-line information). As for 6), I would say that valgrind usually provides a very decent first approximation. Going to perf is obviously more accurate!