Inspired by SQLite, I have been looking at valgrind's "cachegrind" tool for reproducible performance benchmarking. The numbers it reports are much more stable than those from any other timing method I have found, but they are still not deterministic. As an example, here is a simple C program:
int main(void)
{
    /* volatile keeps the compiler from optimizing the loop away at -O2 */
    volatile int x = 0;
    while (x < 1000000) {
        x++;
    }
    return 0;
}
If I compile it and run it in cachegrind, I get the following results:
$ gcc -O2 x.c -o x
$ valgrind --tool=cachegrind ./x
==11949== Cachegrind, a cache and branch-prediction profiler
==11949== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11949== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11949== Command: ./x
==11949==
--11949-- warning: L3 cache found, using its data for the LL simulation.
==11949==
==11949== I   refs:      11,158,333
==11949== I1  misses:         3,565
==11949== LLi misses:         2,611
==11949== I1  miss rate:       0.03%
==11949== LLi miss rate:       0.02%
==11949==
==11949== D   refs:       4,116,700  (3,552,970 rd   + 563,730 wr)
==11949== D1  misses:        21,119  (   19,041 rd   +   2,078 wr)
==11949== LLd misses:         7,487  (    6,148 rd   +   1,339 wr)
==11949== D1  miss rate:        0.5% (      0.5%     +     0.4%  )
==11949== LLd miss rate:        0.2% (      0.2%     +     0.2%  )
==11949==
==11949== LL refs:           24,684  (   22,606 rd   +   2,078 wr)
==11949== LL misses:         10,098  (    8,759 rd   +   1,339 wr)
==11949== LL miss rate:         0.1% (      0.1%     +     0.2%  )
$ valgrind --tool=cachegrind ./x
==11982== Cachegrind, a cache and branch-prediction profiler
==11982== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11982== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11982== Command: ./x
==11982==
--11982-- warning: L3 cache found, using its data for the LL simulation.
==11982==
==11982== I   refs:      11,159,225
==11982== I1  misses:         3,611
==11982== LLi misses:         2,611
==11982== I1  miss rate:       0.03%
==11982== LLi miss rate:       0.02%
==11982==
==11982== D   refs:       4,117,029  (3,553,176 rd   + 563,853 wr)
==11982== D1  misses:        21,174  (   19,090 rd   +   2,084 wr)
==11982== LLd misses:         7,496  (    6,154 rd   +   1,342 wr)
==11982== D1  miss rate:        0.5% (      0.5%     +     0.4%  )
==11982== LLd miss rate:        0.2% (      0.2%     +     0.2%  )
==11982==
==11982== LL refs:           24,785  (   22,701 rd   +   2,084 wr)
==11982== LL misses:         10,107  (    8,765 rd   +   1,342 wr)
==11982== LL miss rate:         0.1% (      0.1%     +     0.2%  )
$
In this case, "I refs" differs by only 0.008% between the two runs, but I still wonder why the counts differ at all. In more complex programs (ones that run for tens of milliseconds) the variation can be larger. Is there a way to make the runs fully reproducible?