Let T be a rooted binary tree in which every inner node has exactly two children. The nodes of the tree are stored in an array, let's call it TreeArray, in pre-order.
So, for example, suppose this is the tree we have:

[tree diagram not reproduced here]
Then TreeArray will contain the nodes with the following ids, in this order:
7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13
A node in this tree is a structure of this type:
struct tree_node {
    int id;
    // ... (the remaining fields were cut off here)
};
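Since the rest of the struct did not survive, here is a minimal sketch of what it presumably also contains so that the traversal below can find the children; the field names leaf, left_child_index and right_child_index are placeholders for illustration, not necessarily the real ones:

struct tree_node {
    int id;                 // identifier of the node
    bool leaf;              // true if the node has no children
    int left_child_index;   // position of the left child inside TreeArray
    int right_child_index;  // position of the right child inside TreeArray
};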
Now suppose we need a function that returns the sum of all identifiers in the tree. This sounds trivial: all you have to do is loop over TreeArray with a for loop and accumulate the identifiers you find. However, I am interested in understanding the cache behaviour of the following implementation:
void testCache1(int cur) {
    // ... (the body was cut off here)
}
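Judging from the rest of the post (one call per node, consecutive cur values), the body is a recursive walk that accumulates the current node's id and then recurses into the two children through the indices stored in the node; a sketch using the placeholder fields from above:

void testCache1(int cur) {
    r += TreeArray[cur].id;                        // accumulate this node's id
    if (!TreeArray[cur].leaf) {                    // inner nodes have exactly two children
        testCache1(TreeArray[cur].left_child_index);
        testCache1(TreeArray[cur].right_child_index);
    }
}

Because the array is laid out in pre-order, the left child of the node at index cur sits at cur + 1, so this recursion still visits the indices 0, 1, 2, ... in increasing order.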
To test the behavior of the cache, I have the following experiment:
int r = 0; // r is a global variable

int main(int argc, char* argv[]) {
    for (int i = 0; i < 100; i++) {
        r = 0;
        testCache1(0);
    }
    cout << r << endl;
    return 0;
}
For a random tree with 5 million leaves, perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt prints the following:
Performance counter stats for './run_tests 111.txt':

    469,511,047      cache-misses
At first I thought this might be due to the way I generate the tree (which I have left out of the question), but when I run sudo perf record -e cache-misses ./run_tests 111.txt, I get the following output:

[perf report output not reproduced here; testCache1 sits at the top of the cache-miss profile]
As we can see, most of the cache misses come from this function, and I cannot understand why. The cur values will be consecutive: since the array stores the tree in pre-order, I first access position 0 of TreeArray, then position 1, 2, 3, and so on.
To confuse my understanding of what is happening even further, I have the following function, which computes the same sum:
void testCache4(int index){
    if(index == TreeArray.size) return;
    r += TreeArray[index].id;
    testCache4(index + 1);
}
testCache4 accesses the elements of TreeArray in the same consecutive order, but its cache behaviour is much better.
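It is essentially the tail-recursive form of the trivial loop mentioned at the beginning; for reference, that loop would look roughly like this (the helper name sumIdsLoop is made up, and it assumes TreeArray is a std::vector<tree_node>, hence the .size() call):

void sumIdsLoop() {
    // straightforward baseline: walk the array front to back and accumulate the ids
    for (std::size_t i = 0; i < TreeArray.size(); ++i)
        r += TreeArray[i].id;
}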
The output of perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt is:
Performance counter stats for './run_tests 111.txt':

    396,941,872      cache-misses
And in the output of sudo perf record -e cache-misses ./run_tests 111.txt the function does not even show up:

[perf report output not reproduced here; testCache4 is absent from the cache-miss profile]
I apologize for the long post, but I feel completely lost. Thanks in advance.
EDIT:
Here is the entire test file, together with the parsers and everything else that is required. The tree is expected to be in the text file passed as an argument. Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable tree.txt. The tree that I use can be found here (do not open it in the browser; save it instead).
#include <iostream>
// ... (the rest of the file was cut off here)
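For reference, the compile-and-run sequence spelled out as shell commands; the -o executable flag is an assumption, added only so that the binary matches the ./executable name used above:

g++ -O3 -std=c++11 file.cpp -o executable
./executable tree.txt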
EDIT2:
I ran some profiling tests with valgrind. The extra instructions may indeed be the overhead, but I do not understand why. For example, even in the perf experiments above, one version executes about 20 billion instructions and the other about 11 billion; that is a difference of 9 billion.
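For reference, per-function instruction counts with valgrind are normally obtained with the callgrind tool; a sketch of the standard invocation (the exact commands used for the numbers below are not in the post):

valgrind --tool=callgrind ./run_tests 111.txt
callgrind_annotate callgrind.out.<pid>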
With -O3 turned on, I get the following:

[callgrind output not reproduced here]
So function calls are expensive in testCache1 but cost nothing in testCache4? The number of function calls should be the same in both cases...