Why does this recursive C++ function show such bad cache behavior?

Let T be a rooted binary tree in which every internal node has exactly two children. The nodes of the tree are stored in an array, let's call it TreeArray, following the pre-order layout.

So, for example, if this is the tree we have:

(image of the example tree)

Then TreeArray will contain the following node objects:

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13

A node in this tree is a structure of this type:

    struct tree_node{
        int id;          //id of the node, randomly generated
        int numChildren; //number of children, it is 2 but for the leaves it is 0
        int pos;         //position in TreeArray where the node is stored
        int lpos;        //position of the left child
        int rpos;        //position of the right child
        tree_node(){
            id = -1;
            pos = lpos = rpos = -1;
            numChildren = 0;
        }
    };
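For context, such a pre-order layout can be produced by a depth-first traversal that writes each node into the next free slot of the array. Below is only a minimal sketch of the idea, using a hypothetical pointer-based node type ptr_node that is not part of my code (my actual parser and layout code is in the EDIT at the end):

    //hypothetical pointer-based node, used only for this sketch
    struct ptr_node {
        int id;
        ptr_node *left;
        ptr_node *right;  //both null for a leaf
    };

    //writes the subtree rooted at 'cur' into TreeArray starting at index 'nextFree',
    //in pre-order, and returns the first unused position afterwards
    int layoutPreOrder(ptr_node *cur, int nextFree){
        int myPos = nextFree++;
        TreeArray[myPos].id  = cur->id;
        TreeArray[myPos].pos = myPos;
        if(cur->left == nullptr){           //leaf
            TreeArray[myPos].numChildren = 0;
            return nextFree;
        }
        TreeArray[myPos].numChildren = 2;
        TreeArray[myPos].lpos = nextFree;   //left child comes right after the parent
        nextFree = layoutPreOrder(cur->left, nextFree);
        TreeArray[myPos].rpos = nextFree;   //right child comes after the whole left subtree
        return layoutPreOrder(cur->right, nextFree);
    }

After allocating TreeArray with one entry per node, the layout would be produced by calling layoutPreOrder(root, 0).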

Now suppose we need a function that returns the sum of all ids in the tree. This sounds trivial: all you have to do is loop over TreeArray with a for loop and accumulate every id you find (a sketch of that straightforward scan is included after the next code snippet). However, I am interested in understanding the cache behavior of the following recursive implementation:

    void testCache1(int cur){
        //find the positions of the left and right children
        int lpos = TreeArray[cur].lpos;
        int rpos = TreeArray[cur].rpos;
        //if there are no children we are at a leaf so update r and return
        if(TreeArray[cur].numChildren == 0){
            r += TreeArray[cur].id;
            return;
        }
        //otherwise we are in an internal node, so update r and recurse
        //first to the left subtree and then to the right subtree
        r += TreeArray[cur].id;
        testCache1(lpos);
        testCache1(rpos);
    }
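For reference, the trivial iterative version mentioned above is just a linear scan over the array. A minimal sketch, assuming a variable numNodes that holds the total number of nodes (that name does not appear in my actual code):

    //straightforward scan; numNodes is assumed to be the number of nodes stored in TreeArray
    int sumIDs(int numNodes){
        int sum = 0;
        for(int i = 0; i < numNodes; i++)
            sum += TreeArray[i].id;
        return sum;
    }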

To test the behavior of the cache, I have the following experiment:

    int r = 0; //r is a global variable

    int main(int argc, char* argv[]){
        for(int i=0;i<100;i++) {
            r = 0;
            testCache1(0);
        }
        cout<<r<<endl;
        return 0;
    }

For a random tree with 5 million leaves, perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt prints the following:

     Performance counter stats for './run_tests 111.txt':

           469,511,047      cache-misses              #   89.379 % of all cache refs
           525,301,814      cache-references
        20,715,360,185      instructions

          11.214075268 seconds time elapsed

At first I thought that maybe this has to do with how I generate the tree, which I had left out of the question, but when I run sudo perf record -e cache-misses ./run_tests 111.txt, I get the following output:

(perf report screenshot: most of the cache misses are attributed to testCache1)

As we can see, most of the cache misses come from this function, and I cannot understand why. The values of cur will be consecutive: I first access position 0 of TreeArray, then position 1, 2, 3, and so on.

To add to my confusion about what is happening, here is another function that computes the same sum:

    void testCache4(int index){
        if(index == TreeArray.size) return;
        r += TreeArray[index].id;
        testCache4(index+1);
    }

testCache4 accesses the elements of TreeArray in the same order, yet its cache behavior is much better.

Output of perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt:

     Performance counter stats for './run_tests 111.txt':

           396,941,872      cache-misses              #   54.293 % of all cache refs
           731,109,661      cache-references
        11,547,097,924      instructions

           4.306576556 seconds time elapsed

And in the output of sudo perf record -e cache-misses ./run_tests 111.txt the function does not even show up:

(perf report screenshot: testCache4 does not appear among the cache-miss hotspots)

I apologize for the long post, but I feel completely lost. Thanks in advance.

EDIT:

Here is the entire test file, together with the parser and everything else that is required. The tree is expected to be in the text file given as a command-line argument. Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable tree.txt. The tree that I use can be found here (do not open it directly, use "Save").

    #include <iostream>
    #include <fstream>
    #include <ctime>    //clock_gettime and timespec
    #include <cstdlib>  //atoi
    #include <string>
    #define BILLION 1000000000LL
    using namespace std;

    /*
     * Timing functions
     */
    timespec startT, endT;

    void startTimer(){
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
    }

    double endTimer(){
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
        return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
    }

    /*
     * tree node
     */

    //this struct is used for creating the first tree after reading it from the external file,
    //for this we need left and right child pointers
    struct tree_node_temp{
        int id;          //id of the node, randomly generated
        int numChildren; //number of children, it is 2 but for the leaves it is 0
        int size;        //size of the subtree rooted at the current node
        tree_node_temp *leftChild;
        tree_node_temp *rightChild;
        tree_node_temp(){
            id = -1;
            size = 1;
            leftChild = nullptr;
            rightChild = nullptr;
            numChildren = 0;
        }
    };

    struct tree_node{
        int id;          //id of the node, randomly generated
        int numChildren; //number of children, it is 2 but for the leaves it is 0
        int size;        //size of the subtree rooted at the current node
        int pos;         //position in TreeArray where the node is stored
        int lpos;        //position of the left child
        int rpos;        //position of the right child
        tree_node(){
            id = -1;
            pos = lpos = rpos = -1;
            numChildren = 0;
        }
    };

    /*
     * Tree parser. The input is a file containing the tree in the newick format.
     */
    string treeNewickStr;             //string storing the newick format of a tree that we read from a file
    int treeCurSTRindex;              //index to the current position we are in while reading the newick string
    int treeNumLeafs;                 //number of leaves in the current tree
    tree_node ** treeArrayReferences; //stack of references to free node objects
    tree_node *treeArray;             //array of node objects
    int treeStackReferencesTop;       //the top index to the references stack
    int curpos;                       //used to find pos, lpos and rpos when creating the pre-order layout tree

    //helper function for readNewick
    tree_node_temp* readNewickHelper() {
        int i;
        if(treeCurSTRindex == treeNewickStr.size())
            return nullptr;
        tree_node_temp * leftChild;
        tree_node_temp * rightChild;
        if(treeNewickStr[treeCurSTRindex] == '('){
            //create a left child
            treeCurSTRindex++;
            leftChild = readNewickHelper();
        }
        if(treeNewickStr[treeCurSTRindex] == ','){
            //create a right child
            treeCurSTRindex++;
            rightChild = readNewickHelper();
        }
        if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
            treeCurSTRindex++;
            tree_node_temp * cur = new tree_node_temp();
            cur->numChildren = 2;
            cur->leftChild = leftChild;
            cur->rightChild = rightChild;
            cur->size = 1 + leftChild->size + rightChild->size;
            return cur;
        }
        //we are about to read a label, keep reading until we read a "," ")" or "("
        //(we assume that the newick string has the right format)
        i = 0;
        char treeLabel[20]; //buffer used for the label
        while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
            treeLabel[i] = treeNewickStr[treeCurSTRindex];
            treeCurSTRindex++;
            i++;
        }
        treeLabel[i] = '\0';
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 0;
        cur->id = atoi(treeLabel)-1;
        treeNumLeafs++;
        return cur;
    }

    //create the pre-order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
    void treeInit(tree_node_temp * curRoot){
        tree_node * curFinalRoot = treeArrayReferences[curpos];
        curFinalRoot->pos = curpos;
        if(curRoot->numChildren == 0) {
            curFinalRoot->id = curRoot->id;
            return;
        }
        //add left child
        tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
        curFinalRoot->lpos = curpos + 1;
        curpos = curpos + 1;
        treeStackReferencesTop++;
        cnode->id = curRoot->leftChild->id;
        treeInit(curRoot->leftChild);
        //add right child
        curFinalRoot->rpos = curpos + 1;
        curpos = curpos + 1;
        cnode = treeArrayReferences[treeStackReferencesTop];
        treeStackReferencesTop++;
        cnode->id = curRoot->rightChild->id;
        treeInit(curRoot->rightChild);
        curFinalRoot->id = curRoot->id;
        curFinalRoot->numChildren = 2;
        curFinalRoot->size = curRoot->size;
    }

    //the ids of the leaves are determined by the newick file, for the internal nodes we just incrementally assign the id determined by the dfs traversal
    void updateInternalNodeIDs(int cur){
        tree_node* curNode = treeArrayReferences[cur];
        if(curNode->numChildren == 0){
            return;
        }
        curNode->id = treeNumLeafs++;
        updateInternalNodeIDs(curNode->lpos);
        updateInternalNodeIDs(curNode->rpos);
    }

    //frees the memory of the first tree generated by the parser
    void treeFreeMemory(tree_node_temp* cur){
        if(cur->numChildren == 0){
            delete cur;
            return;
        }
        treeFreeMemory(cur->leftChild);
        treeFreeMemory(cur->rightChild);
        delete cur;
    }

    //reads the tree stored in "file" in the newick format and creates it in main memory.
    //The return value is a pointer to the root of the tree. This first tree is scattered anywhere in memory.
    tree_node* readNewick(string& file){
        treeCurSTRindex = -1;
        treeNewickStr = "";
        treeNumLeafs = 0;
        ifstream treeFin;
        treeFin.open(file, ios_base::in);
        //read the newick format of the tree and store it in a string
        treeFin>>treeNewickStr;
        //initialize index for reading the string
        treeCurSTRindex = 0;
        //create the tree in main memory
        tree_node_temp* root = readNewickHelper();
        //store the tree in an array following the pre-order layout
        treeArray = new tree_node[root->size];
        treeArrayReferences = new tree_node*[root->size];
        int i;
        for(i=0;i<root->size;i++)
            treeArrayReferences[i] = &treeArray[i];
        treeStackReferencesTop = 0;
        tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
        curpos = treeStackReferencesTop;
        treeStackReferencesTop++;
        finalRoot->id = root->id;
        treeInit(root);
        //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
        updateInternalNodeIDs(0);
        //close the file
        treeFin.close();
        //free the memory of the initial tree
        treeFreeMemory(root);
        //return the pre-order tree
        return finalRoot;
    }

    /*
     * experiments
     */
    int r;
    tree_node* T;

    void testCache1(int cur){
        int lpos = treeArray[cur].lpos;
        int rpos = treeArray[cur].rpos;
        if(treeArray[cur].numChildren == 0){
            r += treeArray[cur].id;
            return;
        }
        r += treeArray[cur].id;
        testCache1(lpos);
        testCache1(rpos);
    }

    void testCache4(int index){
        if(index == T->size)
            return;
        r += treeArray[index].id;
        testCache4(index+1);
    }

    int main(int argc, char* argv[]){
        string Tnewick = argv[1];
        T = readNewick(Tnewick);
        double tt;

        startTimer();
        for(int i=0;i<100;i++) {
            r = 0;
            testCache4(0);
        }
        tt = endTimer();
        cout<<r<<endl;
        cout<<tt/BILLION<<endl;

        startTimer();
        for(int i=0;i<100;i++) {
            r = 0;
            testCache1(0);
        }
        tt = endTimer();
        cout<<r<<endl;
        cout<<tt/BILLION<<endl;

        delete[] treeArray;
        delete[] treeArrayReferences;
        return 0;
    }

EDIT2:

I am running some profiling tests with valgrind. The extra instructions may indeed be overhead, but I do not understand why. For example, even in the perf experiments above, one version executes about 20 billion instructions and the other only 11 billion; that is a difference of 9 billion.

With -O3 turned on, I get the following:

(valgrind profiling output screenshot)

So the function calls are expensive in testCache1 but cost nothing in testCache4? The number of function calls should be the same in both cases...

1 answer

My guess is that the problem is a misunderstanding of what the cache-references event actually counts.

As explained in this answer, on Intel processors cache-references counts references to the last-level cache, so memory references that are served by the L1 cache are not included. The Intel 64 and IA-32 Architectures Developer's Manual states that loads from the L1 prefetcher are counted as well.

If you compare the absolute numbers of misses, you will see that they are approximately equal for both functions. For the test I used a perfectly balanced tree and removed pos so that the node size is 16 bytes, which aligns nicely with the cache lines, and I got the following numbers:

testCache4:

        843.628.131      L1-dcache-loads                                             (56,83%)
        193.006.858      L1-dcache-load-misses     #   22,73% of all L1-dcache hits  (57,31%)
        326.698.621      cache-references                                            (57,07%)
        188.435.203      cache-misses              #   57,679 % of all cache refs    (56,76%)

testCache1:

      3.519.968.253      L1-dcache-loads                                             (57,17%)
        193.664.806      L1-dcache-load-misses     #    5,50% of all L1-dcache hits  (57,24%)
        256.638.490      cache-references                                            (57,12%)
        188.007.927      cache-misses              #   73,258 % of all cache refs    (57,23%)

And if I manually disable all the hardware prefetchers:

testCache4:

        846.124.474      L1-dcache-loads                                             (57,22%)
        192.495.450      L1-dcache-load-misses     #   22,75% of all L1-dcache hits  (57,31%)
        193.699.811      cache-references                                            (57,03%)
        185.445.753      cache-misses              #   95,739 % of all cache refs    (57,17%)

testCache1:

      3.534.308.118      L1-dcache-loads                                             (57,16%)
        193.595.962      L1-dcache-load-misses     #    5,48% of all L1-dcache hits  (57,18%)
        193.639.498      cache-references                                            (57,12%)
        185.120.733      cache-misses              #   95,601 % of all cache refs    (57,15%)

As you can see, the differences have disappeared now. There were simply additional cache link events due to the prefetcher, and the actual link was counted twice.

In fact, if you count all memory accesses, testCache1 has the lower overall miss ratio, because each tree_node is accessed four times instead of once; but since all members of a tree_node lie on the same cache line, at most one of those four accesses misses.

For testCache4 you can see that the L1d load miss ratio is indeed close to 25%, which is what you would expect with sizeof(tree_node) == 16 and 64-byte cache lines.
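To make that arithmetic explicit, here is a minimal sketch of the trimmed node I mean (the struct from the question with pos removed; the exact members I used may differ slightly):

    //assumed 16-byte node: the struct from the question without pos
    struct tree_node16 {
        int id;
        int numChildren;
        int lpos;
        int rpos;
    };
    static_assert(sizeof(tree_node16) == 16, "expected a 16-byte node");

    //a 64-byte cache line then holds 64 / 16 = 4 consecutive nodes, so a purely
    //sequential scan misses on roughly 1 out of every 4 nodes, i.e. about 25%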

Also, the compiler (at least gcc at -O2) applies tail-call optimization to both functions, eliminating the recursion in testCache4 entirely and leaving testCache1 with only one real recursive call per node. testCache1 therefore has many additional cache references that testCache4 does not have.
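To illustrate, this is roughly what the two functions are equivalent to after tail-call optimization (only my sketch, not actual compiler output; treeArray, T and r are the globals from the full listing in the question):

    //roughly what testCache4 becomes: a plain loop, no calls at all
    void testCache4_loop(int index){
        for(; index != T->size; index++)
            r += treeArray[index].id;
    }

    //roughly what testCache1 becomes: the second (tail) call is turned into a
    //loop, so only the call on the left child remains a real recursive call
    void testCache1_tco(int cur){
        while(true){
            int lpos = treeArray[cur].lpos;
            int rpos = treeArray[cur].rpos;
            r += treeArray[cur].id;
            if(treeArray[cur].numChildren == 0)
                return;
            testCache1_tco(lpos); //real recursive call, uses the stack
            cur = rpos;           //tail call replaced by a loop iteration
        }
    }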

You can also obtain prefetcher-free results with valgrind, which is probably more reliable in its output; however, it does not model all the properties of real CPU caches.

Regarding your edits: as noted, gcc applies tail-call optimization, so there are no calls left in testCache4 at all, and of course the recursion and the extra memory loads in testCache1 add significant overhead compared to the simple load/add loop that testCache4 is reduced to.
