"Simple sequential traversal of a linked list becomes much slower as the structure grows, even though you only ever follow a single pointer."
With NPAD = 0, each cache line holds 8 list nodes (8-byte nodes in a 64-byte line), so you can see why that is the fastest case.
With NPAD = 7, 15 and 31, each node needs its own cache line, so you might expect them all to run at the same speed: one cache miss per node. But a modern CPU does speculative prefetching. If it has spare capacity (which is likely, since modern main memory can service several reads in parallel), the hardware prefetcher starts loading the memory adjacent to the memory you are using. Although this is a linked list, if you built it in any obvious way there is a good chance you are accessing memory sequentially. So the closer together your list nodes are in memory, the more often the cache will already hold what you need.
In the worst case, when your memory is pulled in from swap as it is used, your program is limited by disk I/O. Your rate of progress through the list would then be determined entirely by how many nodes fit on each page, and you might see time directly proportional to node size all the way up to 4 KiB (the page size). I have not tried this, though, and the OS will be smart about swap just as the MMU is smart about main memory, so it is not necessarily that simple.
Steve Jessop