Two TLB misses per mmap / access / munmap

#include <sys/mman.h>

#define PAGE_SIZE 4096

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}

I expect to get ~100,000 dTLB-store-misses in user space, one for each iteration (along with ~100,000 page faults and dTLB-load-misses on the kernel side). Running the following command, though, the result is roughly twice what I expect. I would appreciate it if anyone could clarify why this is:

 perf stat -e dTLB-store-misses:u ./test

 Performance counter stats for './test':

        200,114      dTLB-store-misses

    0.213379649 seconds time elapsed

PS: I checked, and I'm sure the generated code doesn't contain anything else that would explain this result. And I do get ~100,000 page faults and dTLB-load-misses:k.
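For reference, a single perf invocation along these lines (assuming your perf build exposes these generic events) reports all three counts at once:

 perf stat -e dTLB-store-misses:u,dTLB-load-misses:k,page-faults ./test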

2 answers

I expect to get ~100,000 dTLB-store-misses in user space, one for each iteration

I would expect that:

  • The CPU tries to execute page[0] = 0;, tries to load the cache line containing page[0], can't find a TLB entry for it, increments dTLB-store-misses, fetches the translation, finds the page is "not present", and generates a page fault.
  • The page fault handler allocates the page and (since the page tables were modified) makes sure the stale TLB entry is invalidated (possibly just relying on the fact that Intel CPUs don't cache "not present" translations anyway, rather than explicitly doing an INVLPG). It then returns to the instruction that caused the fault so it can be retried.
  • The CPU tries to execute page[0] = 0; a second time, tries to load the cache line containing page[0], again can't find a TLB entry for it, increments dTLB-store-misses a second time, fetches the translation, and then modifies the cache line.

That is two dTLB-store-misses per iteration, which matches the ~200,000 you measured.
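One way to check this (my own sketch, not from the original answer; it assumes 4 KiB pages): store to the same page twice per iteration. If both misses come from the first, faulting store, the second store should hit the now-valid TLB entry, and the total should stay near ~2 misses per iteration rather than growing to ~4.

#include <sys/mman.h>

#define PAGE_SIZE 4096  /* assumption: 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        page[0] = 0;  /* TLB miss, page fault, retry: 2 store misses */
        page[1] = 0;  /* should hit the TLB entry installed by the retry */
        munmap(page, PAGE_SIZE);
    }
}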

For fun, you can use the MAP_POPULATE flag with mmap() to ask the kernel to prefault the page (and so avoid the page fault and the first TLB miss).
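A sketch of that variant (MAP_POPULATE is Linux-specific, and the 4 KiB page size is an assumption):

#define _GNU_SOURCE  /* MAP_POPULATE may need this on some toolchains */
#include <sys/mman.h>

#define PAGE_SIZE 4096  /* assumption: 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        /* MAP_POPULATE asks the kernel to prefault the page at mmap time,
           so the store below should not take a page fault. */
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}

Presumably you would still see about one dTLB store miss per iteration even then, since prefaulting fills the page tables but not the TLB itself.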


Update 2: I think Brendan's answer is right. I should maybe delete this, but the ocperf.py suggestion is still useful for future readers, I think. And it might explain the extra TLB misses on CPUs without process-context identifiers (PCID), with kernels that mitigate Meltdown.

Update: the guess below was wrong. New guess: mmap / munmap have to modify your process's page tables, so perhaps there is some TLB invalidation just from that. My suggestion to use ocperf.py record to figure out which asm instructions cause the TLB misses still stands. Even with optimization enabled, the code will store to the stack when pushing/popping the return address for calls to the glibc wrapper functions.


It's possible that your kernel has kernel/user page-table isolation (KPTI) enabled to mitigate Meltdown, so on return from kernel to user all TLB entries are invalidated (by changing CR3 to point to page tables that don't include the kernel mappings at all).

Look for Kernel/User page tables isolation: enabled in your dmesg output. You can try booting with kpti=off as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.
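Two quick checks, assuming a kernel recent enough to log its KPTI status and to expose the sysfs vulnerabilities directory:

 dmesg | grep 'page tables isolation'
 cat /sys/devices/system/cpu/vulnerabilities/meltdown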


Since you're using C, you're making the mmap and munmap system calls through their glibc wrappers, not with inline syscall instructions. The ret in such a wrapper has to load the return address from the stack, and that load can miss in the TLB.

The extra store misses probably come from the call instructions pushing a return address, although I'm not sure that's right, because the current stack page should already be in the TLB from the ret of the previous system call. (A way to test this by bypassing the wrappers is sketched below.)
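One way to test that theory (my illustration, not part of the original answer) is to bypass the glibc wrappers entirely with raw syscall instructions, so the loop contains no call/ret at all. This sketch assumes x86-64 Linux, where mmap is syscall 9 and munmap is syscall 11; with optimization enabled the helper should be inlined:

#define PAGE_SIZE 4096  /* assumption: 4 KiB pages */

/* Raw x86-64 Linux syscall: no call/ret, so no return-address
   stores/loads on the user stack. */
static inline long raw_syscall6(long nr, long a1, long a2, long a3,
                                long a4, long a5, long a6) {
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");
    return ret;
}

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        /* 9 = __NR_mmap; 0x3 = PROT_READ|PROT_WRITE; 0x22 = MAP_PRIVATE|MAP_ANONYMOUS */
        int *page = (int *)raw_syscall6(9, 0, PAGE_SIZE, 0x3, 0x22, -1, 0);
        page[0] = 0;
        raw_syscall6(11, (long)page, PAGE_SIZE, 0, 0, 0, 0);  /* 11 = __NR_munmap */
    }
}

If the store-miss count drops with this version, the wrapper call/ret traffic was responsible.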


You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel CPU: ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find out which instructions cause store misses. (Then use ocperf.py report -Mintel.) If report doesn't make it easy to pick which event's counts you're looking at, record with only one event.

mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most of the other TLB store events, so the counts should be attributed to the actual instruction, rather than to some later instruction as with imprecise perf events. (See Andy Glew's answer about trap vs. exception for some details on why some performance counters can't be precise; many store events aren't.)


