Update 2: I think Brendan's answer is right. I should probably delete this, but I think the ocperf.py suggestion is still useful for future readers. And this might explain the extra TLB misses on processors without process-context identifiers (PCID) running a kernel that mitigates Meltdown.
Update: The guess below was wrong. New guess: mmap has to modify your process's page tables, so perhaps there is some TLB invalidation just from that. My recommendation to use ocperf.py record to find out which asm instructions cause the TLB misses still stands. Even with optimization enabled, the code will store to the stack when pushing the return address for calls to the glibc wrapper functions (and load from it when popping).
Perhaps your kernel has kernel / user page-table isolation enabled to mitigate Meltdown, so that on return from the kernel to user space all TLB entries have been invalidated (by changing CR3 to point to page tables that do not include the kernel mappings at all).
Look for Kernel/User page tables isolation: enabled in your dmesg output. You can try booting with pti=off (or nopti) as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.
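If the boot messages have already scrolled out of the dmesg buffer, the kernel's sysfs vulnerability report is another place to look. The sketch below is only an illustration: it assumes a Linux kernel new enough (4.15 or later) to expose /sys/devices/system/cpu/vulnerabilities/meltdown, and that the file reads something like "Mitigation: PTI" when isolation is active. The function names are hypothetical helpers, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Assumed sysfs path; present on Linux 4.15+ but not guaranteed everywhere. */
#define MELTDOWN_SYSFS "/sys/devices/system/cpu/vulnerabilities/meltdown"

/* Hypothetical helper: returns 1 if a status line mentions PTI, else 0. */
int status_mentions_pti(const char *status)
{
    return strstr(status, "PTI") != NULL;
}

/* Hypothetical helper: returns 1 if the kernel reports PTI as the Meltdown
 * mitigation, 0 if it reports something else, -1 if the file is unreadable. */
int pti_enabled(void)
{
    char line[256];
    FILE *f = fopen(MELTDOWN_SYSFS, "r");

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return status_mentions_pti(line);
}
```

Calling pti_enabled() from a test program and printing the result before each benchmark run would make it harder to mix up measurements taken with and without isolation.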
Because you are using C, you make the mmap and munmap system calls through their glibc wrapper functions, not with inline syscall instructions. The ret instruction in that wrapper has to load the return address from the stack, and that load misses in the TLB.
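To make the call path concrete, here is a minimal sketch of the kind of code being discussed. It is an assumption about what the test program looks like, not the asker's actual code, and the function names are hypothetical. Each mmap and munmap here is an ordinary function call into glibc, so a return address is pushed before the syscall and loaded again by the wrapper's ret afterwards.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one anonymous page, store to it, unmap it.
 * Returns 0 on success, -1 if any step fails. */
int map_touch_unmap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    /* Call into the glibc mmap wrapper: the call pushes a return address,
     * and the wrapper's ret loads it back from the stack after the syscall. */
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* First store to the new mapping; this takes a page fault. */
    *(volatile char *)p = 1;

    /* Same pattern again: wrapper call, syscall, ret. */
    return munmap(p, (size_t)page);
}

/* Hypothetical helper: repeat the cycle and return how many rounds succeeded. */
int repeat_map_touch_unmap(int iterations)
{
    int ok = 0;
    for (int i = 0; i < iterations; i++) {
        if (map_touch_unmap() == 0)
            ok++;
    }
    return ok;
}
```

Running a loop like repeat_map_touch_unmap(100000) under the profiler gives the sampled events enough repetitions to show which instructions in the wrappers they land on.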
The extra store misses probably come from call instructions pushing a return address, although I'm not sure that is right, because the current stack page should already be in the TLB from the ret of the previous system call.
You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel processor, use ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find out which instructions cause the store misses. (Then use ocperf.py report -Mintel.) If report does not make it easy to select which event to see counts for, record with only a single event.
mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most other store TLB events, so the counts should be attributed to the real instruction, not to some later instruction as with imprecise perf events. (See Andy Glew's trap vs. exception answer for some details on why some performance counters can't easily be precise; many store events are not.)