Rdtsc, too many cycles

    #include <stdio.h>

    static inline unsigned long long tick()
    {
        unsigned long long d;
        __asm__ __volatile__ ("rdtsc" : "=A" (d) );
        return d;
    }

    int main()
    {
        long long res;
        res=tick();
        res=tick()-res;
        printf("%d",res);
        return 0;
    }

I compiled this code with gcc at optimization levels -O0, -O1, -O2 and -O3, and I always get 2000-2500 cycles. Can anyone explain the reason for this result? Where are these cycles being spent?

The first tick function is incorrect; the version below is the correct one.

Another version of the tick function

    static __inline__ unsigned long long tick()
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
    }

This is the assembly generated with -O3:

        .file   "rdtsc.c"
        .section    .rodata.str1.1,"aMS",@progbits,1
    .LC0:
        .string "%d"
        .text
        .p2align 4,,15
    .globl main
        .type   main, @function
    main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        subl    $40, %esp
        movl    %ecx, -16(%ebp)
        movl    %ebx, -12(%ebp)
        movl    %esi, -8(%ebp)
        movl    %edi, -4(%ebp)
    #APP
    # 6 "rdtsc.c" 1
        rdtsc
    # 0 "" 2
    #NO_APP
        movl    %edx, %edi
        movl    %eax, %esi
    #APP
    # 6 "rdtsc.c" 1
        rdtsc
    # 0 "" 2
    #NO_APP
        movl    %eax, %ecx
        movl    %edx, %ebx
        subl    %esi, %ecx
        sbbl    %edi, %ebx
        movl    %ecx, 4(%esp)
        movl    %ebx, 8(%esp)
        movl    $.LC0, (%esp)
        call    printf
        movl    -16(%ebp), %ecx
        xorl    %eax, %eax
        movl    -12(%ebp), %ebx
        movl    -8(%ebp), %esi
        movl    -4(%ebp), %edi
        movl    %ebp, %esp
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (Debian 4.3.2-1.1) 4.3.2"
        .section    .note.GNU-stack,"",@progbits

This is the CPU:

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 4
    model name      : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping        : 3
    cpu MHz         : 3000.105
    cache size      : 2048 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 5
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss constant_tsc up pebs bts pni
    bogomips        : 6036.62
    clflush size    : 64
+8
c assembly x86 rdtsc
5 answers

I tried your code on several Linux distributions running on different Intel processors (admittedly, all newer than the Pentium 4 HT 630 you appear to be using). In all of these tests I got values between 25 and 50 cycles.

The only hypothesis I have that is consistent with all the evidence is that you are running your operating system inside a virtual machine rather than on bare metal, and the TSC is being virtualized.

+9

There are many reasons why you might get a large value:

  • The OS happened to do a context switch, and your process was put to sleep.
  • A disk seek occurred, and your process was put to sleep.
  • ... any of the many other reasons why your process might be left waiting.
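One common way to keep such one-off interruptions out of a measurement is to repeat it many times and keep the smallest value. The following is only a minimal sketch of that idea, not something from the question: the names `read_counter` and `min_back_to_back_delta` are made up for illustration, and the non-x86 fallback to `CLOCK_MONOTONIC` is an assumption added only so the code builds on other machines.

```c
#include <stdint.h>
#include <time.h>

/* Read a timestamp. On x86 this is the TSC, read the same way as the
   corrected tick() in the question. Elsewhere it falls back to
   CLOCK_MONOTONIC nanoseconds (an assumption, only so the sketch builds). */
static inline uint64_t read_counter(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: smallest back-to-back delta over `samples` tries.
   A sample that was hit by a context switch is simply outvoted by the
   samples that were not. */
static uint64_t min_back_to_back_delta(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t start = read_counter();
        uint64_t delta = read_counter() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}
```

Calling `min_back_to_back_delta(1000)` and printing the result with `%llu` (after a cast to `unsigned long long`) gives a figure that an occasional interruption should not inflate. If even the minimum stays in the thousands, the cause is probably systematic rather than a stray context switch.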

Note that rdtsc is not particularly reliable for timing without extra care, because:

  • Processor speeds can change, and with them the length of a cycle (when measured in seconds).
  • Different processors may have different values for the TSC at a given point in time.

Most operating systems have a high-precision clock or timing method: clock_gettime on Linux, for example, and in particular its monotonic clock. (Also understand the difference between a wall clock and a monotonic clock: a wall clock can move backward, even in UTC.) On Windows, I believe the recommendation is QueryPerformanceCounter. These clocks usually provide more than enough precision for most needs.


Also, looking at the assembly, it looks like you are only getting 32 bits of the answer: I don't see %edx being saved after rdtsc.


Running your code, I get timings of 120-150 ns for clock_gettime using CLOCK_MONOTONIC, and 70-90 cycles for rdtsc (about 20 ns at full speed, but I suspect the processor is clocked down, so it is really about 50 ns). (This was on a desktop sitting at roughly 20% CPU usage; I was connected over SSH and forgot which machine I was on!) Are you sure your machine isn't bogged down?

+6

It looks like your OS has disabled RDTSC execution in user space, so your application has to switch to the kernel and back, which takes a lot of cycles.

This is from the Intel Software Developer's Manual:

When in protected or virtual 8086 mode, the time stamp disable (TSD) flag in register CR4 restricts the use of the RDTSC instruction as follows. When the TSD flag is clear, the RDTSC instruction can be executed at any privilege level; when the flag is set, the instruction can only be executed at privilege level 0. (When in real-address mode, the RDTSC instruction is always enabled.)
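On Linux, one way to check this hypothesis from user space is `prctl(PR_GET_TSC, ...)`, which reports whether the calling process is allowed to execute RDTSC. The sketch below is Linux-specific, and the wrapper name `rdtsc_user_mode_state` is made up for illustration. It only covers the per-process setting the kernel exposes, not what a hypervisor or another OS might do.

```c
#include <sys/prctl.h>

/* Returns PR_TSC_ENABLE if this process may execute RDTSC,
   PR_TSC_SIGSEGV if RDTSC would raise SIGSEGV, or -1 if the kernel
   cannot report it (for example on a non-x86 machine). */
static int rdtsc_user_mode_state(void)
{
    int mode = 0;
    if (prctl(PR_GET_TSC, &mode) != 0)
        return -1;
    return mode;
}
```

Note that with `PR_TSC_SIGSEGV` the process receives SIGSEGV when it executes RDTSC, so on Linux a disabled RDTSC would normally show up as a crash rather than a slow read. A result of `PR_TSC_ENABLE` therefore does not rule out other kinds of trapping, such as a virtual machine intercepting the instruction.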

Edit:

In response to aix's comment, here is why I think TSD is the most likely cause.

I know of only these ways for a program to take longer than usual to execute a single instruction:

  • running under some emulator,
  • using self-modifying code,
  • a context switch,
  • a kernel switch.

The first two usually cannot delay execution by more than a few hundred cycles. 2000-2500 cycles is more typical of a context or kernel switch. But it is practically impossible to hit a context switch several times at the same place, so it must be a kernel switch. This means that either the program is running under a debugger or RDTSC is not allowed in user mode.

The most likely reason for an OS to disable RDTSC is security. There have been attempts to use RDTSC to attack encryption programs.

+4

Instruction cache miss? (this is my guess)

Also, possibly:

A switch to a hypervisor in a virtualized system? Remnants of program bootstrap (including network activity on the same processor)?

To Thanatos: on systems more recent than 2008, rdtsc behaves like a wall clock and does not vary with frequency steps.
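On Linux, whether the TSC ticks at a constant rate is advertised through the `constant_tsc` flag in /proc/cpuinfo (the CPU in the question lists it). Here is a rough sketch of checking for that flag; the helper name `cpuinfo_has_flag` is hypothetical, and it assumes the usual "flags : a b c" line layout.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `flag` appears as a whole word on a
   "flags" line of the cpuinfo-style file at `path`, 0 if it does not,
   and -1 if the file cannot be opened. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[8192];
    int found = 0;
    while (!found && fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) != 0)
            continue;
        char *list = strchr(line, ':');
        if (!list)
            continue;
        /* Split the flag list on whitespace and compare each token. */
        for (char *tok = strtok(list + 1, " \t\n"); tok;
             tok = strtok(NULL, " \t\n")) {
            if (strcmp(tok, flag) == 0) {
                found = 1;
                break;
            }
        }
    }
    fclose(f);
    return found;
}
```

A call such as `cpuinfo_has_flag("/proc/cpuinfo", "constant_tsc")` returns 1 on machines whose kernel reports the flag. Matching whole tokens matters here, since a plain substring search for `tsc` would also match `constant_tsc`.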

Can you try this small piece of code?

    #include <stdio.h>

    // tick() is the function from the question.
    int main()
    {
        long long res;
        // Change the exact timing of stdout, in case there is something to
        // write over an SSH connection, together with its interrupts.
        fflush(stdout);
        for (int pass = 0; pass < 2; pass++) {
            res=tick();
            res=tick()-res;
        }
        // Ignore the result of the first pass; display the second.
        printf("%lld",res);
        return 0;
    }
+1

Just an idea: maybe the two rdtsc instructions are executed on different cores? rdtsc values may differ slightly between cores.
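One way to rule this out on Linux is to pin the process to a single CPU before taking the two readings. This is a minimal sketch, assuming glibc with `_GNU_SOURCE`; the helper name `pin_to_first_allowed_cpu` is made up for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the first CPU it is
   currently allowed to run on. Returns that CPU number, or -1 on error. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Calling this once at the start of `main`, before the first `tick()`, means both readings come from the same core. The sketch picks the first allowed CPU rather than hard-coding CPU 0, since CPU 0 may not be in the allowed set inside a container. The machine in the question reports the `up` flag (a uniprocessor kernel), so there this cause seems unlikely anyway.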

0
