How to synchronize TSC across cores?

Using:

    inline uint64_t rdtsc() {
        uint32_t cycles_high;
        uint32_t cycles_low;
        /* CPUID serializes the pipeline before RDTSC reads the counter. */
        asm volatile ("CPUID\n\t"
                      "RDTSC\n\t"
                      "mov %%edx, %0\n\t"
                      "mov %%eax, %1\n\t"
                      : "=r" (cycles_high), "=r" (cycles_low)
                      :
                      : "%rax", "%rbx", "%rcx", "%rdx");
        return ( ((uint64_t)cycles_high << 32) | cycles_low );
    }

Thread 1 runs:

    while(globalIndex < COUNT) {
        /* spin until the index is odd (this thread's turn) */
        while(globalIndex %2 == 0 && globalIndex < COUNT)
            ;
        cycles[globalIndex][0] = rdtsc();
        cycles[globalIndex][1] = cpuToBindTo;
        __sync_add_and_fetch(&globalIndex,1);
    }

Thread 2 runs:

    while(globalIndex < COUNT) {
        /* spin until the index is even (this thread's turn) */
        while(globalIndex %2 == 1 && globalIndex < COUNT)
            ;
        cycles[globalIndex][0] = rdtsc();
        cycles[globalIndex][1] = cpuToBindTo;
        __sync_add_and_fetch(&globalIndex,1);
    }

I see

    CPU   rdtsc()            t1-t0
    11 =  5023231563212740   990
    03 =  5023231563213730   310
    11 =  5023231563214040   990
    03 =  5023231563215030   310
    11 =  5023231563215340   990
    03 =  5023231563216330   310
    11 =  5023231563216640   990
    03 =  5023231563217630   310
    11 =  5023231563217940   990
    03 =  5023231563218930   310
    11 =  5023231563219240   990
    03 =  5023231563220230   310
    11 =  5023231563220540   990
    03 =  5023231563221530   310
    11 =  5023231563221840   990
    03 =  5023231563222830   310
    11 =  5023231563223140   990
    03 =  5023231563224130   310
    11 =  5023231563224440   990
    03 =  5023231563225430   310
    11 =  5023231563225740   990
    03 =  5023231561739842   310
    11 =  5023231561740152   990
    03 =  5023231561741142   310
    11 =  5023231561741452   12458
    03 =  5023231561753910   458
    11 =  5023231561754368   1154
    03 =  5023231561755522   318
    11 =  5023231561755840   982
    03 =  5023231561756822   310
    11 =  5023231561757132   990
    03 =  5023231561758122   310
    11 =  5023231561758432   990
    03 =  5023231561759422   310

I'm not sure how I got the ping-pong latency of 12458, but I was wondering why I see 310-990-310 instead of 650-650-650. I thought the TSC should be synchronized across all cores. My constant_tsc CPU flag is on.
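For reference, the 650 figure comes from averaging the two alternating deltas. If the handoff between the threads cost the same number of cycles in both directions and one core's counter ran ahead of the other's by a fixed amount, the two deltas would be latency plus offset and latency minus offset. A minimal sketch of that arithmetic, assuming symmetric handoff latency (the names `tsc_split` and `split_deltas` are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: splits the two alternating deltas (in cycles)
 * into a one-way latency and an apparent TSC offset.
 * Assumes the handoff costs the same in both directions, which may
 * not hold on real hardware. */
struct tsc_split {
    int64_t latency; /* (delta_ab + delta_ba) / 2 */
    int64_t offset;  /* (delta_ab - delta_ba) / 2 */
};

static struct tsc_split split_deltas(int64_t delta_ab, int64_t delta_ba)
{
    struct tsc_split s;
    s.latency = (delta_ab + delta_ba) / 2;
    s.offset  = (delta_ab - delta_ba) / 2;
    return s;
}
```

With the numbers above, `split_deltas(990, 310)` yields a latency of 650 and an apparent offset of 340 cycles. The same pattern could also come from a genuinely asymmetric handoff, so the split is only as good as the symmetry assumption.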

+4
2 answers

What are you running this code on? TSC synchronization has to be done by the OS/kernel and is hardware dependent. For example, you may be able to pass a flag such as powernow-k8.tscsync=1 to the kernel boot options through your bootloader.

You need to find the right TSC synchronization method for your combination of OS and hardware. By and large, this is all automated - I wouldn't be surprised if you are running a custom kernel or non-i686 hardware?

If you search Google with the right terms, you will find many resources, such as mailing-list discussions of this topic. For example, one algorithm is discussed here (although it is apparently not very good). However, this is not something userland developers should need to worry about - it is arcane wizardry that only kernel developers should have to bother their heads with.

Basically, it is the job of the OS at boot time to synchronize the TSC counters between all processors and/or cores on an SMP machine, within a certain margin of error. If you see wildly different numbers, something is wrong with TSC synchronization, and your time would be better spent figuring out why your OS did not synchronize the TSCs correctly than trying to implement your own TSC synchronization algorithm.
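One quick way to see whether the kernel itself trusts the TSC, assuming Linux: the active clocksource is exposed in sysfs, and the kernel switches away from "tsc" when it considers the counters unstable. The sketch below reads that file; the helper names `read_first_line` and `kernel_uses_tsc` are made up for illustration, and the sysfs path in the note after the code is the usual Linux location, which other systems will not have.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file into buf, stripping the newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the clocksource named in the file is "tsc", 0 if it is
 * something else, -1 if the file is unavailable. */
static int kernel_uses_tsc(const char *path)
{
    char name[64];
    if (read_first_line(path, name, sizeof name) != 0)
        return -1;
    return strcmp(name, "tsc") == 0;
}
```

On a typical Linux box you would call `kernel_uses_tsc("/sys/devices/system/clocksource/clocksource0/current_clocksource")`. A result of 0 suggests the kernel fell back to another clocksource, which is a hint to look in the kernel log for the reason.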

+1

Do you have a NUMA memory architecture? The global counter may be located in RAM that is a couple of hops away for one of the processors and local for the other. You can test this by pinning your threads to cores on the same NUMA node.
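A minimal sketch of the pinning step, assuming Linux with glibc (the helper name `pin_to_cpu` is made up for illustration). Which CPU numbers share a NUMA node is machine-specific; tools such as `lscpu` or `numactl --hardware` report the layout.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for CPU_SET and sched_setaffinity on glibc */
#endif
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by
 * sched_setaffinity). Linux-specific. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof set, &set);
}
```

Each measuring thread would call `pin_to_cpu` with its own CPU number before entering the loop. Repeating the test with two CPUs from the same node and then with two from different nodes should show whether the asymmetry follows the memory topology.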

EDIT: I assumed this because the timing differed depending on the processor.

EDIT: Regarding TSC sync: I don't know of an easy way, which doesn't mean there isn't one! What happens if you take core 1 as the reference and then compare core 2 against it? If you did this comparison many times and took the minimum, you should get a good approximation. That should handle the case where you get preempted in the middle of a comparison.
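One way to read that suggestion, as a rough sketch: the reference core reads its TSC, the other core reads its own, and the reference core reads again. Over many such exchanges, keep the one with the smallest round trip, since that is the one least likely to have been disturbed. Estimating the offset from the midpoint of the round trip is my own assumption here (it is the same idea clock-sync protocols use), and `tsc_sample` and `estimate_offset` are made-up names. Collecting the samples is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* One exchange: the reference core reads its TSC (t_send), the other
 * core reads its own TSC (t_remote), then the reference core reads
 * again (t_recv). */
struct tsc_sample {
    uint64_t t_send;
    uint64_t t_remote;
    uint64_t t_recv;
};

/* Returns the estimated offset (remote minus reference, in cycles)
 * from the exchange with the smallest round trip, or 0 if n is 0.
 * Assumes t_recv >= t_send in every sample and that the two legs of
 * the fastest exchange took about the same time. */
static int64_t estimate_offset(const struct tsc_sample *s, size_t n)
{
    uint64_t best_rtt = UINT64_MAX;
    int64_t best_offset = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t rtt = s[i].t_recv - s[i].t_send;
        if (rtt < best_rtt) {
            uint64_t mid = s[i].t_send + rtt / 2;
            best_rtt = rtt;
            /* unsigned subtraction wraps, the cast restores the sign */
            best_offset = (int64_t)(s[i].t_remote - mid);
        }
    }
    return best_offset;
}
```

An exchange that was preempted shows up with a large round trip and is simply ignored, which is the point of taking the minimum.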

0

Source: https://habr.com/ru/post/1411136/

