Using Visual C++ 2010, I get the same timing results as reported in the comments above: on average, the second loop takes about 80% of the first loop's execution time. Once or twice the first loop was slightly faster, but that could be due to an OS scheduling hiccup. Checking the disassembly showed the following:
First loop:

    01231120  cmp   dword ptr [ebp-38h],esi
    01231123  jbe   main+1CBh (123120Bh)
    01231129  cmp   dword ptr [ebp-34h],10h
    0123112D  mov   eax,dword ptr [ebp-48h]
    01231130  jae   main+0F5h (1231135h)
    01231132  lea   eax,[ebp-48h]
    01231135  cmp   byte ptr [eax+esi],62h
    01231139  jne   main+108h (1231148h)
    0123113B  mov   ebx,1
    01231140  lea   eax,[ebp-80h]
    01231143  call  std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append (1231250h)
    01231148  inc   esi
    01231149  cmp   esi,5F5E100h
    0123114F  jl    main+0E0h (1231120h)
Second loop:

    01231155  cmp   dword ptr [ebp-1Ch],esi
    01231158  jbe   main+1CBh (123120Bh)
    0123115E  cmp   dword ptr [ebp-18h],10h
    01231162  mov   eax,dword ptr [ebp-2Ch]
    01231165  jae   main+12Ah (123116Ah)
    01231167  lea   eax,[ebp-2Ch]
    0123116A  cmp   byte ptr [eax+esi],62h
    0123116E  jne   main+13Dh (123117Dh)
    01231170  mov   ebx,1
    01231175  lea   eax,[ebp-64h]
    01231178  call  std::basic_string<char,std::char_traits<char>,std::allocator<char> >::append (1231250h)
    0123117D  inc   esi
    0123117E  cmp   esi,5F5E100h
    01231184  jl    main+115h (1231155h)
Since the generated assembly is virtually identical, I started thinking about throttling mechanisms in the OS or CPU, and guess what? Adding Sleep(5000); between the two loops made the second loop (almost) always slower than the first. Over 20 runs, the second loop then takes about 150% of the first loop's run time on average.
EDIT: Increasing the loop count by a factor of five gives the same results. I assume that run times of about 0.5 s are more or less reliably measurable. :-)
My take is that the OS may need a few time slices to detect the sustained CPU utilization before it raises the thread's scheduling priority, and the CPU clock may ramp up only after that, leaving parts of the first loop running "underclocked". By the time the second loop starts, the OS and CPU may be primed for the heavy workload and execute it a bit faster. Something similar could happen with MMU page handling or the OS's internal memory management. Adding the Sleep between the loops may cause the opposite: the OS deprioritizes the thread until it detects the new workload, which makes the second loop slower.
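If the clock-ramp hypothesis is right, a busy warm-up phase before the first measurement should level the two loops. A minimal sketch of such a warm-up; the 200 ms duration is a guess, not a measured value:

```cpp
#include <chrono>

// Busy-spin for ~200 ms before timing anything, so the frequency governor
// has time to raise the CPU clock to its steady-state level.
inline void warm_up() {
    volatile unsigned long long sink = 0;  // volatile keeps the spin from being optimized away
    auto end = std::chrono::steady_clock::now() + std::chrono::milliseconds(200);
    while (std::chrono::steady_clock::now() < end)
        ++sink;
}
```

Calling warm_up() once before the first timed loop would separate "cold CPU" effects from genuine differences between the loops.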
What are your results? Does anyone have a suitable profiler at hand, e.g. Intel VTune Amplifier, to measure CPI and the CPU clock frequency during the two loops?