Why does an increment statement run faster than a no-op under the VS debugger?

Here is the code without explanation (performing the operation a billion times):

    int k = 0;

    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int a = 0; a < 1000; a++)
        for (int b = 0; b < 1000; b++)
            for (int c = 0; c < 1000; c++)
                k++;
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);

    sw = new Stopwatch();
    sw.Start();
    for (int a = 0; a < 1000; a++)
        for (int b = 0; b < 1000; b++)
            for (int c = 0; c < 1000; c++)
                ; // NO-OP
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);

The results (at least on my computer) are somewhere around the following, in milliseconds:

    2168
    2564

The second (no-op) timing is consistently about half a second longer.

How is it possible that incrementing a variable a billion times is faster than executing a no-op the same number of times?

EDIT: This only happens in DEBUG builds. In RELEASE the results come out the right way around and the first (increment) loop takes longer, at least on my machine. As noted in the comments, someone ran into this even in a RELEASE build. But what does DEBUG do that creates this effect?

+6
3 answers

The problem is, as Azodious mentioned, that you cannot use debug mode to measure timings, because the results will be inaccurate.

In release mode, I get the following numbers:

Increment k : 445

NOP : 402

The increment version has 4 more IL instructions:

    IL_0001: ldc.i4.0
    IL_0002: stloc.0
    IL_0003: ldc.i4.0
    IL_0004: stloc.1
    IL_0005: br.s IL_003B
    IL_0007: ldc.i4.0
    IL_0008: stloc.2
    IL_0009: br.s IL_0029
    IL_000B: ldc.i4.0
    IL_000C: stloc.3
    IL_000D: br.s IL_0017
    IL_000F: ldloc.0
    IL_0010: ldc.i4.1
    IL_0011: add
    IL_0012: stloc.0
    IL_0013: ldloc.3
    IL_0014: ldc.i4.1
    IL_0015: add
    IL_0016: stloc.3
    IL_0017: ldloc.3
    IL_0018: ldc.i4 E8 03 00 00
    IL_001D: clt
    IL_001F: stloc.s 04
    IL_0021: ldloc.s 04
    IL_0023: brtrue.s IL_000F
    IL_0025: ldloc.2
    IL_0026: ldc.i4.1
    IL_0027: add
    IL_0028: stloc.2
    IL_0029: ldloc.2
    IL_002A: ldc.i4 E8 03 00 00
    IL_002F: clt
    IL_0031: stloc.s 04
    IL_0033: ldloc.s 04
    IL_0035: brtrue.s IL_000B
    IL_0037: ldloc.1
    IL_0038: ldc.i4.1
    IL_0039: add
    IL_003A: stloc.1
    IL_003B: ldloc.1
    IL_003C: ldc.i4 E8 03 00 00
    IL_0041: clt
    IL_0043: stloc.s 04
    IL_0045: ldloc.s 04
    IL_0047: brtrue.s IL_0007

The NOP version has the same number of branches, but fewer add instructions:

    IL_0001: ldc.i4.0
    IL_0002: stloc.0
    IL_0003: ldc.i4.0
    IL_0004: stloc.1
    IL_0005: br.s IL_0037
    IL_0007: ldc.i4.0
    IL_0008: stloc.2
    IL_0009: br.s IL_0025
    IL_000B: ldc.i4.0
    IL_000C: stloc.3
    IL_000D: br.s IL_0013
    IL_000F: ldloc.3
    IL_0010: ldc.i4.1
    IL_0011: add
    IL_0012: stloc.3
    IL_0013: ldloc.3
    IL_0014: ldc.i4 E8 03 00 00
    IL_0019: clt
    IL_001B: stloc.s 04
    IL_001D: ldloc.s 04
    IL_001F: brtrue.s IL_000F
    IL_0021: ldloc.2
    IL_0022: ldc.i4.1
    IL_0023: add
    IL_0024: stloc.2
    IL_0025: ldloc.2
    IL_0026: ldc.i4 E8 03 00 00
    IL_002B: clt
    IL_002D: stloc.s 04
    IL_002F: ldloc.s 04
    IL_0031: brtrue.s IL_000B
    IL_0033: ldloc.1
    IL_0034: ldc.i4.1
    IL_0035: add
    IL_0036: stloc.1
    IL_0037: ldloc.1
    IL_0038: ldc.i4 E8 03 00 00
    IL_003D: clt
    IL_003F: stloc.s 04
    IL_0041: ldloc.s 04
    IL_0043: brtrue.s IL_0007

Both versions were compiled without optimizations, because I wanted to see exactly what is happening.

The only real difference between them is the increment of k itself:

    IL_000F: ldloc.0
    IL_0010: ldc.i4.1
    IL_0011: add
    IL_0012: stloc.0

Simply put: you get weird numbers because you are in debug mode.
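If you want to confirm what you are actually measuring, a small check like the one below (just a sketch; the BuildInfo helper name is mine) reports whether a debugger is attached and whether the JIT optimizer was disabled for the entry assembly. DebuggableAttribute and Debugger.IsAttached are standard .NET APIs.

    using System;
    using System.Diagnostics;
    using System.Reflection;

    static class BuildInfo
    {
        // Reports the two conditions that most often invalidate timings.
        public static void Report()
        {
            var attr = (DebuggableAttribute)Attribute.GetCustomAttribute(
                Assembly.GetEntryAssembly(), typeof(DebuggableAttribute));

            Console.WriteLine("Debugger attached:      " + Debugger.IsAttached);
            Console.WriteLine("JIT optimizer disabled: " +
                (attr != null && attr.IsJITOptimizerDisabled));
        }
    }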

+4

Aside from benchmarking the wrong build, the core mistake you made is assuming that you measured the cost of the increment operator. You did not; you measured the cost of the for() loops, which take many more processor cycles than the increment does.
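If the goal is to estimate the cost of the increment itself, one rough approach (a sketch of mine, not the original benchmark) is to amortize the loop overhead by repeating the increment several times inside one loop body and comparing against an empty loop:

    using System;
    using System.Diagnostics;

    class IncrementCost
    {
        static void Main()
        {
            const int iterations = 100000000;
            int k = 0;

            // Baseline: loop overhead only.
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++) { }
            sw.Stop();
            Console.WriteLine("Loop only:            " + sw.ElapsedMilliseconds + " ms");

            // Same loop with ten increments per iteration, so the branch cost
            // is spread over ten increments instead of dominating a single one.
            sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                k++; k++; k++; k++; k++;
                k++; k++; k++; k++; k++;
            }
            sw.Stop();
            Console.WriteLine("Loop + 10 increments: " + sw.ElapsedMilliseconds + " ms");

            Console.WriteLine(k); // keep k observable so the optimizer cannot drop the work
        }
    }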

The trouble with the for() loop is that the processor is forced to branch, jumping back to the start of the loop. Modern processors don't like branches much; they are optimized to execute code sequentially. That is a side effect of pipelining, the core architectural technique that lets a processor execute code quickly. A branch can force the processor to flush the pipeline, throwing away a lot of work that turned out to be useless. A lot of resources inside a processor are devoted to reducing the cost of pipeline flushes. The main one is the branch predictor: it tries to guess ahead of time which way a branch will go, so it can fill the pipeline with the instructions that are likely to execute. Guessing wrong is very expensive. You have little to fear if the for() loop runs long enough, since its branch is taken almost every time and is predicted well.
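To see how expensive a wrong guess is (a generic illustration, not code from the question), compare the same data-dependent branch over unsorted and then sorted data; the loop branch is predicted almost perfectly in both cases, the if is not:

    using System;
    using System.Diagnostics;

    class BranchPredictionDemo
    {
        static long SumLargeValues(int[] data)
        {
            long sum = 0;
            for (int i = 0; i < data.Length; i++)
                if (data[i] >= 128)   // data-dependent branch
                    sum += data[i];
            return sum;
        }

        static void Main()
        {
            var rnd = new Random(42);
            var data = new int[10000000];
            for (int i = 0; i < data.Length; i++) data[i] = rnd.Next(256);

            var sw = Stopwatch.StartNew();
            SumLargeValues(data);             // unsorted: roughly 50% mispredictions
            Console.WriteLine("Unsorted: " + sw.ElapsedMilliseconds + " ms");

            Array.Sort(data);                 // sorted: the branch becomes predictable
            sw = Stopwatch.StartNew();
            SumLargeValues(data);
            Console.WriteLine("Sorted:   " + sw.ElapsedMilliseconds + " ms");
        }
    }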

Another problem with modern processors is that they are quite sensitive to the alignment of the branch target, in other words the address of the instruction at the start of the loop. If it is misaligned, not at an address divisible by 4 or 8, the prefetch unit needs extra cycles before it starts decoding the right instruction. This is an implementation detail that the jitter is supposed to take care of; it may need to insert extra NOP instructions to align the target. The x86 jitter does not perform this optimization, the x64 jitter does.

The observable side effect of alignment problems is that simply swapping the two pieces of code can change your measurements.

Benchmarking code on a modern processor is a hazardous adventure; the odds that the synthetic version of the code you profile tells you anything about the real code are not very good. Differences of 15% or less are not statistically significant.
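A minimal way to make such measurements somewhat more trustworthy (a sketch; the warm-up and repetition counts are arbitrary) is to build in Release, warm the code up so it is JIT-compiled before timing, repeat the run, and take the median instead of a single reading:

    using System;
    using System.Diagnostics;

    static class MiniBench
    {
        public static long MedianMilliseconds(Action action, int warmups = 3, int runs = 15)
        {
            for (int i = 0; i < warmups; i++)
                action();                          // trigger JIT compilation, warm caches

            var times = new long[runs];
            for (int i = 0; i < runs; i++)
            {
                var sw = Stopwatch.StartNew();
                action();
                sw.Stop();
                times[i] = sw.ElapsedMilliseconds;
            }

            Array.Sort(times);
            return times[runs / 2];                // median is robust to outliers
        }
    }

Feed it the two loops from the question as delegates and compare the medians; even then, differences within roughly 15% are best treated as noise.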

+1

I ran it 3 times, and the outputs were:

3786

3252


3800

3256


3840

3255

So, if you are making decisions based on statistics collected in debug mode: don't.

Debug mode attaches a lot of extra data to the code to help the debugger while you are debugging.
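A cheap guard (my own addition, not from this answer) is to warn when the numbers are about to be collected under a debugger or from a DEBUG build:

    using System;
    using System.Diagnostics;

    static class BenchmarkGuard
    {
        public static void WarnIfUnreliable()
        {
            if (Debugger.IsAttached)
                Console.WriteLine("Warning: debugger attached; timings will be misleading.");
    #if DEBUG
            Console.WriteLine("Warning: DEBUG build; rebuild in Release before benchmarking.");
    #endif
        }
    }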

0

Source: https://habr.com/ru/post/925631/

