Sample code showing that the uint cast is more efficient than range checking

So, I have been looking into this issue, and the general consensus seems to be that the uint-cast version is more efficient than checking the range against 0. Since the cast is also used in Microsoft's List&lt;T&gt; implementation, I assume it is a real optimization. However, I have not been able to write sample code in which the uint version actually performs better. I tried different tests, but either something is missing or some other part of my code overshadows the time spent on the check. My last attempt looks like this:

    class TestType
    {
        public TestType(int size)
        {
            MaxSize = size;
            Random rand = new Random(100);
            for (int i = 0; i < MaxIterations; i++)
            {
                indexes[i] = rand.Next(0, MaxSize);
            }
        }

        public const int MaxIterations = 10000000;
        private int MaxSize;
        private int[] indexes = new int[MaxIterations];

        public void Test()
        {
            var timer = new Stopwatch();
            int inRange = 0;
            int outOfRange = 0;

            timer.Start();
            for (int i = 0; i < MaxIterations; i++)
            {
                int x = indexes[i];
                if (x < 0 || x > MaxSize)
                {
                    throw new Exception();
                }
                inRange += indexes[x];
            }
            timer.Stop();
            Console.WriteLine("Comparison 1: " + inRange + "/" + outOfRange +
                ", elapsed: " + timer.ElapsedMilliseconds + "ms");

            inRange = 0;
            outOfRange = 0;
            timer.Reset();
            timer.Start();
            for (int i = 0; i < MaxIterations; i++)
            {
                int x = indexes[i];
                if ((uint)x > (uint)MaxSize)
                {
                    throw new Exception();
                }
                inRange += indexes[x];
            }
            timer.Stop();
            Console.WriteLine("Comparison 2: " + inRange + "/" + outOfRange +
                ", elapsed: " + timer.ElapsedMilliseconds + "ms");
        }
    }

    class Program
    {
        static void Main()
        {
            TestType t = new TestType(TestType.MaxIterations);
            t.Test();
            TestType t2 = new TestType(TestType.MaxIterations);
            t2.Test();
            TestType t3 = new TestType(TestType.MaxIterations);
            t3.Test();
        }
    }

The code is a bit of a mess because I tried a lot of things to make the uint check measurably faster: moving the compared variable into a class field, generating random access patterns for the index, and so on, but in every case the result looks the same for both versions. So, does this optimization still apply on modern x86 processors, and can anyone demonstrate it somehow?
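For what it's worth, the equivalence of the two checks is easy to verify on its own. This is my own minimal sketch, not part of the benchmark (it uses `>=`, as the List&lt;T&gt; source does, rather than the `>` in my sample; the class and method names are made up for illustration):

```csharp
using System;

class UIntCastEquivalence
{
    // A negative int reinterpreted as uint becomes a huge value
    // (e.g. (uint)(-1) == 4294967295), so one unsigned comparison
    // covers both the lower and the upper bound at once.
    static bool TwoCompares(int index, int size) => index >= 0 && index < size;
    static bool OneCompare(int index, int size) => (uint)index < (uint)size;

    static void Main()
    {
        int size = 10;
        foreach (int index in new[] { -1, 0, 5, 9, 10, int.MinValue, int.MaxValue })
        {
            if (TwoCompares(index, size) != OneCompare(index, size))
                throw new Exception("checks disagree for " + index);
        }
        Console.WriteLine("Both checks agree on every tested index.");
    }
}
```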

Please note that I am not asking anyone to correct my sample or explain what is wrong with it; I just want to see a case where the optimization works.

+6
c# clr jit
Mar 31 '15 at 21:23
3 answers

I would suggest trying code that does not throw an exception when the index is out of range. Exceptions are incredibly expensive and can completely skew your benchmark results.

The code below benchmarks each version over 1,000,000 calls and averages the result across 1,000 iterations.

    using System;
    using System.Diagnostics;

    namespace BenchTest
    {
        class Program
        {
            const int LoopCount = 1000000;
            const int AverageCount = 1000;

            static void Main(string[] args)
            {
                Console.WriteLine("Starting Benchmark");
                RunTest();
                Console.WriteLine("Finished Benchmark");
                Console.Write("Press any key to exit...");
                Console.ReadKey();
            }

            static void RunTest()
            {
                int cursorRow = Console.CursorTop;
                int cursorCol = Console.CursorLeft;

                long totalTime1 = 0;
                long totalTime2 = 0;
                long invalidOperationCount1 = 0;
                long invalidOperationCount2 = 0;

                for (int i = 0; i < AverageCount; i++)
                {
                    Console.SetCursorPosition(cursorCol, cursorRow);
                    Console.WriteLine("Running iteration: {0}/{1}", i + 1, AverageCount);

                    int[] indexArgs = RandomFill(LoopCount, int.MinValue, int.MaxValue);
                    int[] sizeArgs = RandomFill(LoopCount, 0, int.MaxValue);

                    totalTime1 += RunLoop(TestMethod1, indexArgs, sizeArgs, ref invalidOperationCount1);
                    totalTime2 += RunLoop(TestMethod2, indexArgs, sizeArgs, ref invalidOperationCount2);
                }

                PrintResult("Test 1", TimeSpan.FromTicks(totalTime1 / AverageCount), invalidOperationCount1);
                PrintResult("Test 2", TimeSpan.FromTicks(totalTime2 / AverageCount), invalidOperationCount2);
            }

            static void PrintResult(string testName, TimeSpan averageTime, long invalidOperationCount)
            {
                Console.WriteLine(testName);
                Console.WriteLine("  Average Time: {0}", averageTime);
                Console.WriteLine("  Invalid Operations: {0} ({1})", invalidOperationCount,
                    (invalidOperationCount / (double)(AverageCount * LoopCount)).ToString("P3"));
            }

            static long RunLoop(Func<int, int, int> testMethod, int[] indexArgs, int[] sizeArgs, ref long invalidOperationCount)
            {
                Stopwatch sw = new Stopwatch();
                Console.Write("Running {0} sub-iterations", LoopCount);

                sw.Start();
                long startTickCount = sw.ElapsedTicks;
                for (int i = 0; i < LoopCount; i++)
                {
                    invalidOperationCount += testMethod(indexArgs[i], sizeArgs[i]);
                }
                sw.Stop();
                long stopTickCount = sw.ElapsedTicks;
                long elapsedTickCount = stopTickCount - startTickCount;

                Console.WriteLine(" - Time Taken: {0}", new TimeSpan(elapsedTickCount));
                return elapsedTickCount;
            }

            static int[] RandomFill(int size, int minValue, int maxValue)
            {
                int[] randomArray = new int[size];
                Random rng = new Random();
                for (int i = 0; i < size; i++)
                {
                    randomArray[i] = rng.Next(minValue, maxValue);
                }
                return randomArray;
            }

            static int TestMethod1(int index, int size)
            {
                return (index < 0 || index >= size) ? 1 : 0;
            }

            static int TestMethod2(int index, int size)
            {
                return ((uint)index >= (uint)size) ? 1 : 0;
            }
        }
    }
+3
Mar 31 '15 at 22:44
  if (x < 0 || x > MaxSize) 

The comparison is performed by the processor's CMP (compare) instruction. You'll want to take a look at Agner Fog's instruction tables document (PDF), which lists the cost of instructions. Look up your processor in the list, then find the CMP instruction.

For mine, a Haswell, CMP has a latency of 1 cycle and a reciprocal throughput of 0.25 cycles.

Fractional costs like that deserve an explanation: Haswell has 4 integer execution units that can execute instructions simultaneously. When a program contains enough integer operations like CMP with no interdependencies, they can all execute at the same time, effectively making that part of the program 4 times faster. You can't always keep all 4 of them busy with your code; in fact that's quite rare. But in this case you keep 2 of them busy. In other words, the two comparisons take as long as one: 1 cycle.
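Those parallel integer units can be observed independently of range checks. The following is my own rough sketch, not from the answer: it compares one long dependency chain of additions against four independent chains doing the same total work. The absolute timings are machine-dependent; only the ratio between the two matters, and on a CPU with multiple integer units the four-chain loop tends to run noticeably faster per element.

```csharp
using System;
using System.Diagnostics;

class IlpDemo
{
    const int N = 100000000;   // divisible by 4

    static long OneChain()
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += i;                       // each add depends on the previous one
        return sum;
    }

    static long FourChains()
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i += 4)      // four independent dependency chains
        {
            s0 += i;
            s1 += i + 1;
            s2 += i + 2;
            s3 += i + 3;
        }
        return s0 + s1 + s2 + s3;           // same total as OneChain
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        long a = OneChain();
        sw.Stop();
        Console.WriteLine("one chain:   {0} in {1} ms", a, sw.ElapsedMilliseconds);

        sw.Restart();
        long b = FourChains();
        sw.Stop();
        Console.WriteLine("four chains: {0} in {1} ms", b, sw.ElapsedMilliseconds);
    }
}
```

Note that the JIT may itself unroll or vectorize such loops depending on runtime and version, which can blur the effect; run in Release mode without a debugger attached.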

There are other factors at play that make the execution times identical. One thing that helps is that the processor can predict the branch very well: it can speculatively evaluate x > MaxSize in spite of the short-circuit evaluation, and have the result ready ahead of time, since the branch is never taken.

And the real bottleneck in this code is the array indexing; a memory access is one of the slowest things a processor can do. So the "fast" version of the code isn't any faster, even though it gives the processor more opportunity to execute instructions concurrently. That opportunity isn't worth much today: the processor already has more execution units than typical code can keep busy, which is what the HyperThreading feature takes advantage of. Either way, both versions of the code run at the same speed.

On my machine, I have to write code that needs more than 4 execution units to make it slower. Silly code like this:

  if (x < 0 || x > MaxSize || x > 10000000 || x > 20000000 || x > 3000000) {
      outOfRange++;
  }
  else {
      inRange++;
  }

With 5 comparisons, I can now measure a difference: 61 versus 47 ms. Or, in other words, this is a way to count the number of integer execution units in a processor. Hehe :)

So this is a micro-optimization that probably paid off decades ago. It no longer does. Scratch it off your list of things to worry about :)

+13
Mar 31 '15 at 23:22

You are not comparing like with like.

The code you referred to doesn't just save one branch through the optimization; it also saves 4 bytes of CIL in a small method.

In a small method, 4 bytes can be the difference between being inlined and not being inlined.

And if the method that calls this method is also written to be small, then those two (or more) methods may end up inlined together as one piece of code.

And perhaps some of them, now that they are inlined and available to the jitter's analysis, get optimized further still.

The real difference is not between index < 0 || index >= _size and (uint)index >= (uint)_size, but between code that has been deliberately tuned to minimize method-body size and code that hasn't. Look, for example, at how a separate method is used to throw the exception when needed, shaving off another couple of CIL bytes.

(And no, that doesn't mean I think all methods should be written this way, but there certainly can be a performance difference when it's done.)
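As a rough illustration of that throw-helper pattern (the names here are hypothetical and the real List&lt;T&gt; source differs in detail), moving the exception construction into a separate non-returning helper keeps the hot method's CIL small enough to be an inlining candidate:

```csharp
using System;

class BoundedBuffer
{
    private readonly int[] _items = new int[16];
    private readonly int _size = 16;

    // Hot path: one unsigned compare for the range check, and only a call
    // instruction for the failure case, so the method body stays tiny.
    public int this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_size)
                ThrowIndexOutOfRange();     // rarely-taken cold path
            return _items[index];
        }
    }

    private static void ThrowIndexOutOfRange()
    {
        // The bulky exception-construction code lives here,
        // out of the hot method's body.
        throw new ArgumentOutOfRangeException("index");
    }
}
```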

+3
Apr 01 '15 at 0:31