Access to a 5-byte structure is much slower than 8-byte

I have code that updates an array based on values ​​in another, small array.

for (var i = 0; i < result.Length; i++) { var c = cards[i]; result[i] -= one[c.C0] + one[c.C1]; } 

Where c is a structure representing a pair of bytes representing cards from the deck. one - size of the array 52 (with entries for each of 52 cards from the deck)

I wrote a test to comment on this code:

 private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards) { for (var r = 0; r < testRepetitions; r++) for (var i = 0; i < result.Length; i++) { var c = cards[i]; result[i] -= one[c.C0] + one[c.C1]; } } 

Setting testRepetitions = 25 million, and using an array of 256 elements ( result.Length = 256 ), it runs for about 8.5 seconds on my machine.

Here is the Cards struct:

 struct Cards { public byte C0; public byte C1; public Cards(byte c0, byte c1) { C0 = c0; C1 = c1; } } 

When I change this structure to save 5 cards (5 bytes), the same figure now takes ~ 13 seconds. Why is this going to happen? The calculation is the same, the remaining 3 cards are not used, and all arrays are small enough to enter the L1 cache.

What is even stranger is that if I changed the Maps even more, now they will contain 8 bytes, now this speed will be faster and take ~ 10 seconds.

My setup:

 VS 2015 Update 3. .NET 4.6.2 Release Build x64 CPU: Haswell i7-5820K CPU @ 3.30GHz 

Here are the exact timings I received:

 Test With 2 Cards. Time = 8582 ms Test With 5 Cards. Time = 12910 ms Test With 8 Cards. Time = 10180 ms 

What's going on here?

Security Code:

 class TestAdjustment { public void Test() { using (Process p = Process.GetCurrentProcess()) p.PriorityClass = ProcessPriorityClass.High; var size = 256; float[] one = ArrayUtils.CreateRandomFloatArray(size:52); int[] card0 = ArrayUtils.RandomIntArray(size, minValue:0, maxValueInclusive:51); int[] card1 = ArrayUtils.RandomIntArray(size, minValue: 0, maxValueInclusive: 51); Cards[] cards = CreateCardsArray(card0, card1); Cards5[] cards5 = CreateCards5Array(card0, card1); Cards8[] cards8 = CreateCards8Array(card0, card1); float[] result = ArrayUtils.CreateRandomFloatArray(size); float[] resultClone = result.ToArray(); var testRepetitions = 25*1000*1000; var sw = Stopwatch.StartNew(); TestCards2(testRepetitions, result, one, cards); WriteLine($"Test With 2 Cards. Time = {sw.ElapsedMilliseconds} ms"); result = resultClone.ToArray(); //restore original array from the clone, so that next method works on the same data sw.Restart(); TestCards5(testRepetitions, result, one, cards5); WriteLine($"Test With 5 Cards. Time = {sw.ElapsedMilliseconds} ms"); result = resultClone.ToArray(); sw.Restart(); TestCards8(testRepetitions, result, one, cards8); WriteLine($"Test With 8 Cards. Time = {sw.ElapsedMilliseconds} ms"); } private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards) { for (var r = 0; r < testRepetitions; r++) for (var i = 0; i < result.Length; i++) { var c = cards[i]; result[i] -= one[c.C0] + one[c.C1]; } } private void TestCards5(int testRepetitions, float[] result, float[] one, Cards5[] cards) { for (var r = 0; r < testRepetitions; r++) for (var i = 0; i < result.Length; i++) { var c = cards[i]; result[i] -= one[c.C0] + one[c.C1]; } } private void TestCards8(int testRepetitions, float[] result, float[] one, Cards8[] cards) { for (var r = 0; r < testRepetitions; r++) for (var i = 0; i < result.Length; i++) { var c = cards[i]; result[i] -= one[c.C0] + one[c.C1]; } } private Cards[] CreateCardsArray(int[] c0, int[] c1) { var result = new Cards[c0.Length]; for (var i = 0; i < result.Length; i++) result[i] = new Cards((byte)c0[i], (byte)c1[i]); return result; } private Cards5[] CreateCards5Array(int[] c0, int[] c1) { var result = new Cards5[c0.Length]; for (var i = 0; i < result.Length; i++) result[i] = new Cards5((byte)c0[i], (byte)c1[i]); return result; } private Cards8[] CreateCards8Array(int[] c0, int[] c1) { var result = new Cards8[c0.Length]; for (var i = 0; i < result.Length; i++) result[i] = new Cards8((byte)c0[i], (byte)c1[i]); return result; } } struct Cards { public byte C0; public byte C1; public Cards(byte c0, byte c1) { C0 = c0; C1 = c1; } } struct Cards5 { public byte C0, C1, C2, C3, C4; public Cards5(byte c0, byte c1) { C0 = c0; C1 = c1; C2 = C3 = C4 = 0; } } struct Cards8 { public byte C0, C1, C2, C3, C4, C5, C6, C7; public Cards8(byte c0, byte c1) { C0 = c0; C1 = c1; C2 = C3 = C4 = C5 = C6 = C7 = 0; } } 

Edit I ran the test again, with 100 million iterations. Here are the results:

 Test With 5 Cards. Time = 52245 ms Test With 8 Cards. Time = 40531 ms 

And in reverse order:

 Test With 8 Cards. Time = 41041 ms Test With 5 Cards. Time = 52034 ms 

Launch on Surface Pro 4 (Skylake i7-6650U Turbo-boosted up to ~ 3.4gz):

 Test With 8 Cards. Time = 47913 ms Test With 5 Cards. Time = 55182 ms 

Thus, the difference remains and does not depend on the order.

I also used profiling using Intel VTune, and it shows CPI 0.3 for the "5 cards" version and 0.27 for the "8 cards".

Edit2 Added ArrayUtils class to create original random arrays.

  public static class ArrayUtils { static Random rand = new Random(137); public static float[] CreateRandomFloatArray(int size) { var result = new float[size]; for (int i = 0; i < size; i++) result[i] = (float) rand.NextDouble(); return result; } public static int[] RandomIntArray(int size, int minValue, int maxValueInclusive) { var result = new int[size]; for (int i = 0; i < size; i++) result[i] = rand.Next(minValue, maxValueInclusive + 1); return result; } } 
+5
source share
2 answers

All about unacceptable memory access. Insufficient readiness latency for memory is greater than latency for aligned memory. To complete the experiment, add the structures Cards3 , Cards4 and so on. And let's see how the corresponding arrays are represented in memory.

enter image description here

Then upgrade your test.

  • We will use BenchmarkDotNet (this tool will perform many benchmarks, such as warm-up, automatic selection of the number of iterations, statistics calculation, etc.).
  • We will run our test for all arrays of Cards2 .. Cards8 , not just 3 of them.
  • We will also check all JIT compilers for the full .NET Framework (LegacyJIT-x86, LegacyJIT-x64, RyuJIT-x64) and Mono.

Here is my environment:

 Host Process Environment Information: BenchmarkDotNet.Core=v0.9.9.0 OS=Microsoft Windows NT 6.2.9200.0 Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8 Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT] CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit GC=Concurrent Workstation JitModules=clrjit-v4.6.1080.0 

And my results are:

  Method | Platform | Jit | Toolchain | Runtime | Median | StdDev | ------- |--------- |---------- |---------- |-------- |---------- |---------- | C2 | Host | Host | Mono | Mono | 3.9230 ns | 0.0532 ns | C3 | Host | Host | Mono | Mono | 4.8223 ns | 0.0920 ns | C4 | Host | Host | Mono | Mono | 5.9149 ns | 0.1207 ns | C5 | Host | Host | Mono | Mono | 6.3981 ns | 0.0913 ns | C6 | Host | Host | Mono | Mono | 7.1179 ns | 0.1222 ns | C7 | Host | Host | Mono | Mono | 7.6318 ns | 0.1269 ns | C8 | Host | Host | Mono | Mono | 8.4650 ns | 0.1497 ns | C2 | X64 | LegacyJit | Host | Host | 2.3515 ns | 0.0150 ns | C3 | X64 | LegacyJit | Host | Host | 4.2553 ns | 0.0700 ns | C4 | X64 | LegacyJit | Host | Host | 1.4366 ns | 0.0385 ns | C5 | X64 | LegacyJit | Host | Host | 2.3688 ns | 0.0359 ns | C6 | X64 | LegacyJit | Host | Host | 2.3684 ns | 0.0404 ns | C7 | X64 | LegacyJit | Host | Host | 3.0404 ns | 0.0664 ns | C8 | X64 | LegacyJit | Host | Host | 1.4510 ns | 0.0333 ns | C2 | X64 | RyuJit | Host | Host | 1.9281 ns | 0.0306 ns | C3 | X64 | RyuJit | Host | Host | 2.1183 ns | 0.0348 ns | C4 | X64 | RyuJit | Host | Host | 1.9395 ns | 0.0397 ns | C5 | X64 | RyuJit | Host | Host | 2.7706 ns | 0.0387 ns | C6 | X64 | RyuJit | Host | Host | 2.6471 ns | 0.0513 ns | C7 | X64 | RyuJit | Host | Host | 2.9743 ns | 0.0541 ns | C8 | X64 | RyuJit | Host | Host | 2.6280 ns | 0.1526 ns | C2 | X86 | LegacyJit | Host | Host | 3.0854 ns | 0.2172 ns | C3 | X86 | LegacyJit | Host | Host | 3.1627 ns | 0.1126 ns | C4 | X86 | LegacyJit | Host | Host | 3.0577 ns | 0.0929 ns | C5 | X86 | LegacyJit | Host | Host | 5.0957 ns | 0.1601 ns | C6 | X86 | LegacyJit | Host | Host | 6.1723 ns | 0.1177 ns | C7 | X86 | LegacyJit | Host | Host | 7.1155 ns | 0.0803 ns | C8 | X86 | LegacyJit | Host | Host | 3.7703 ns | 0.1276 ns | 

enter image description here

Full source code:

 using System; using System.Linq; using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Attributes.Exporters; using BenchmarkDotNet.Attributes.Jobs; using BenchmarkDotNet.Running; [LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob] [RPlotExporter] public class CardBenchmarks { public const int Size = 256; private float[] result, one; private Cards2[] cards2; private Cards3[] cards3; private Cards4[] cards4; private Cards5[] cards5; private Cards6[] cards6; private Cards7[] cards7; private Cards8[] cards8; [Setup] public void Setup() { result = ArrayUtils.CreateRandomFloatArray(Size); one = ArrayUtils.CreateRandomFloatArray(size: 52); var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51); var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51); cards2 = CardUtls.Create2(c0, c1); cards3 = CardUtls.Create3(c0, c1); cards4 = CardUtls.Create4(c0, c1); cards5 = CardUtls.Create5(c0, c1); cards6 = CardUtls.Create6(c0, c1); cards7 = CardUtls.Create7(c0, c1); cards8 = CardUtls.Create8(c0, c1); } [Benchmark(OperationsPerInvoke = Size)] public void C2() { for (var i = 0; i < result.Length; i++) { var c = cards2[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C3() { for (var i = 0; i < result.Length; i++) { var c = cards3[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C4() { for (var i = 0; i < result.Length; i++) { var c = cards4[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C5() { for (var i = 0; i < result.Length; i++) { var c = cards5[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C6() { for (var i = 0; i < result.Length; i++) { var c = cards6[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C7() { for (var i = 0; i < result.Length; i++) { var c = cards7[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C8() { for (var i = 0; i < result.Length; i++) { var c = cards8[i]; result[i] -= one[c.C0] + one[c.C1]; } } } public static class ArrayUtils { private static readonly Random Rand = new Random(137); public static float[] CreateRandomFloatArray(int size) { var result = new float[size]; for (int i = 0; i < size; i++) result[i] = (float) Rand.NextDouble(); return result; } public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive) { var result = new byte[size]; for (int i = 0; i < size; i++) result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1); return result; } } public static class CardUtls { private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray(); public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]}); public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]}); public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]}); public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]}); public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]}); public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]}); public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]}); } public struct Cards2 { public byte C0, C1; } public struct Cards3 { public byte C0, C1, C2; } public struct Cards4 { public byte C0, C1, C2, C3; } public struct Cards5 { public byte C0, C1, C2, C3, C4; } public struct Cards6 { public byte C0, C1, C2, C3, C4, C5; } public struct Cards7 { public byte C0, C1, C2, C3, C4, C5, C6; } public struct Cards8 { public byte C0, C1, C2, C3, C4, C5, C6, C7; } class Program { static void Main() { BenchmarkRunner.Run<CardBenchmarks>(); } } 

Answer

As you can see, your benchmark is heavy, there are many different factors that affect your performance. One of the most important things is how your runtime breaks your data. For example, you will not observe the described behavior in Mono, because Mono and Full Framework have different layout algorithms (in Mono we have Marshal.SizeOf(typeof(Card2)) == 8 ).

We have Time(Card5) > Time(Card8) in full-screen mode, because Card5 produces many unsettled reads unlike Card8 (see the first image).

Update

Question from the comment :

Any idea why 3 bytes works better than 8 bytes on your checkmark RyuJIT64?

Take a look at the asm code:

 ; RyuJIT-x64, C3 var c = cards3[i]; 00007FFEDF0CADCE mov r10,r9 00007FFEDF0CADD1 mov r11d,dword ptr [r10+8] 00007FFEDF0CADD5 cmp eax,r11d 00007FFEDF0CADD8 jae 00007FFEDF0CAE44 00007FFEDF0CADDA movsxd r11,eax 00007FFEDF0CADDD imul r11,r11,3 00007FFEDF0CADE1 lea r10,[r10+r11+10h] 00007FFEDF0CADE6 movzx r11d,byte ptr [r10] ; !!! 00007FFEDF0CADEA movzx r10d,byte ptr [r10+1] ; !!! ; RyuJIT-x64, C8 var c = cards8[i]; 00007FFEDF0CAE8C mov rdx,qword ptr [rcx+48h] 00007FFEDF0CAE90 mov r8d,dword ptr [rdx+8] 00007FFEDF0CAE94 cmp eax,r8d 00007FFEDF0CAE97 jae 00007FFEDF0CAF0C 00007FFEDF0CAE99 movsxd r8,eax 00007FFEDF0CAE9C mov rdx,qword ptr [rdx+r8*8+10h] ; !!! 00007FFEDF0CAEA1 mov qword ptr [rsp+28h],rdx ; !!! 

In the case of C3 RyuJIT stores the target bytes in the register r11d , r10d ; in the case of C8 RyuJIT stores them on the stack ( qword ptr [rsp+28h] ). Explanation: the current version of RyuJIT (v4.6.1080.0 in my case) performs scalar replacement for structures with no more than four fields (see coreclr # 6839 ). Thus, RyuJIT C5 , C6 , C7 and C8 performance is worse than C2 , C3 , C4 performance. Be careful: this behavior may be changed in future versions of RyuJIT.

+19
source

My guess is that this is due to memory alignment.

Technical information:

Some architectures, such as the MIPS architecture, cannot actually change only one byte at a time in memory. they must load the data word into the register, mask the irrelevant bits and perform the calculations.

In fact, you can speed things up by using a regular int instead of bytes, since it generally avoids this problem.

+1
source

All Articles