All about unacceptable memory access. Insufficient readiness latency for memory is greater than latency for aligned memory. To complete the experiment, add the structures Cards3 , Cards4 and so on. And let's see how the corresponding arrays are represented in memory.

Then upgrade your test.
- We will use BenchmarkDotNet (this tool will perform many benchmarks, such as warm-up, automatic selection of the number of iterations, statistics calculation, etc.).
- We will run our test for all arrays of
Cards2 .. Cards8 , not just 3 of them. - We will also check all JIT compilers for the full .NET Framework (LegacyJIT-x86, LegacyJIT-x64, RyuJIT-x64) and Mono.
Here is my environment:
Host Process Environment Information: BenchmarkDotNet.Core=v0.9.9.0 OS=Microsoft Windows NT 6.2.9200.0 Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8 Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT] CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit GC=Concurrent Workstation JitModules=clrjit-v4.6.1080.0
And my results are:
Method | Platform | Jit | Toolchain | Runtime | Median | StdDev | ------- |--------- |---------- |---------- |-------- |---------- |---------- | C2 | Host | Host | Mono | Mono | 3.9230 ns | 0.0532 ns | C3 | Host | Host | Mono | Mono | 4.8223 ns | 0.0920 ns | C4 | Host | Host | Mono | Mono | 5.9149 ns | 0.1207 ns | C5 | Host | Host | Mono | Mono | 6.3981 ns | 0.0913 ns | C6 | Host | Host | Mono | Mono | 7.1179 ns | 0.1222 ns | C7 | Host | Host | Mono | Mono | 7.6318 ns | 0.1269 ns | C8 | Host | Host | Mono | Mono | 8.4650 ns | 0.1497 ns | C2 | X64 | LegacyJit | Host | Host | 2.3515 ns | 0.0150 ns | C3 | X64 | LegacyJit | Host | Host | 4.2553 ns | 0.0700 ns | C4 | X64 | LegacyJit | Host | Host | 1.4366 ns | 0.0385 ns | C5 | X64 | LegacyJit | Host | Host | 2.3688 ns | 0.0359 ns | C6 | X64 | LegacyJit | Host | Host | 2.3684 ns | 0.0404 ns | C7 | X64 | LegacyJit | Host | Host | 3.0404 ns | 0.0664 ns | C8 | X64 | LegacyJit | Host | Host | 1.4510 ns | 0.0333 ns | C2 | X64 | RyuJit | Host | Host | 1.9281 ns | 0.0306 ns | C3 | X64 | RyuJit | Host | Host | 2.1183 ns | 0.0348 ns | C4 | X64 | RyuJit | Host | Host | 1.9395 ns | 0.0397 ns | C5 | X64 | RyuJit | Host | Host | 2.7706 ns | 0.0387 ns | C6 | X64 | RyuJit | Host | Host | 2.6471 ns | 0.0513 ns | C7 | X64 | RyuJit | Host | Host | 2.9743 ns | 0.0541 ns | C8 | X64 | RyuJit | Host | Host | 2.6280 ns | 0.1526 ns | C2 | X86 | LegacyJit | Host | Host | 3.0854 ns | 0.2172 ns | C3 | X86 | LegacyJit | Host | Host | 3.1627 ns | 0.1126 ns | C4 | X86 | LegacyJit | Host | Host | 3.0577 ns | 0.0929 ns | C5 | X86 | LegacyJit | Host | Host | 5.0957 ns | 0.1601 ns | C6 | X86 | LegacyJit | Host | Host | 6.1723 ns | 0.1177 ns | C7 | X86 | LegacyJit | Host | Host | 7.1155 ns | 0.0803 ns | C8 | X86 | LegacyJit | Host | Host | 3.7703 ns | 0.1276 ns |

Full source code:
using System; using System.Linq; using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Attributes.Exporters; using BenchmarkDotNet.Attributes.Jobs; using BenchmarkDotNet.Running; [LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob] [RPlotExporter] public class CardBenchmarks { public const int Size = 256; private float[] result, one; private Cards2[] cards2; private Cards3[] cards3; private Cards4[] cards4; private Cards5[] cards5; private Cards6[] cards6; private Cards7[] cards7; private Cards8[] cards8; [Setup] public void Setup() { result = ArrayUtils.CreateRandomFloatArray(Size); one = ArrayUtils.CreateRandomFloatArray(size: 52); var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51); var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51); cards2 = CardUtls.Create2(c0, c1); cards3 = CardUtls.Create3(c0, c1); cards4 = CardUtls.Create4(c0, c1); cards5 = CardUtls.Create5(c0, c1); cards6 = CardUtls.Create6(c0, c1); cards7 = CardUtls.Create7(c0, c1); cards8 = CardUtls.Create8(c0, c1); } [Benchmark(OperationsPerInvoke = Size)] public void C2() { for (var i = 0; i < result.Length; i++) { var c = cards2[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C3() { for (var i = 0; i < result.Length; i++) { var c = cards3[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C4() { for (var i = 0; i < result.Length; i++) { var c = cards4[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C5() { for (var i = 0; i < result.Length; i++) { var c = cards5[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C6() { for (var i = 0; i < result.Length; i++) { var c = cards6[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C7() { for (var i = 0; i < result.Length; i++) { var c = cards7[i]; result[i] -= one[c.C0] + one[c.C1]; } } [Benchmark(OperationsPerInvoke = Size)] public void C8() { for (var i = 0; i < result.Length; i++) { var c = cards8[i]; result[i] -= one[c.C0] + one[c.C1]; } } } public static class ArrayUtils { private static readonly Random Rand = new Random(137); public static float[] CreateRandomFloatArray(int size) { var result = new float[size]; for (int i = 0; i < size; i++) result[i] = (float) Rand.NextDouble(); return result; } public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive) { var result = new byte[size]; for (int i = 0; i < size; i++) result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1); return result; } } public static class CardUtls { private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray(); public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]}); public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]}); public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]}); public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]}); public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]}); public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]}); public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]}); } public struct Cards2 { public byte C0, C1; } public struct Cards3 { public byte C0, C1, C2; } public struct Cards4 { public byte C0, C1, C2, C3; } public struct Cards5 { public byte C0, C1, C2, C3, C4; } public struct Cards6 { public byte C0, C1, C2, C3, C4, C5; } public struct Cards7 { public byte C0, C1, C2, C3, C4, C5, C6; } public struct Cards8 { public byte C0, C1, C2, C3, C4, C5, C6, C7; } class Program { static void Main() { BenchmarkRunner.Run<CardBenchmarks>(); } }
Answer
As you can see, your benchmark is heavy, there are many different factors that affect your performance. One of the most important things is how your runtime breaks your data. For example, you will not observe the described behavior in Mono, because Mono and Full Framework have different layout algorithms (in Mono we have Marshal.SizeOf(typeof(Card2)) == 8 ).
We have Time(Card5) > Time(Card8) in full-screen mode, because Card5 produces many unsettled reads unlike Card8 (see the first image).
Update
Question from the comment :
Any idea why 3 bytes works better than 8 bytes on your checkmark RyuJIT64?
Take a look at the asm code:
; RyuJIT-x64, C3 var c = cards3[i]; 00007FFEDF0CADCE mov r10,r9 00007FFEDF0CADD1 mov r11d,dword ptr [r10+8] 00007FFEDF0CADD5 cmp eax,r11d 00007FFEDF0CADD8 jae 00007FFEDF0CAE44 00007FFEDF0CADDA movsxd r11,eax 00007FFEDF0CADDD imul r11,r11,3 00007FFEDF0CADE1 lea r10,[r10+r11+10h] 00007FFEDF0CADE6 movzx r11d,byte ptr [r10] ; !!! 00007FFEDF0CADEA movzx r10d,byte ptr [r10+1] ; !!! ; RyuJIT-x64, C8 var c = cards8[i]; 00007FFEDF0CAE8C mov rdx,qword ptr [rcx+48h] 00007FFEDF0CAE90 mov r8d,dword ptr [rdx+8] 00007FFEDF0CAE94 cmp eax,r8d 00007FFEDF0CAE97 jae 00007FFEDF0CAF0C 00007FFEDF0CAE99 movsxd r8,eax 00007FFEDF0CAE9C mov rdx,qword ptr [rdx+r8*8+10h] ; !!! 00007FFEDF0CAEA1 mov qword ptr [rsp+28h],rdx ; !!!
In the case of C3 RyuJIT stores the target bytes in the register r11d , r10d ; in the case of C8 RyuJIT stores them on the stack ( qword ptr [rsp+28h] ). Explanation: the current version of RyuJIT (v4.6.1080.0 in my case) performs scalar replacement for structures with no more than four fields (see coreclr # 6839 ). Thus, RyuJIT C5 , C6 , C7 and C8 performance is worse than C2 , C3 , C4 performance. Be careful: this behavior may be changed in future versions of RyuJIT.