As @mike z pointed out, you need to make sure you build in Release mode and target the 64-bit version, otherwise RyuJIT, the compiler that supports SIMD, will not kick in (so far it only supports the 64-bit architecture). In addition, it is always good practice to verify before running the vectorized code that hardware acceleration is actually available by checking:
Vector.IsHardwareAccelerated;
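For example, a minimal guard might look like this (a sketch; the Sum wrapper is illustrative, while GetSIMDVectorsSum and no_vec_sum refer to the methods shown further down):

// Illustrative guard: use the vectorized path only when SIMD is available,
// otherwise fall back to the plain scalar loop.
private static int Sum(int[] source)
{
    return Vector.IsHardwareAccelerated
        ? GetSIMDVectorsSum(source) // vectorized version, shown below
        : no_vec_sum(source);       // scalar fallback, shown below
}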
In addition, you do not need a for loop to copy the data into a temporary array before creating the vector. You can create a vector directly from the original source array using the Vector<int>(int[] array, int index) constructor:
yield return new Vector<int>(source, i);
instead of
var items = new int[vecCount];
for (int k = 0; k < vecCount; k++)
{
    items[k] = source[i + k];
}
yield return new Vector<int>(items);
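For context, the whole enumerator might then look roughly like this (a sketch; the method name GetVectors and its signature are my assumption about the original code, and it requires using System.Numerics and System.Collections.Generic):

private static IEnumerable<Vector<int>> GetVectors(int[] source)
{
    int vecCount = Vector<int>.Count;
    // Assumes source.Length is a multiple of Vector<int>.Count,
    // otherwise the last constructor call will throw.
    for (int i = 0; i < source.Length; i += vecCount)
    {
        yield return new Vector<int>(source, i);
    }
}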
With that change alone, I was able to get an almost 3.7x performance increase on an arbitrarily generated large array.
Also, if you change the method so that it accumulates the sum directly as soon as it gets each new Vector<int>(source, i) value, for example:
private static int GetSIMDVectorsSum(int[] source)
{
    int vecCount = Vector<int>.Count;
    int end_state = source.Length;
    Vector<int> temp = Vector<int>.Zero;

    // Assumes source.Length is a multiple of Vector<int>.Count.
    for (int i = 0; i < end_state; i += vecCount)
    {
        temp += new Vector<int>(source, i);
    }

    // Dot product with a vector of ones adds up the lanes of temp.
    return Vector.Dot<int>(temp, Vector<int>.One);
}
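Note that the Vector<int>(int[], int) constructor throws if fewer than Vector<int>.Count elements remain, so the version above assumes the array length is a multiple of the vector size. One way to lift that assumption (a sketch, not part of the original code) is to sum the leftover elements with a scalar loop:

private static int GetSIMDVectorsSumWithTail(int[] source)
{
    int vecCount = Vector<int>.Count;
    int lastBlock = source.Length - source.Length % vecCount;
    Vector<int> temp = Vector<int>.Zero;

    for (int i = 0; i < lastBlock; i += vecCount)
    {
        temp += new Vector<int>(source, i);
    }

    // Sum the vector lanes via a dot product with a vector of ones.
    int sum = Vector.Dot<int>(temp, Vector<int>.One);

    // Scalar loop for the tail that does not fill a full vector.
    for (int i = lastBlock; i < source.Length; i++)
    {
        sum += source[i];
    }

    return sum;
}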
then the performance gain is much larger. In my tests I got a 16x speedup over vals.Aggregate(0, (a, i) => a + i).
However, from a theoretical point of view, if Vector<int>.Count returns 4, for example, then anything above a 4x speedup indicates that you are comparing the vectorized version against relatively unoptimized code.
That is the case with vals.Aggregate(0, (a, i) => a + i) here, so there is basically plenty of room for optimization in the baseline as well.
When I replace it with a trivial loop
private static int no_vec_sum(int[] vals)
{
    int end = vals.Length;
    int temp = 0;
    for (int i = 0; i < end; i++)
    {
        temp += vals[i];
    }
    return temp;
}
I get only about a 1.5x boost. That is still an improvement in this particular case, though, given the simplicity of the operation.
Needless to say, large arrays are required for the vectorized version to overcome the overhead of creating a new Vector<int> on each iteration.