Using SIMD operations with C# on .NET Framework 4.6 is slower

I'm currently trying to calculate the sum of all the values in a huge array, once in plain C# and once using SIMD, to compare performance, and the SIMD version is much slower. Please see the code snippets below and let me know if I'm missing something. vals is a huge array read from an image file; I've omitted that part for brevity.

    var watch1 = new Stopwatch();
    watch1.Start();
    var total = vals.Aggregate(0, (a, i) => a + i);
    watch1.Stop();
    Console.WriteLine(string.Format("Total is: {0}", total));
    Console.WriteLine(string.Format("Time taken: {0}", watch1.ElapsedMilliseconds));

    var watch2 = new Stopwatch();
    watch2.Start();
    var sTotal = GetSIMDVectors(vals).Aggregate((a, i) => a + i);
    int sum = 0;
    for (int i = 0; i < Vector<int>.Count; i++)
        sum += sTotal[i];
    watch2.Stop();
    Console.WriteLine(string.Format("Another Total is: {0}", sum));
    Console.WriteLine(string.Format("Time taken: {0}", watch2.ElapsedMilliseconds));

and the GetSIMDVectors method:

    private static IEnumerable<Vector<int>> GetSIMDVectors(short[] source)
    {
        int vecCount = Vector<int>.Count;
        int i = 0;
        int len = source.Length;
        for (i = 0; i + vecCount < len; i = i + vecCount)
        {
            var items = new int[vecCount];
            for (int k = 0; k < vecCount; k++)
            {
                items[k] = source[i + k];
            }
            yield return new Vector<int>(items);
        }
        var remaining = new int[vecCount];
        for (int j = i, k = 0; j < len; j++, k++)
        {
            remaining[k] = source[j];
        }
        yield return new Vector<int>(remaining);
    }
1 answer

As @mike z pointed out, you need to make sure that you build in Release mode and target 64-bit; otherwise RyuJIT, the compiler that supports SIMD, will not kick in (so far it only supports the 64-bit architecture). In addition, it is always good practice to check before running:

 Vector.IsHardwareAccelerated; 
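A quick way to verify this is a small sketch that reports whether RyuJIT actually enabled SIMD for the current process (the class and method names here are just illustrative):

```csharp
using System;
using System.Numerics;

static class SimdCheck
{
    static void Main()
    {
        // False typically means a 32-bit target, a Debug build,
        // or the legacy JIT instead of RyuJIT.
        Console.WriteLine("SIMD accelerated: " + Vector.IsHardwareAccelerated);

        // How many ints fit in one hardware vector (e.g. 4 with SSE2, 8 with AVX2).
        Console.WriteLine("Vector<int>.Count: " + Vector<int>.Count);
    }
}
```

If this prints `False`, the Vector<T> code still runs, but every operation is emulated in software and will be slower than a plain loop.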

Also, you do not need a for loop to fill a temporary array before creating the vector. You can create a vector directly from the source array using the Vector<int>(int[] values, int index) constructor:

 yield return new Vector<int>(source, i); 

instead of

    var items = new int[vecCount];
    for (int k = 0; k < vecCount; k++)
    {
        items[k] = source[i + k];
    }
    yield return new Vector<int>(items);

With this change alone, I got an almost 3.7x performance increase on a randomly generated large array.
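Putting that change together, the simplified enumerator might look like the sketch below. One caveat (my note, not from the original answer): the Vector<int>(int[], int) constructor cannot widen a short[] on the fly, so this version assumes the data has already been converted to int[]:

```csharp
using System.Collections.Generic;
using System.Numerics;

static class VectorHelpers
{
    // Sketch: yields one Vector<int> per full block of the source array,
    // plus a zero-padded vector for any leftover elements.
    public static IEnumerable<Vector<int>> GetSIMDVectors(int[] source)
    {
        int vecCount = Vector<int>.Count;
        int i = 0;
        for (; i + vecCount <= source.Length; i += vecCount)
            yield return new Vector<int>(source, i);   // no temp array, no copy loop

        if (i < source.Length)                          // zero-padded remainder
        {
            var remaining = new int[vecCount];
            for (int k = 0; i < source.Length; i++, k++)
                remaining[k] = source[i];
            yield return new Vector<int>(remaining);
        }
    }
}
```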

Furthermore, if you change your method to one that accumulates the sum directly as it creates each new Vector<int>(source, i), for example:

    private static int GetSIMDVectorsSum(int[] source)
    {
        int vecCount = Vector<int>.Count;
        int i = 0;
        int end_state = source.Length;
        Vector<int> temp = Vector<int>.Zero;
        for (; i < end_state; i += vecCount)
        {
            temp += new Vector<int>(source, i);
        }
        return Vector.Dot<int>(temp, Vector<int>.One);
    }
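One caveat (my addition, not part of the original answer): the loop above reads past the end of the array unless source.Length is an exact multiple of Vector<int>.Count. A hedged sketch that also handles the leftover elements with a scalar tail loop:

```csharp
using System.Numerics;

static class SimdSum
{
    // Variant of GetSIMDVectorsSum that tolerates lengths that are not a
    // multiple of Vector<int>.Count.
    public static int Sum(int[] source)
    {
        int vecCount = Vector<int>.Count;
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i + vecCount <= source.Length; i += vecCount)
            acc += new Vector<int>(source, i);      // vectorized accumulation

        int total = Vector.Dot(acc, Vector<int>.One); // horizontal add of the lanes
        for (; i < source.Length; i++)
            total += source[i];                       // scalar tail
        return total;
    }
}
```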

then the gain is even larger: I managed to get a 16x speedup over vals.Aggregate(0, (a, i) => a + i) in my tests.

However, from a theoretical point of view, if Vector<int>.Count returns, say, 4, then anything above a 4x speedup indicates that you are comparing the vector version against relatively unoptimized code.

That is exactly what vals.Aggregate(0, (a, i) => a + i) is in your case: LINQ's Aggregate invokes a delegate for every element, so there is plenty of room for optimization there.

When I replace it with a trivial loop:

    private static int no_vec_sum(int[] vals)
    {
        int end = vals.Length;
        int temp = 0;
        for (int i = 0; i < end; i++)
        {
            temp += vals[i];
        }
        return temp;
    }

I get only a 1.5x speedup from the vectorized version. Still, that is an improvement in this particular case, given how simple the operation is.

Needless to say, large arrays are required for the vectorized version to overcome the overhead of creating a new Vector<int> on each iteration.

