SSE code optimization

I am currently developing a C module for a Java application that needs better performance (see Improving Network Coding Performance). I tried to optimize the code using SSE intrinsics, and it runs somewhat faster than the Java version (~20%). However, it is still not fast enough.

Unfortunately, my experience with optimizing C code is somewhat limited. Therefore, I would like to get some ideas on how to improve the current implementation.

The inner loop, which is a hot spot, is as follows:

    for (i = 0; i < numberOfGFVectorsInFragment; i++) {
        // Load the 4 GF-elements from the message-fragment and add the log of the coefficient to them.
        __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
        __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

        // valuesToXor is not defined in this excerpt; judging by the answers,
        // it is packed with _mm_set_epi32 from four expTable lookups.
        __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
        _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }
2 answers

Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather from memory and the _mm_set_epi32 packing operation. Internally, _mm_set_epi32 will in your case probably be implemented as a series of unpacklo/hi instructions.

Most of the "work" in this loop is packing those 4 memory accesses. Without SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
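To make the "non-vectorized, but unrolled" idea concrete, here is a minimal sketch. The function and parameter names are made up for illustration, and it assumes (this is not stated in the question) that the fragment holds 32-bit log values, that exp_table is large enough for every coeff_log + msg_logs[i], and that count is a multiple of 4. Whether it actually beats the SSE version would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar variant of the inner loop, unrolled by four.
 * Assumptions: msg_logs holds log values, exp_table covers every
 * coeff_log + msg_logs[i], and count is a multiple of 4. */
void encode_fragment_scalar(uint32_t *result,
                            const uint32_t *msg_logs,
                            const uint32_t *exp_table,
                            uint32_t coeff_log,
                            size_t count)
{
    for (size_t i = 0; i < count; i += 4) {
        /* Four independent table lookups; nothing is packed into a vector. */
        uint32_t x0 = exp_table[coeff_log + msg_logs[i + 0]];
        uint32_t x1 = exp_table[coeff_log + msg_logs[i + 1]];
        uint32_t x2 = exp_table[coeff_log + msg_logs[i + 2]];
        uint32_t x3 = exp_table[coeff_log + msg_logs[i + 3]];

        /* XOR the looked-up values into the running result. */
        result[i + 0] ^= x0;
        result[i + 1] ^= x1;
        result[i + 2] ^= x2;
        result[i + 3] ^= x3;
    }
}
```

Doing the four lookups first and the four XORs afterwards keeps the loads independent of each other, which gives the CPU room to overlap them.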

If you are willing to use SSE4.1, you can try this. It may be faster; it may not be:

    // View the four 32-bit log sums as a plain int array.
    int* logSumArray = (int*)(&logSumVector);

    // Look up each value and insert it directly into its lane (SSE4.1).
    __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code a chance of performing well.

What you really need is Intel's AVX2 gather instructions. But those are still a few years down the road...
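For readers who do have AVX2 hardware, the gather-based loop might look roughly like the sketch below. It is an illustration under assumptions, not the code from the question: the function names are hypothetical, the data layout (32-bit log values, a 32-bit exp_table that covers every index) is assumed, and it relies on GCC/Clang extensions (the target attribute and __builtin_cpu_supports) to fall back to a scalar loop when AVX2 is unavailable.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical AVX2 kernel: processes `count` elements, count % 8 == 0. */
__attribute__((target("avx2")))
static void encode_fragment_avx2(uint32_t *result, const uint32_t *msg_logs,
                                 const uint32_t *exp_table, uint32_t coeff_log,
                                 size_t count)
{
    const __m256i coeff = _mm256_set1_epi32((int)coeff_log);
    for (size_t i = 0; i < count; i += 8) {
        /* Eight log sums become eight table indices. */
        __m256i logs = _mm256_loadu_si256((const __m256i *)(msg_logs + i));
        __m256i idx  = _mm256_add_epi32(logs, coeff);
        /* One gather replaces eight scalar lookups plus the packing
         * (scale 4 = sizeof one table entry). */
        __m256i vals = _mm256_i32gather_epi32((const int *)exp_table, idx, 4);
        __m256i acc  = _mm256_loadu_si256((const __m256i *)(result + i));
        _mm256_storeu_si256((__m256i *)(result + i),
                            _mm256_xor_si256(acc, vals));
    }
}
#endif

/* Dispatcher: uses the AVX2 kernel when the CPU supports it and finishes
 * (or does everything) with a plain scalar loop otherwise. */
void encode_fragment(uint32_t *result, const uint32_t *msg_logs,
                     const uint32_t *exp_table, uint32_t coeff_log,
                     size_t count)
{
    size_t done = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        done = count - (count % 8);
        encode_fragment_avx2(result, msg_logs, exp_table, coeff_log, done);
    }
#endif
    for (size_t i = done; i < count; i++)
        result[i] ^= exp_table[coeff_log + msg_logs[i]];
}
```

A hardware gather is not automatically faster than scalar loads on every microarchitecture, so this is worth benchmarking against the unrolled scalar loop before committing to it.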


Maybe try http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593/ . The functions with "region" in their names are supposedly fast. They do not seem to use any special instruction sets, but they may have been optimized in other ways...
