I just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer would let us increment the loop-counter vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)); . Aki Suihkonen's answer has some transformations that should let the compiler optimize a lot better. Most importantly, an auto-vectorizer should be able to do more even without -ffast-math ; in fact it does quite well. You could do better with intrinsics, though, especially by saving some 32-bit vector multiplies and shortening the dependency chain.
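As a sketch of that all-integer counter idea (the helper name and loop bounds below are mine, not from the question): keep the running bin indices in an integer vector, bump it with _mm_add_epi32, and convert to float only at the point where a float value is actually needed.

```cpp
#include <immintrin.h>

// Hypothetical helper: write float copies of the bin indices 0..n-1 to out[].
// The counter stays integer; one int->float conversion per 4 elements.
static void store_float_indices(float *out, int n)   // n assumed a multiple of 4
{
    __m128i ivec = _mm_setr_epi32(0, 1, 2, 3);       // indices of the first 4 bins
    for (int i = 0; i < n; i += 4) {
        _mm_storeu_ps(out + i, _mm_cvtepi32_ps(ivec));    // cvtdq2ps
        ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4));    // next 4 indices
    }
}
```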
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all three loops into one, using the algorithm @Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/ . A couple of the techniques in his Optimizing Assembly guide will speed up your SSE attempt (which I edited for you in this question):
- combine your loops where possible, so you do more with the data each time it's loaded / stored.
- use multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add has 3-cycle latency on recent Intel CPUs.) This won't apply for really short arrays like your case, though.
- instead of converting int->float every iteration, use a float loop counter as well as the int loop counter. (Add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops when possible: it takes several instructions (especially when each arg of a _mm_setr has to be calculated separately).
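A minimal sketch of that float-loop-counter point (the function name and signature are mine): keep the vector of bin midpoints live across iterations and advance it by a loop-invariant step, so no int->float conversion and no variable-argument _mm_set sits inside the loop.

```cpp
#include <immintrin.h>

// Hypothetical helper: sum of hist[i] * (i*binSize + binOffset) over n bins.
// n is assumed a multiple of 4; the midpoint vector is carried across
// iterations and stepped by a constant 4*binSize.
static float weighted_sum(const float *hist, int n, float binSize, float binOffset)
{
    __m128 mid = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    mid = _mm_add_ps(_mm_mul_ps(mid, _mm_set1_ps(binSize)), _mm_set1_ps(binOffset));
    const __m128 step = _mm_set1_ps(4.0f * binSize);   // loop-invariant, hoisted
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + i), mid));
        mid = _mm_add_ps(mid, step);                   // advance 4 bins: one FP add
    }
    float t[4]; _mm_storeu_ps(t, acc);                 // horizontal sum at the end
    return t[0] + t[1] + t[2] + t[3];
}
```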
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies, e.g. summing the array in the 4 vector elements and only combining the 4 accumulators at the end.
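To illustrate what -ffast-math lets the compiler do (and what you would write by hand with intrinsics), here is a sketch with names of my own choosing: independent accumulators that are only combined after the loop, so consecutive adds don't all wait on one 3-cycle loop-carried FP-add chain.

```cpp
#include <immintrin.h>

// Hypothetical helper: sum n floats (n assumed a multiple of 8) with two
// accumulator vectors; each vector lane is itself an independent partial sum.
static float sum_two_accs(const float *a, int n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(a + i));      // two independent
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(a + i + 4));  // dependency chains
    }
    __m128 acc = _mm_add_ps(acc0, acc1);                   // combine at the end
    float t[4]; _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}
```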
Telling gcc that the input pointer is aligned to 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
```c++
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
    numBins += 3;
    numBins &= ~3;  // round up to multiple of 4.  This is just a quick hack to make the code fast and simple.
    histVec = (float*)__builtin_assume_aligned(histVec, 16);

    float invSum = 1.0f / float(sum);
    float var = 0, fmean = 0;
    for (int i = 0; i < numBins; ++i)
    {
        histVec[i] *= invSum;
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        fmean += f * midPoint;
    }

    for (int i = 0; i < numBins; ++i)
    {
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        float diff = midPoint - fmean;
        // var += f * hwk::sqr(diff);
        var += f * (diff * diff);
    }

    *variance_p = var;
    return fmean;
}
```
gcc is generating some weird code for the second loop.
Instead of just jumping back to the top every iteration, gcc jumps ahead to copy a register, and then unconditionally jmps back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it wasn't copying xmm5->xmm3 and then xmm3->xmm5 every iteration, because that's silly. It should have the conditional jump go straight to the top of the loop.
Also note the technique gcc uses to get a float version of the loop counter: start with an integer vector of 1 2 3 4 and add set1_epi32(4) each iteration, using that as input to the packed int->float conversion cvtdq2ps . On Intel hardware, that instruction runs on the FP-add port and has 3-cycle latency, same as packed FP add. gcc would probably have done better to just add a vector of set1_ps(4.0f) , even though that creates a 3-cycle loop-carried dependency chain, instead of a 1-cycle vector int add with a 3-cycle convert forked off it every iteration.
**Few iterations**
You say this will often be used on exactly 10 bins? A version specialized for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With a problem size that small, you can have the FP weights just sitting there in memory, instead of redoing the int->float conversion every time.
Also, 10 bins means a lot of horizontal operations relative to the number of vertical operations, since you only have two and a half vectors' worth of data.
If exactly 10 is really common, specialize a version for that. If fewer than 16 bins is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You'll probably want to use intrinsics for the specialized small-problem versions, rather than relying on auto-vectorization.
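A hedged sketch of what a specialized 10-bin weighted mean might look like (the weights table and the helper are my invention, and I assume the histogram buffer is padded to 12 floats so the last full-vector load is legal, per the zero-padding idea below):

```cpp
#include <immintrin.h>

// Bin midpoints for binSize=1, binOffset=0, padded with zeros so the third
// vector load contributes nothing from the padding lanes.
static const float kWeights[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0};

// hist must point to 10 normalized bin values with 12 floats allocated.
static float mean10(const float *hist)
{
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(hist),     _mm_loadu_ps(kWeights));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + 4), _mm_loadu_ps(kWeights + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + 8), _mm_loadu_ps(kWeights + 8)));
    float t[4]; _mm_storeu_ps(t, acc);   // one horizontal sum for the whole job
    return t[0] + t[1] + t[2] + t[3];
}
```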
Having zero padding after the end of the useful data in your array might still be a good idea in your specialized version. However, you can load the last 2 floats and zero the upper 64b of a vector register with the movq instruction ( __m128i _mm_cvtsi64_si128 (__int64 a) ). Cast it to __m128 and you're good to go.
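A sketch of that movq load, here via _mm_loadl_epi64 (the memory-operand form of the same movq idea, which avoids going through a 64-bit integer register):

```cpp
#include <immintrin.h>

// Load 2 floats into the low 64 bits of a vector; movq zeros the upper 64 bits,
// so stray data past the end of the array can't contaminate a horizontal sum.
static __m128 load2_zero_high(const float *p)
{
    __m128i v = _mm_loadl_epi64((const __m128i*)p);  // movq: low 8 bytes, high zeroed
    return _mm_castsi128_ps(v);                      // free reinterpret, no instruction
}
```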