I just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer would let us increment the loop-counter vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)); . Aki Suihkonen's answer has some transformations that should let the compiler optimize a lot better. Most importantly, an auto-vectorizer should be able to do more even without -ffast-math ; in fact it does quite well. You could do better with intrinsics, though, especially by saving some 32-bit vector multiplies and shortening the dependency chain.
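As a sketch of that all-integer counter idea (the helper name and loop bounds below are mine, not from the question): keep the running bin indices in an integer vector, bump it with _mm_add_epi32, and convert to float only at the point where a float value is actually needed.

```cpp
#include <immintrin.h>

// Hypothetical helper: write float copies of the bin indices 0..n-1 to out[].
// The counter stays integer; one int->float conversion per 4 elements.
static void store_float_indices(float *out, int n)   // n assumed a multiple of 4
{
    __m128i ivec = _mm_setr_epi32(0, 1, 2, 3);       // indices of the first 4 bins
    for (int i = 0; i < n; i += 4) {
        _mm_storeu_ps(out + i, _mm_cvtepi32_ps(ivec));    // cvtdq2ps
        ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4));    // next 4 indices
    }
}
```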
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all three loops into one, using the algorithm @Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/ . A couple of the techniques in his Optimizing Assembly guide will speed up your SSE attempt (which I edited for you in this question):
- combine your loops where possible, so you do more with the data each time it's loaded / stored.
- use multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add has 3-cycle latency on recent Intel CPUs.) This won't apply for really short arrays like your case, though.
- instead of converting int->float every iteration, use a float loop counter as well as the int loop counter. (Add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops when possible: it takes several instructions (especially when each arg of a _mm_setr has to be calculated separately).
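A minimal sketch of that float-loop-counter point (the function name and signature are mine): keep the vector of bin midpoints live across iterations and advance it by a loop-invariant step, so no int->float conversion and no variable-argument _mm_set sits inside the loop.

```cpp
#include <immintrin.h>

// Hypothetical helper: sum of hist[i] * (i*binSize + binOffset) over n bins.
// n is assumed a multiple of 4; the midpoint vector is carried across
// iterations and stepped by a constant 4*binSize.
static float weighted_sum(const float *hist, int n, float binSize, float binOffset)
{
    __m128 mid = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    mid = _mm_add_ps(_mm_mul_ps(mid, _mm_set1_ps(binSize)), _mm_set1_ps(binOffset));
    const __m128 step = _mm_set1_ps(4.0f * binSize);   // loop-invariant, hoisted
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + i), mid));
        mid = _mm_add_ps(mid, step);                   // advance 4 bins: one FP add
    }
    float t[4]; _mm_storeu_ps(t, acc);                 // horizontal sum at the end
    return t[0] + t[1] + t[2] + t[3];
}
```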
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies, e.g. summing the array in the 4 vector elements and only combining the 4 accumulators at the end.
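To illustrate what -ffast-math lets the compiler do (and what you would write by hand with intrinsics), here is a sketch with names of my own choosing: independent accumulators that are only combined after the loop, so consecutive adds don't all wait on one 3-cycle loop-carried FP-add chain.

```cpp
#include <immintrin.h>

// Hypothetical helper: sum n floats (n assumed a multiple of 8) with two
// accumulator vectors; each vector lane is itself an independent partial sum.
static float sum_two_accs(const float *a, int n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(a + i));      // two independent
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(a + i + 4));  // dependency chains
    }
    __m128 acc = _mm_add_ps(acc0, acc1);                   // combine at the end
    float t[4]; _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}
```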
Telling gcc that the input pointer is aligned to 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
```c++
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
    numBins += 3;
    numBins &= ~3;  // round up to multiple of 4.  This is just a quick hack to make the code fast and simple.
    histVec = (float*)__builtin_assume_aligned(histVec, 16);

    float invSum = 1.0f / float(sum);
    float var = 0, fmean = 0;
    for (int i = 0; i < numBins; ++i)
    {
        histVec[i] *= invSum;
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        fmean += f * midPoint;
    }

    for (int i = 0; i < numBins; ++i)
    {
        float midPoint = (float)i*binSize + binOffset;
        float f = histVec[i];
        float diff = midPoint - fmean;
        // var += f * hwk::sqr(diff);
        var += f * (diff * diff);
    }

    *variance_p = var;
    return fmean;
}
```
gcc is generating some weird code for the second loop.
Instead of just jumping back to the top every iteration, gcc jumps ahead to copy a register, and then unconditionally jmps back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it wasn't copying xmm5->xmm3 and then xmm3->xmm5 every iteration, because that's silly. It should have the conditional jump go straight to the top of the loop.
Also note the technique gcc uses to get a float version of the loop counter: start with an integer vector of 1 2 3 4 and add set1_epi32(4) each iteration, using that as input to the packed int->float conversion cvtdq2ps . On Intel hardware, that instruction runs on the FP-add port and has 3-cycle latency, same as packed FP add. gcc would probably have done better to just add a vector of set1_ps(4.0f) , even though that creates a 3-cycle loop-carried dependency chain, instead of a 1-cycle vector int add with a 3-cycle convert forked off it every iteration.
**Few iterations**
You say this will often be used on exactly 10 bins? A version specialized for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With a problem size that small, you can have the FP weights just sitting there in memory, instead of redoing the int->float conversion every time.
Also, 10 bins means a lot of horizontal operations relative to the number of vertical operations, since you only have two and a half vectors' worth of data.
If exactly 10 is really common, specialize a version for that. If fewer than 16 bins is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You'll probably want to use intrinsics for the specialized small-problem versions, rather than relying on auto-vectorization.
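A hedged sketch of what a specialized 10-bin weighted mean might look like (the weights table and the helper are my invention, and I assume the histogram buffer is padded to 12 floats so the last full-vector load is legal, per the zero-padding idea below):

```cpp
#include <immintrin.h>

// Bin midpoints for binSize=1, binOffset=0, padded with zeros so the third
// vector load contributes nothing from the padding lanes.
static const float kWeights[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0};

// hist must point to 10 normalized bin values with 12 floats allocated.
static float mean10(const float *hist)
{
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(hist),     _mm_loadu_ps(kWeights));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + 4), _mm_loadu_ps(kWeights + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(hist + 8), _mm_loadu_ps(kWeights + 8)));
    float t[4]; _mm_storeu_ps(t, acc);   // one horizontal sum for the whole job
    return t[0] + t[1] + t[2] + t[3];
}
```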
Having zero padding after the end of the useful data in your array might still be a good idea in your specialized version. However, you can load the last 2 floats and zero the upper 64b of a vector register with the movq instruction ( __m128i _mm_cvtsi64_si128 (__int64 a) ). Cast it to __m128 and you're good to go.
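A sketch of that movq load, here via _mm_loadl_epi64 (the memory-operand form of the same movq idea, which avoids going through a 64-bit integer register):

```cpp
#include <immintrin.h>

// Load 2 floats into the low 64 bits of a vector; movq zeros the upper 64 bits,
// so stray data past the end of the array can't contaminate a horizontal sum.
static __m128 load2_zero_high(const float *p)
{
    __m128i v = _mm_loadl_epi64((const __m128i*)p);  // movq: low 8 bytes, high zeroed
    return _mm_castsi128_ps(v);                      // free reinterpret, no instruction
}
```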