How can I vectorize an IF block using ARM Neon intrinsics?

I want to process a large array of floating-point numbers on an ARM processor, using Neon to compute them four at a time. That works well for operations such as addition and multiplication, but what if my calculation involves an IF block? Example:

    // In the non-vectorized original code, A is an array of many floating-point
    // numbers, which are calculated one at a time. Now they're packed
    // into a vector and processed four at a time
    ...calculate A...
    if (A > 10.f) {
        A = A+5.f;
    } else {
        A = A+10.f;
    }

Now, which IF branch should be executed? What if some of the values in the processed vector are greater than 10 and some are less? Is it even possible to vectorize such code?

3 answers

If-else slaloms are a nightmare for almost all processors, and especially for vector units such as NEON, which has no conditional branch of its own.

Therefore, we apply "eager execution" to this kind of problem:

  • A boolean mask is created
  • Both the if and the else blocks are computed
  • The "right" results are selected by the mask

I think it will not be a problem to convert the aarch32 code below to intrinsics.

    //aarch32
    vadd.f32    vecElse, vecA, vecTen    // vecTen contains 10.0f
    vcgt.f32    vecMask, vecA, vecTen
    vadd.f32    vecA, vecA, vecFive
    vbif        vecA, vecElse, vecMask

    //aarch64
    fadd    vecElse.4s, vecA.4s, vecTen.4s
    fcmgt   vecMask.4s, vecA.4s, vecTen.4s
    fadd    vecA.4s, vecA.4s, vecFive.4s
    bif     vecA.16b, vecElse.16b, vecMask.16b
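For readers who prefer intrinsics, the same mask-and-select sequence might look roughly like the sketch below. The function name `add_by_threshold` and the scalar fallback are assumptions made for illustration (they are not part of the answer above), and the Neon path is only compiled when the compiler defines `__ARM_NEON`.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: in place, a[i] = (a[i] > 10) ? a[i] + 5 : a[i] + 10. */
void add_by_threshold(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t ten  = vdupq_n_f32(10.0f);
    const float32x4_t five = vdupq_n_f32(5.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v        = vld1q_f32(a + i);
        uint32x4_t  mask     = vcgtq_f32(v, ten);   /* all ones in lanes where v > 10 */
        float32x4_t if_val   = vaddq_f32(v, five);  /* "if" result, computed for every lane */
        float32x4_t else_val = vaddq_f32(v, ten);   /* "else" result, computed for every lane */
        /* Bitwise select: take if_val where the mask is set, else_val elsewhere. */
        vst1q_f32(a + i, vbslq_f32(mask, if_val, else_val));
    }
#endif
    /* Scalar tail; this is also the whole loop when Neon is unavailable. */
    for (; i < n; ++i)
        a[i] = (a[i] > 10.0f) ? a[i] + 5.0f : a[i] + 10.0f;
}
```

`vbslq_f32` takes the mask as its first argument, so the operand order differs from the `vbif`/`bif` form used in the assembly, but the selected result is the same.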

I will add to the answers so far by describing how to code this with Neon intrinsics.

  • In general, you do not follow the logic of an IF block based on the contents of a SIMD register, because one value may require one branch of the IF block while another value in the same register requires the other. "Eager execution" means first performing all possible calculations and then deciding, lane by lane, which results to actually use. (Remember that you gain nothing by performing a Neon calculation on only one lane of a register. Any calculation that is performed at all is performed on all 2 or 4 lanes.)

  • To perform an IF-based computation, use a Neon comparison intrinsic (for example, greater-than) to create a bitmask, and then a select intrinsic to fill in the final result according to the bitmask.

    double aval[2] = {11.5, 9.5};

    float64x2_t AA = vld1q_f64(aval);       // a vector with two 64-bit double values
    float64x2_t TEN = vmovq_n_f64(10.0);    // broadcast a constant into another vector
    float64x2_t FIVE = vmovq_n_f64(5.0);    // broadcast a constant into another vector
    // Do both of the computations
    float64x2_t VALIFTRUE = vaddq_f64(AA, TEN);     // {21.5, 19.5}
    float64x2_t VALIFFALSE = vaddq_f64(AA, FIVE);   // {16.5, 14.5}
    uint64x2_t IF1 = vcgtq_f64(AA, TEN);    // comparison "(if A > 10.)"

The return value of vcgtq_f64 is not a pair of doubles but two 64-bit unsigned integers. Together they form a mask that can be used by bitwise select intrinsics such as vbslq_f64. The first 64 bits of IF1 are all 1 (the greater-than condition was true for that lane), and the second 64 bits are all 0.

 AA = vbslq_f64(IF1, VALIFTRUE, VALIFFALSE); // {21.5, 14.5} 

... and each lane of AA is filled with the VALIFTRUE or the VALIFFALSE value for that lane, as appropriate.

  • What if eager execution is too slow, because one branch is very expensive to compute and you want to avoid it entirely when you can? You would need to check that the branch condition is false for every lane of the vector, and then skip the calculation with an ordinary scalar "if" statement. Perhaps someone can comment on how well this works in practice.
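One way the "skip the expensive branch" idea might be sketched is shown below. The names `any_lane_set`, `expensive_vec`, `expensive_scalar` and `process_skip_expensive` are hypothetical, and the "expensive" branch is only a stand-in that adds 5. The sketch assumes `vmaxvq_u32` on AArch64 and a pairwise-max reduction on 32-bit ARM to test whether any lane of the mask is set; whether the extra branch pays off depends on the data and the core, and is not measured here.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: nonzero if at least one lane of a comparison mask is set. */
static inline int any_lane_set(uint32x4_t mask)
{
#if defined(__aarch64__)
    return vmaxvq_u32(mask) != 0;              /* horizontal max (AArch64 only) */
#else
    uint32x2_t m = vorr_u32(vget_low_u32(mask), vget_high_u32(mask));
    return vget_lane_u32(vpmax_u32(m, m), 0) != 0;
#endif
}

/* Stand-in for a costly "if" branch; here it only adds 5 to every lane. */
static inline float32x4_t expensive_vec(float32x4_t v)
{
    return vaddq_f32(v, vdupq_n_f32(5.0f));
}
#endif

/* Scalar counterpart of the stand-in above. */
static inline float expensive_scalar(float x) { return x + 5.0f; }

/* In place: a[i] = (a[i] > 10) ? expensive(a[i]) : a[i] + 10. */
void process_skip_expensive(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t ten = vdupq_n_f32(10.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v      = vld1q_f32(a + i);
        uint32x4_t  mask   = vcgtq_f32(v, ten);
        float32x4_t result = vaddq_f32(v, ten);   /* cheap "else" branch, always computed */
        if (any_lane_set(mask))                   /* ordinary branch, taken once per vector */
            result = vbslq_f32(mask, expensive_vec(v), result);
        vst1q_f32(a + i, result);
    }
#endif
    /* Scalar tail; this is also the whole loop when Neon is unavailable. */
    for (; i < n; ++i)
        a[i] = (a[i] > 10.0f) ? expensive_scalar(a[i]) : a[i] + 10.0f;
}
```

The check moves a value from the vector unit to a general-purpose register, which has its own cost, so this is likely to help only when the expensive branch is rarely needed.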

In general, with SIMD branch logic, you generate a comparison mask and then use it to choose between alternative results. I will give pseudocode for your example, and you can convert it to intrinsics or asm as needed:

    v5 = vector(5)                 // set up some constant vectors
    v10 = vector(10)
    vMask = compare_gt(vA, v10)    // generate mask for vector compare A > 10
    vA = add(vA, v10)              // vA = vA + 10 (all elements, unconditionally)
    vTemp = and(v5, vMask)         // generate temp vector of 5 and 0 values based on mask
    vA = sub(vA, vTemp)            // subtract 5 from elements which are > 10
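A possible translation of this pseudocode into intrinsics is sketched below. The function name `add_by_threshold_masked` is hypothetical, and the scalar fallback is an addition for portability. Because the bitwise AND operates on integer vectors, the sketch assumes the float constant is reinterpreted as raw bits first and the result is reinterpreted back.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: in place, a[i] = (a[i] + 10) - (a[i] > 10 ? 5 : 0). */
void add_by_threshold_masked(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t ten       = vdupq_n_f32(10.0f);
    /* Raw bit pattern of 5.0f in every lane, so it can be ANDed with the mask. */
    const uint32x4_t  five_bits = vreinterpretq_u32_f32(vdupq_n_f32(5.0f));
    for (; i + 4 <= n; i += 4) {
        float32x4_t v    = vld1q_f32(a + i);
        uint32x4_t  mask = vcgtq_f32(v, ten);     /* all ones in lanes where v > 10 */
        /* 5.0f where the mask is set, 0.0f (all-zero bits) elsewhere. */
        float32x4_t tmp  = vreinterpretq_f32_u32(vandq_u32(five_bits, mask));
        v = vsubq_f32(vaddq_f32(v, ten), tmp);    /* add 10 everywhere, then take 5 back */
        vst1q_f32(a + i, v);
    }
#endif
    /* Scalar tail; this is also the whole loop when Neon is unavailable. */
    for (; i < n; ++i) {
        float delta = (a[i] > 10.0f) ? 5.0f : 0.0f;
        a[i] = (a[i] + 10.0f) - delta;
    }
}
```

Note that `(A + 10) - 5` goes through two floating-point roundings, so for some inputs it may differ in the last bit from computing `A + 5` directly.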