Horizontal Boolean Reduction Optimization in ARM NEON

I am experimenting with the cross-platform SIMD library ala ecmascript_simd aka SIMD.js , and part of this provides several "horizontal" SIMD operations. In particular, the API that the library offers includes the functions any(<boolN x M>) -> bool and all(<boolN x M>) -> bool , where <T x K> is the vector of K elements of type T and boolN is N -bit logical, i.e. all one or all zeros, as SSE and NEON are returned for their comparison operations.

For example, let v be <bool32 x 4> (128-bit vector), it could be the result of VCLT.S32 or something. I would like to calculate all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3] any(v) = v[0] || v[1] || v[2] || v[3] any(v) = v[0] || v[1] || v[2] || v[3] .

This is easy with SSE, for example. movmskps will extract the high bit of each element, so all for the type indicated above becomes (with C intrinsics):

 #include<xmmintrin.h> int all(__m128 x) { return _mm_movemask_ps(x) == 8 + 4 + 2 + 1; } 

and similarly for any .

I am trying to find obvious / good / effective ways to implement this with NEON, which does not support an instruction like movmskps . There is an approach of simple extraction of each element and calculation with scalars. For instance. There is a naive method, but there is an approach for using "horizontal" operations supported by NEON, for example VPMAX and VPMIN .

 #include<arm_neon.h> int all_naive(uint32x4_t v) { return v[0] && v[1] && v[2] && v[3]; } int all_horiz(uint32x4_t v) { uint32x2_t x = vpmin_u32(vget_low_u32(v), vget_high_u32(v)); uint32x2_t y = vpmin_u32(x, x); return x[0] != 0; } 

(For the latter, you can do something similar with VPADD , which can be faster, but in principle this is the same idea.)

Can other tricks be used to implement this?


Yes, I know that horizontal operations are small with SIMD vector units. But sometimes it’s useful, for example. many SIMD implementations of the mandblebrot will work with four points at the same time and exit the inner loop when they all go beyond the permissible limits ... that require a comparison, and then the horizontal and.

+6
source share

All Articles