Fastest way to test a 128-bit NEON register for a value of 0 using intrinsics?

I am looking for the fastest way to check whether a 128-bit NEON register is all zeros, using NEON intrinsics. I am currently using 3 OR operations and 2 MOV operations:

    uint32x4_t vr = vorrq_u32(vcmp0, vcmp1);
    uint64x2_t v0 = vreinterpretq_u64_u32(vr);
    uint64x1_t v0or = vorr_u64(vget_high_u64(v0), vget_low_u64(v0));
    uint32x2_t v1 = vreinterpret_u32_u64(v0or);
    uint32_t r = vget_lane_u32(v1, 0) | vget_lane_u32(v1, 1);

    if (r == 0)
    {
        // do stuff
    }

GCC compiles this to the following assembly code:

    VORR     q9, q9, q10
    VORR     d16, d18, d19
    VMOV.32  r3, d16[0]
    VMOV.32  r2, d16[1]
    ORRS     r2, r2, r3
    BEQ      ...

Does anyone have an idea for a faster way?

+5
4 answers

Although this answer may be a bit late, there is a simple way to do the test with only three instructions and no extra registers:

    inline uint32_t is_not_zero(uint32x4_t v)
    {
        uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
        return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
    }

The return value is nonzero if any bit in the 128-bit NEON register is set.
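To see why two steps are enough, here is a minimal sketch that models them in scalar code and uses the real intrinsics only when built for an ARM target with NEON. The name `is_not_zero_model` and its four scalar lane parameters are assumptions made for illustration; they are not part of the answer above.

```cpp
#include <algorithm>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: returns nonzero if any bit of the four 32-bit lanes is set.
inline std::uint32_t is_not_zero_model(std::uint32_t l0, std::uint32_t l1,
                                       std::uint32_t l2, std::uint32_t l3) {
#if defined(__ARM_NEON)
    const std::uint32_t lanes[4] = {l0, l1, l2, l3};
    const uint32x4_t v = vld1q_u32(lanes);
    const uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
#else
    // Step 1 (VORR): fold the high 64 bits onto the low 64 bits.
    const std::uint32_t t0 = l0 | l2;
    const std::uint32_t t1 = l1 | l3;
    // Step 2 (VPMAX): lane 0 of the pairwise maximum is max(t0, t1),
    // which is zero only if both folded lanes are zero.
    return std::max(t0, t1);
#endif
}
```

The pairwise maximum is used only as a cheap way to combine the two remaining lanes into one; any operation that is nonzero whenever either input is nonzero would serve the same purpose.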

+6

If you are targeting AArch64 NEON, you can use the following to get the value to test with just two instructions:

    inline uint64_t is_not_zero(uint32x4_t v)
    {
        uint64x2_t v64 = vreinterpretq_u64_u32(v);
        uint32x2_t v32 = vqmovn_u64(v64);
        uint64x1_t result = vreinterpret_u64_u32(v32);
        return result[0];
    }
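
The reason a saturating narrow is used here, rather than a plain truncating one, can be sketched in scalar code. The function below is a hypothetical model: the name `is_not_zero_qmovn` and its two 64-bit parameters are assumptions for illustration. It calls the real intrinsic on AArch64 and falls back to an equivalent scalar computation elsewhere.

```cpp
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical model: lo and hi are the two 64-bit halves of the register.
inline std::uint64_t is_not_zero_qmovn(std::uint64_t lo, std::uint64_t hi) {
#if defined(__aarch64__)
    const uint64x2_t v64 = vcombine_u64(vcreate_u64(lo), vcreate_u64(hi));
    return vget_lane_u64(vreinterpret_u64_u32(vqmovn_u64(v64)), 0);
#else
    // Saturating narrow: values above 0xFFFFFFFF clamp to 0xFFFFFFFF,
    // so a nonzero 64-bit lane can never narrow to zero.
    const std::uint64_t max32 = 0xFFFFFFFFu;
    const std::uint64_t n0 = lo > max32 ? max32 : lo;
    const std::uint64_t n1 = hi > max32 ? max32 : hi;
    return n0 | (n1 << 32);
#endif
}
```

A truncating narrow (`vmovn_u64`) would map a lane such as `0x100000000` to zero and report a false "all zero", which is why the saturating form is needed.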
+2

You seem to be looking for intrinsics, and this is one way to do it:

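    inline bool is_zero(int32x4_t v) noexcept
    {
        // Per-lane mask: all ones where the lane is nonzero, zero otherwise.
        v = v != int32x4_t{};

        // Gather the low byte of each of the four lanes into one 32-bit word;
        // that word is zero only if every lane was zero.
        return !int32x2_t(
            vtbl2_s8(
                int8x8x2_t{
                    int8x8_t(vget_low_s32(v)),
                    int8x8_t(vget_high_s32(v))
                },
                int8x8_t{0, 4, 8, 12}
            )
        )[0];
    }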

Nils Pipenbrinck's answer has the disadvantage that it assumes QC, the cumulative saturation flag, is clear.

+1

If you are targeting AArch64, it is even easier: there is a new instruction designed for this.

    inline uint32_t is_not_zero(uint32x4_t v)
    {
        // Sums the four lanes; note that the sum wraps modulo 2^32.
        return vaddvq_u32(v);
    }
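
One caveat worth sketching: an across-lanes sum can wrap around to zero even when some lanes are nonzero (for example, two lanes holding `0x80000000`). If every lane is a comparison mask (0 or `0xFFFFFFFF`), the sum cannot wrap to zero, so the function above is safe in that case. For arbitrary data, the across-lanes maximum `vmaxvq_u32` is a possible alternative. The helper names `lane_sum` and `lane_max` below are hypothetical; each uses the real intrinsic on AArch64 and a scalar model elsewhere.

```cpp
#include <algorithm>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Across-lanes sum, as vaddvq_u32 computes it (wraps modulo 2^32).
inline std::uint32_t lane_sum(std::uint32_t a, std::uint32_t b,
                              std::uint32_t c, std::uint32_t d) {
    const std::uint32_t lanes[4] = {a, b, c, d};
#if defined(__aarch64__)
    return vaddvq_u32(vld1q_u32(lanes));
#else
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];  // scalar model
#endif
}

// Across-lanes maximum, as vmaxvq_u32 computes it: zero only if all lanes are zero.
inline std::uint32_t lane_max(std::uint32_t a, std::uint32_t b,
                              std::uint32_t c, std::uint32_t d) {
    const std::uint32_t lanes[4] = {a, b, c, d};
#if defined(__aarch64__)
    return vmaxvq_u32(vld1q_u32(lanes));
#else
    return std::max(std::max(lanes[0], lanes[1]),
                    std::max(lanes[2], lanes[3]));  // scalar model
#endif
}
```

With lanes `{0x80000000, 0x80000000, 0, 0}`, `lane_sum` returns 0 while `lane_max` returns `0x80000000`, so only the maximum reports the register as nonzero.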
0
