Summing 3 lanes in a NEON float32x4_t

Question

I'm vectorizing an inner loop with ARM NEON intrinsics (LLVM, iOS). I generally use float32x4_t values. My computation ends with the need to sum three of the four floats in this vector.

I can drop back to C floats at this point and use vst1q_f32 to get the four values out, then add up the three that I need. But I suspect it would be more efficient to do it directly on the vector in an instruction or two and then grab a single-lane result. I could not figure out any clear path to doing this.
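For reference, the store-and-add baseline described above might look like the following minimal sketch. The function name `sum3_via_store` and the pointer-based interface are assumptions made for illustration, and the scalar branch exists only so the sketch also builds on non-ARM hosts.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums lanes 0..2 of the four floats at p and
 * ignores lane 3. */
float sum3_via_store(const float *p)
{
    float lanes[4];
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(p);  /* stands in for the result of the vector math */
    vst1q_f32(lanes, v);           /* spill all four lanes to memory */
#else
    /* Scalar fallback (assumption): lets the sketch compile without NEON. */
    for (int i = 0; i < 4; ++i) {
        lanes[i] = p[i];
    }
#endif
    return lanes[0] + lanes[1] + lanes[2];
}
```

Whether the round trip through memory is actually slower than a vector reduction is something to measure on the target device.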

I am new to NEON programming, and the existing “documentation” is pretty terrifying. Any ideas? Thanks!

+7

arm ios simd neon intrinsics

Ben zotto Dec 14 '12 at 0:50

3 answers

auselen · Answer 1 · 2012-12-14T09:17:15+0000

You should be able to use the VFP unit for this task. NEON and VFP share the same register bank, which means you do not need to shuffle values between registers to use either unit, and the two can take different views of the same register bits.

Your float32x4_t is 128 bits wide, so it should sit in a quad (Q) register. If you use only intrinsics, you do not know which register that is. The problem is that if it lands in a high Q register, VFP cannot see its lanes as single-precision registers, since the S0-S31 names only alias Q0-Q7 (for the curious reader: I kept this simple, since there are differences between VFP versions, and this is the minimum requirement). It would therefore be best to move your float32x4_t to a fixed register such as Q0. After that, you can simply sum registers S0, S1, and S2 with vadd.f32 and move the result back to an ARM register.

Some warnings: VFP and NEON are, in theory, different execution units that share the same register bank and pipeline. I'm not sure this approach is better than the others, so needless to say you should benchmark it. This approach also cannot be expressed with NEON intrinsics, so you will probably need to write your code with inline assembly.

I made a simple snippet to see how this might look, and I came up with the following:

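    #include "arm_neon.h"

    float32_t sum3() {
        /* Pin the vector to q0 so its lanes are visible as s0..s3. */
        register float32x4_t v asm ("q0");
        float32_t ret;
        asm volatile(
            "vadd.f32 s0, s1\n"    /* s0 += s1 */
            "vadd.f32 s0, s2\n"    /* s0 += s2 */
            "vmov %[ret], s0\n"    /* move the sum to a core register */
            : [ret] "=r" (ret)
            :
            :);
        return ret;
    }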

The objdump output looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):

    00000000 <sum3>:
       0: ee30 0a20   vadd.f32 s0, s0, s1
       4: ee30 0a01   vadd.f32 s0, s0, s2
       8: ee10 3a10   vmov r0, s0
       c: 4770        bx lr
       e: bf00        nop

I'd really like to hear your impressions if you give it a try!

Jesse rusak · Answer 2 · 2012-12-14T01:23:38+0000

Can you zero out the fourth element? Perhaps just by copying the vector and using vsetq_lane_f32?

If so, you can use the answers from "Sum all elements in a quadword vector in ARM assembly with NEON", for example:

    float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
    return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements

This actually does a little more work than you need, though, so it might be faster to just extract the three floats with vgetq_lane_f32 and add them.
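Both options from this answer can be put side by side in a minimal sketch. It uses the q-suffixed intrinsics `vsetq_lane_f32` and `vgetq_lane_f32`, which are the forms that accept a float32x4_t. The function names, the pointer-based wrappers, and the scalar fallback are assumptions added for illustration, not part of the answer.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Variant A: overwrite lane 3 with zero, then reduce all four lanes. */
static float sum3_zeroed_vec(float32x4_t input)
{
    input = vsetq_lane_f32(0.0f, input, 3);
    /* r = { in[2] + in[0], in[3] + in[1] } */
    float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
    /* vpadd adds adjacent elements, so lane 0 holds r[0] + r[1]. */
    return vget_lane_f32(vpadd_f32(r, r), 0);
}

/* Variant B: extract the three wanted lanes and add them as scalars. */
static float sum3_extract_vec(float32x4_t input)
{
    return vgetq_lane_f32(input, 0)
         + vgetq_lane_f32(input, 1)
         + vgetq_lane_f32(input, 2);
}
#endif

/* Hypothetical wrappers: take a pointer to four floats so the sketch
 * also builds (with a scalar fallback) on hosts without NEON. */
float sum3_zeroed(const float *p)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_zeroed_vec(vld1q_f32(p));
#else
    return (p[2] + p[0]) + (0.0f + p[1]);
#endif
}

float sum3_extract(const float *p)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_extract_vec(vld1q_f32(p));
#else
    return p[0] + p[1] + p[2];
#endif
}
```

The two variants add the lanes in a different order, so their results can differ in the last bit for inputs that do not sum exactly.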

rob mayoff · Answer 3 · 2012-12-14T01:22:56+0000

It sounds like you want to use (some version of) VLD1 to load zero into your extra lane (unless you can arrange for it to be zero already), and then two VPADDL instructions to pairwise-sum the four lanes into two and then the two lanes into one.
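A minimal intrinsics sketch of this idea follows. Note one substitution: as far as I know, VPADDL is defined only for integer element types, so the sketch uses the float pairwise add `vpadd_f32` (VPADD) instead. The lane load uses `vld1q_lane_f32`; the function names, the pointer-based wrapper, and the scalar fallback are assumptions for illustration.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

static float sum3_pairwise_vec(float32x4_t v)
{
    static const float zero = 0.0f;
    /* VLD1 (single lane): load 0.0f from memory into lane 3. */
    v = vld1q_lane_f32(&zero, v, 3);
    /* First pairwise add: { v0 + v1, v2 + v3 } */
    float32x2_t pairs = vpadd_f32(vget_low_f32(v), vget_high_f32(v));
    /* Second pairwise add: lane 0 = (v0 + v1) + (v2 + v3) */
    return vget_lane_f32(vpadd_f32(pairs, pairs), 0);
}
#endif

/* Hypothetical wrapper: takes a pointer to four floats so the sketch
 * also builds (with a scalar fallback) on hosts without NEON. */
float sum3_pairwise(const float *p)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return sum3_pairwise_vec(vld1q_f32(p));
#else
    return (p[0] + p[1]) + (p[2] + 0.0f);
#endif
}
```

If the fourth lane can be arranged to be zero earlier in the computation, the lane load can be dropped and only the two pairwise adds remain.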
