Summing 3 lanes in a NEON float32x4_t

I'm vectorizing an inner loop with ARM NEON intrinsics (LLVM, iOS). I mostly use float32x4_t vectors. My calculation ends with the need to sum three of the four floats in such a vector.

I can drop back to C floats at this point: use vst1q_f32 to store the four values and add the three that I need. But I suspect it would be more efficient if there were a way to do it directly on the vector in an instruction or two and then extract a single-lane result. I could not figure out any clear path for this.
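For reference, the store-and-add baseline described above might look roughly like the following sketch. The function name sum3_store and the pointer-based interface are hypothetical, chosen only so the snippet compiles on its own; the scalar branch is a fallback for targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SUM3_HAVE_NEON 1
#else
#define SUM3_HAVE_NEON 0
#endif

/* Hypothetical helper: sums the first three of the four floats at p. */
float sum3_store(const float *p)
{
#if SUM3_HAVE_NEON
    float32x4_t v = vld1q_f32(p);   /* stand-in for the vector the loop produced */
    float tmp[4];
    vst1q_f32(tmp, v);              /* spill all four lanes to memory */
    return tmp[0] + tmp[1] + tmp[2];/* scalar adds, lane 3 is ignored */
#else
    return p[0] + p[1] + p[2];      /* scalar fallback without NEON */
#endif
}
```

Calling sum3_store on {1, 2, 3, 100} returns 6; the fourth value never enters the sum.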

I am new to NEON programming, and the existing “documentation” is pretty terrifying. Any ideas? Thanks!

+7
3 answers

You should be able to use the VFP unit for this task. NEON and VFP share the same register bank, which means you do not need to shuffle values between registers to take advantage of either unit, and they can also interpret the same register bits as different types.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/ch05s03s02.html

Your float32x4_t is 128 bits, so it should sit in a quad (Q) register. If you use only intrinsics, you do not know which register is being used. The problem is that if it sits above Q4, VFP cannot view it as single precision (for the curious reader: I kept this simple, since there are differences between VFP versions, and this is the minimum requirement). Therefore, it would be better to move your float32x4_t to a fixed register, such as Q0. After that, you can simply sum registers S0, S1 and S2 with vadd.f32 and move the result back to an ARM register.

Some warnings: VFP and NEON are theoretically different execution units sharing the same register bank and pipeline. I'm not sure this approach is better than the others; needless to say, you have to benchmark it. Also, this approach cannot be expressed with NEON intrinsics, so you will probably need to write your own code with inline assembly.

I made a simple snippet to see how this might look, and I came up with the following:

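 #include "arm_neon.h"

 float32_t sum3()
 {
     /* Pin the vector to q0 so its lanes alias s0..s3.
        v is left uninitialized here; real code would assign the vector to sum. */
     register float32x4_t v asm ("q0");
     float32_t ret;

     asm volatile(
         "vadd.f32 s0, s1\n"    /* s0 = s0 + s1 */
         "vadd.f32 s0, s2\n"    /* s0 = s0 + s2 */
         "vmov %[ret], s0\n"    /* move the sum to an ARM core register */
         : [ret] "=r" (ret)
         :
         :);
     return ret;
 }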

The objdump output looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):

 00000000 <sum3>:
    0: ee30 0a20  vadd.f32 s0, s0, s1
    4: ee30 0a01  vadd.f32 s0, s0, s2
    8: ee10 3a10  vmov r0, s0
    c: 4770       bx lr
    e: bf00       nop

I'd really like to hear your impressions if you give it a try!

+4

Can you zero out the fourth element? Perhaps just by copying the vector and using vset_lane_f32 (vsetq_lane_f32 for a float32x4_t)?

If so, you can use the answers from "Sum all elements in a quadword vector in ARM assembly with NEON", for example:

 float32x2_t r = vadd_f32(vget_high_f32(input), vget_low_f32(input));
 return vget_lane_f32(vpadd_f32(r, r), 0); // vpadd adds adjacent elements

That said, this actually does a little more work than you need, so it may be simpler to just extract the three floats with vget_lane_f32 and add them.
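Both options from this answer can be sketched side by side. This is only an illustration under some assumptions: the function names sum3_zero_lane and sum3_extract are hypothetical, the input is loaded from a pointer so the snippet stands alone, and the q-register forms vsetq_lane_f32 and vgetq_lane_f32 are used because the input is a float32x4_t. Which variant is faster would need benchmarking on the target.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SUM3_HAVE_NEON 1
#else
#define SUM3_HAVE_NEON 0
#endif

/* Hypothetical: zero lane 3, then reduce all four lanes. */
float sum3_zero_lane(const float *p)
{
#if SUM3_HAVE_NEON
    float32x4_t v = vld1q_f32(p);
    v = vsetq_lane_f32(0.0f, v, 3);                   /* lane 3 := 0 */
    float32x2_t r = vadd_f32(vget_high_f32(v),
                             vget_low_f32(v));        /* {v2+v0, 0+v1} */
    return vget_lane_f32(vpadd_f32(r, r), 0);         /* add the two halves */
#else
    return (p[2] + p[0]) + (0.0f + p[1]);             /* same order, scalar */
#endif
}

/* Hypothetical: extract lanes 0..2 and add them as scalars. */
float sum3_extract(const float *p)
{
#if SUM3_HAVE_NEON
    float32x4_t v = vld1q_f32(p);
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1) + vgetq_lane_f32(v, 2);
#else
    return p[0] + p[1] + p[2];
#endif
}
```

For the input {1, 2, 3, 100}, both functions return 6.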

+3

It looks like you want to use (some version of) VLD1 to load zero into your extra lane (if you cannot establish that it is already zero), and then two VPADDL instructions to sum the four lanes into two and then the two lanes into one.
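With intrinsics, the idea above might be sketched as follows. Note the substitution: this sketch uses vpadd_f32 (VPADD) for the pairwise steps rather than VPADDL, because the vpaddl intrinsics are defined for integer element types. The function name sum3_pairwise and the pointer-based interface are hypothetical, and the scalar branch is only a fallback for non-NEON targets.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SUM3_HAVE_NEON 1
#else
#define SUM3_HAVE_NEON 0
#endif

/* Hypothetical: load 0.0 into lane 3, then two pairwise adds. */
float sum3_pairwise(const float *p)
{
#if SUM3_HAVE_NEON
    const float zero = 0.0f;
    float32x4_t v = vld1q_f32(p);
    v = vld1q_lane_f32(&zero, v, 3);                     /* single-lane VLD1: lane 3 := 0 */
    float32x2_t pairs = vpadd_f32(vget_low_f32(v),
                                  vget_high_f32(v));     /* {v0+v1, v2+0} */
    return vget_lane_f32(vpadd_f32(pairs, pairs), 0);    /* (v0+v1) + v2 */
#else
    return (p[0] + p[1]) + (p[2] + 0.0f);                /* same order, scalar */
#endif
}
```

For the input {1, 2, 3, 100}, sum3_pairwise returns 6.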

+2
