You should be able to use the VFP unit for this task. NEON and VFP share the same register bank, which means you do not need to shuffle values between registers to take advantage of either unit, and the two can also have different views of the same register bits.

Your float32x4_t is 128 bits wide, so it must sit in a quad (Q) register. If you use only intrinsics, you do not know which one it is. The problem is that if it ends up in a high Q register, VFP cannot view it as single precision: only Q0 to Q7 overlap the single-precision registers S0 to S31 (for the curious reader: I kept this simple, since there are differences between VFP versions, and this is the bare minimum requirement). Therefore, it is best to move your float32x4_t to a fixed register, such as Q0. After that, you can simply sum registers such as S0, S1 and S2 with vadd.f32 and move the result back to an ARM register.

Some warnings: VFP and NEON are theoretically different execution units sharing the same register bank and pipeline. I am not sure this approach is any better than the others, so, needless to say, you should benchmark it. Also, this approach does not fit in with the NEON intrinsics, so you will probably need to write your code with inline assembly.
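To make the register overlap concrete, here is a minimal sketch in plain C that only models the idea; it does not touch real NEON or VFP registers. The names `q_reg_model` and `sum3_model` are made up for illustration. The union stands in for Q0 and its S0 to S3 views, and the two additions mirror the two vadd.f32 steps described above.

```c
#include <string.h>

/* Hypothetical model of one quad register: the same 128 bits can be
 * read as raw bytes (the Q view) or as four floats (the S0..S3 views).
 * This is an illustration only, not real register access. */
typedef union {
    unsigned char q[16]; /* Q0 view: 128 raw bits */
    float s[4];          /* S0, S1, S2, S3 views */
} q_reg_model;

/* Mirrors the sequence: s0 = s0 + s1; s0 = s0 + s2; return s0.
 * The fourth lane (S3) is deliberately ignored. */
static float sum3_model(const float lanes[4])
{
    q_reg_model q0;
    memcpy(q0.q, lanes, sizeof q0.q); /* "move the vector into Q0" */
    q0.s[0] = q0.s[0] + q0.s[1];      /* vadd.f32 s0, s0, s1 */
    q0.s[0] = q0.s[0] + q0.s[2];      /* vadd.f32 s0, s0, s2 */
    return q0.s[0];                   /* vmov to an ARM register */
}
```

Calling `sum3_model` on the lanes {1, 2, 3, 100} gives 6, since the last lane never takes part in the sum.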
I made a simple snippet to see how this might look, and I came up with the following:
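    #include "arm_neon.h"

    float32_t sum3()
    {
        /* Pin the vector to q0 so that its lanes are visible as s0..s3. */
        register float32x4_t v asm ("q0");
        float32_t ret;

        asm volatile(
            "vadd.f32 s0, s1\n"    /* s0 = s0 + s1 */
            "vadd.f32 s0, s2\n"    /* s0 = s0 + s2 */
            "vmov %[ret], s0\n"    /* move the sum to an ARM register */
            : [ret] "=r" (ret)
            :
            :);

        return ret;
    }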
The objdump output looks like this (compiled with gcc -O3 -mfpu=neon -mfloat-abi=softfp):
    00000000 <sum3>:
       0:   ee30 0a20   vadd.f32 s0, s0, s1
       4:   ee30 0a01   vadd.f32 s0, s0, s2
       8:   ee10 3a10   vmov r0, s0
       c:   4770        bx lr
       e:   bf00        nop
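For the benchmark, you need something to compare against. The following is only a sketch of one possible baseline, not a recommendation: the function name `sum3_baseline` is made up, and it assumes the four floats are available in memory. On a NEON target it extracts the three lanes with the vgetq_lane_f32 intrinsic; elsewhere it falls back to plain scalar adds so that it still compiles.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical baseline for comparison: sums the first three of four
 * floats, in the same order as the inline assembly version. */
static float sum3_baseline(const float lanes[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t v = vld1q_f32(lanes); /* load four floats into a Q register */
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1) + vgetq_lane_f32(v, 2);
#else
    /* Scalar fallback for targets without NEON. */
    return lanes[0] + lanes[1] + lanes[2];
#endif
}
```

Timing both versions over a large number of calls on your actual target is the only way to tell which one wins.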
I would really like to hear your impressions if you give it a try!