Sum all elements in a quadword vector in ARM assembly with NEON

Question

I'm rather new to assembly, and although the ARM Information Center is often helpful, the instruction descriptions can be a little confusing for a beginner. Basically, what I need to do is sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.

+6

assembly math arm neon

A person Aug 3 '11 at 18:17

3 answers

You can try this (it is not in ASM, but you should be able to convert it easily):

 float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
 return vget_lane_f32(vpadd_f32(r, r), 0);

In ASM, this would probably be just a VADD and a VPADD.

I'm not sure if this is the only way to do it (or the most optimal one), but I have not figured out or found a better one...
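For reference, here is the snippet above wrapped into a complete function. This is only a sketch: the names `hsum_f32x4` and `sum4` are made up for illustration, and `m_type` from the snippet is assumed to be a `float32x4_t`. The scalar branch is there only so the code also builds on a machine without NEON; it performs the additions in the same order.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Horizontal sum of the four lanes of a quadword vector (hypothetical helper). */
static float hsum_f32x4(float32x4_t v)
{
    /* Add the high half (v2, v3) to the low half (v0, v1): r = (v0 + v2, v1 + v3). */
    float32x2_t r = vadd_f32(vget_high_f32(v), vget_low_f32(v));
    /* Pairwise add: lane 0 becomes (v0 + v2) + (v1 + v3). */
    return vget_lane_f32(vpadd_f32(r, r), 0);
}

/* Load four floats from memory and sum them. */
static float sum4(const float *p)
{
    return hsum_f32x4(vld1q_f32(p));
}
#else
/* Portable fallback for non-NEON builds, same addition order as above. */
static float sum4(const float *p)
{
    return (p[0] + p[2]) + (p[1] + p[3]);
}
#endif
```

Calling `sum4` on `{1.0f, 2.0f, 3.0f, 4.0f}` returns `10.0f`.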

PS. I am also new to NEON.

+2

kibab Aug 3 '11 at 20:13

Here is the code in ASM:

  vpadd.f32 d1,d6,d7    @ q3 is register that needs all of its contents summed
  vadd.f32  s1,s2,s3    @ now we add the contents of d1 together (the sum)
  vadd.f32  s0,s0,s1    @ sum += s1;

Perhaps I forgot to mention that in C the code would look like this:

 float sum = 1.0f;
 sum += number1 * number2;

I have omitted the multiplication from this small piece of asm code.
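To make the mapping between the three instructions and the C fragment a bit more concrete, here is a rough intrinsics sketch including the omitted multiply. It assumes that `number1` and `number2` are really four-float vectors that get multiplied lane by lane; the function name `accumulate_dot4` and its parameters are invented for illustration. The scalar branch exists only so the code builds without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* sum += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3] (hypothetical helper) */
static float accumulate_dot4(float sum, const float *a, const float *b)
{
    /* The multiply that was left out of the asm snippet. */
    float32x4_t prod = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    /* Like vpadd.f32 d1, d6, d7: pair = (p0 + p1, p2 + p3). */
    float32x2_t pair = vpadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    /* Like vadd.f32 s1, s2, s3: add the two halves of the pair. */
    float total = vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
    /* Like vadd.f32 s0, s0, s1: accumulate into the running sum. */
    return sum + total;
}
#else
/* Portable fallback for non-NEON builds, same addition order as above. */
static float accumulate_dot4(float sum, const float *a, const float *b)
{
    float lo = a[0] * b[0] + a[1] * b[1];
    float hi = a[2] * b[2] + a[3] * b[3];
    return sum + (lo + hi);
}
#endif
```

With `sum = 1.0f`, `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, this returns `71.0f`.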

+2

A person Aug 05 '11 at 18:33

Jake 'Alquimista' LEE · Accepted Answer · 2011-11-01T06:21:12+0000

It seems that you want to get the sum of an array of some length, not just four float values.

In that case, your code will work, but it is far from optimized:

lots of pipeline interlocks
an unnecessary 32-bit addition per iteration

Assuming the array length is a multiple of 8 and at least 16:

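    vldmia      pSrc!, {q0-q1}      @ load the first 8 floats into the two accumulators
    sub         count, count, #8
loop:
    pld         [pSrc, #32]         @ prefetch the data for the next iteration
    vldmia      pSrc!, {q3-q4}      @ next 8 floats (q4 is callee-saved under the AAPCS)
    subs        count, count, #8
    vadd.f32    q0, q0, q3
    vadd.f32    q1, q1, q4
    bgt         loop
    vadd.f32    q0, q0, q1          @ combine the two accumulators
    vpadd.f32   d0, d0, d1          @ d0 = (s0 + s1, s2 + s3)
    vadd.f32    s0, s0, s1          @ final sum in s0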

pld, although it is an ARM instruction rather than a NEON one, is crucial for performance. It drastically increases the cache hit rate.

I hope the rest of the code above is self-explanatory.

You will notice that this version is many times faster than your original version.
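If you would rather stay in C, the same loop can be sketched with intrinsics. Treat this as an illustration under assumptions, not a drop-in replacement: `sum_array` is a made-up name, `__builtin_prefetch` is a GCC/Clang builtin standing in for `pld`, and the compiler decides the actual register allocation and scheduling. Like the asm, it assumes the length is a multiple of 8. The scalar branch exists only so the code builds without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Sum `count` floats; count must be a multiple of 8 and at least 8. */
static float sum_array(const float *src, size_t count)
{
    /* Two independent accumulators, so consecutive adds do not wait on each other. */
    float32x4_t acc0 = vld1q_f32(src);
    float32x4_t acc1 = vld1q_f32(src + 4);
    for (size_t i = 8; i < count; i += 8) {
        __builtin_prefetch(src + i + 8);   /* hint for the next iteration's data */
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }
    /* The horizontal reduction happens once, after the loop. */
    acc0 = vaddq_f32(acc0, acc1);
    float32x2_t pair = vpadd_f32(vget_low_f32(acc0), vget_high_f32(acc0));
    return vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
}
#else
/* Portable fallback for non-NEON builds: eight scalar lanes, same reduction order. */
static float sum_array(const float *src, size_t count)
{
    float acc[8];
    for (size_t k = 0; k < 8; ++k)
        acc[k] = src[k];
    for (size_t i = 8; i < count; i += 8)
        for (size_t k = 0; k < 8; ++k)
            acc[k] += src[i + k];
    for (size_t k = 0; k < 4; ++k)
        acc[k] += acc[k + 4];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
#endif
```

For an array holding 1.0f through 16.0f, `sum_array(data, 16)` returns `136.0f`. Note that the lanes are summed in a different order than a plain left-to-right scalar loop, so results for arbitrary data can differ in the last bits.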
