It seems that you want to sum an array of a given length, not just four float values.
In that case your code will work, but it is far from optimized.
Assuming the array length is a multiple of 8 and at least 16:
    vldmia {q0-q1}, [pSrc]!
    sub count, count, #8
loop:
    pld [pSrc, #32]
    vldmia {q3-q4}, [pSrc]!
    subs count, count,
pld, an ARM instruction rather than a NEON one, is critical to performance: it preloads data into the cache ahead of the loads that need it, which greatly increases the cache hit rate.
Hopefully the rest of the code above is self-evident.
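In case the structure is not obvious, here is a rough C model of the same idea. It is only a sketch, not the assembly itself: the function name `sum_f32_by8` and the `acc` array are made up for illustration, `acc` stands in for the eight float lanes of q0-q1, and `__builtin_prefetch` (a GCC/Clang builtin) plays the role of pld. It assumes the same preconditions as above: the length is a multiple of 8 and at least 16.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums `count` floats.
 * Assumes count is a multiple of 8 and at least 16. */
static float sum_f32_by8(const float *src, size_t count)
{
    float acc[8];   /* stands in for the 8 float lanes of q0-q1 */
    float total = 0.0f;
    size_t i, lane;

    assert(count >= 16 && count % 8 == 0);

    /* Preload the first 8 values into the accumulators. */
    for (lane = 0; lane < 8; ++lane)
        acc[lane] = src[lane];

    /* Consume 8 floats per iteration, adding lane by lane. */
    for (i = 8; i < count; i += 8) {
#if defined(__GNUC__) || defined(__clang__)
        /* Cache hint for the next chunk, similar in spirit to pld. */
        __builtin_prefetch(src + i + 8);
#endif
        for (lane = 0; lane < 8; ++lane)
            acc[lane] += src[i + lane];
    }

    /* Horizontal reduction: fold the 8 partial sums into one value. */
    for (lane = 0; lane < 8; ++lane)
        total += acc[lane];

    return total;
}
```

Calling `sum_f32_by8(data, 16)` on an array holding 1.0f through 16.0f returns 136.0f. Keeping eight independent partial sums is what lets the vector unit add two full registers per iteration; the single scalar result is only produced once, after the loop.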
You will notice that this version is many times faster than your original version.
Jake 'Alquimista' LEE