I am trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON assembly version I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loads (vld1.32) are what take most of the time. I experimented by removing some instructions.
Does anyone have any insight into this problem? Thanks!
template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
        // load Jtr into registers
        "vld1.32   {d0-d3}, [%0]\n\t"
        // load J into registers
        "vld1.32   {d4-d7}, [%1]\n\t"
        // load residual in register
        "vmov.f32  s16, %2\n\t"
        // Jtr -= J*residual
        "vmls.f32  q0, q2, d8[0]\n\t"
        "vmls.f32  q1, q3, d8[0]\n\t"
        // store result
        "vst1.32   {d0-d3}, [%0]\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // registers
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
          "d8", "d9", "d10", "d11", "d12", "d13", "d14"
    );
}
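For comparison, here is a minimal sketch of the same Jtr -= J*residual update written with NEON intrinsics instead of inline asm. This is not the original code: it assumes NFloat is a plain 32-bit float and that arm_neon.h is available, and it exists only to show an equivalent formulation where the compiler is free to schedule the loads, multiply-subtracts, and stores itself.

#include <arm_neon.h>

// Sketch only (assumption: NFloat == float). Same 8-element update as above,
// expressed with intrinsics rather than hand-written assembly.
inline void GaussNewtonOperationJtr8x8_Intrinsics(float Jtr[8],
                                                  const float J[8],
                                                  float residual)
{
    float32x4_t jtr_lo = vld1q_f32(Jtr);      // Jtr[0..3]
    float32x4_t jtr_hi = vld1q_f32(Jtr + 4);  // Jtr[4..7]
    float32x4_t j_lo   = vld1q_f32(J);        // J[0..3]
    float32x4_t j_hi   = vld1q_f32(J + 4);    // J[4..7]

    // vmlsq_n_f32(a, b, c) computes a - b*c in each lane.
    jtr_lo = vmlsq_n_f32(jtr_lo, j_lo, residual);
    jtr_hi = vmlsq_n_f32(jtr_hi, j_hi, residual);

    vst1q_f32(Jtr,     jtr_lo);
    vst1q_f32(Jtr + 4, jtr_hi);
}

Benchmarking this in the same 100,000-iteration loop as the two functions above would give a like-for-like comparison, since the compiler can interleave these loads and stores with the surrounding code instead of being forced into the fixed instruction order of the asm block.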