Does NEON ASM code run much slower than C code?

I am trying to implement Gauss-Newton optimization for a specific problem on an iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loads (vld1.32) are what take most of the time. I experimented by removing some instructions.

Does anyone have any insight into this problem? Thanks!

template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
        // load Jtr into registers
        "vld1.32   {d0-d3}, [%0]\n\t"
        // load J into registers
        "vld1.32   {d4-d7}, [%1]\n\t"
        // load residual into register
        "vmov.f32  s16, %2\n\t"
        // Jtr -= J*residual
        "vmls.f32  q0, q2, d8[0]\n\t"
        "vmls.f32  q1, q3, d8[0]\n\t"
        // store result
        "vst1.32   {d0-d3}, [%0]\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // registers
        : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
    );
}
+4
4 answers
  • Do not use d8-d15. They are callee-saved, so they must be saved on the stack before use and restored afterwards; the compiler does this by spending valuable cycles.
  • Load J before Jtr. Jtr is needed at a later pipeline stage than J.
  • Use VLDMIA/VSTMIA instead of VLD/VST. VLDMIA/VSTMIA are faster and have pipelining advantages.
  • Use vector-vector multiplication instead of vector-scalar multiplication.
  • If you write a looped version, put PLD instructions at the start and unroll the loop so that 64 bytes are read from each pointer per iteration (see the looped sketch after the corrected code below).

Apart from the mistakes mentioned above, which are typical of people new to NEON, your approach is very good. You found the most suitable instruction in vmls.

Well done.

{
    __asm__ volatile (
        // load residual into register
        "vdup.32   q12, %2\n\t"
        // load J into registers
        "vldmia    %1, {q10-q11}\n\t"
        // load Jtr into registers
        "vldmia    %0, {q8-q9}\n\t"
        // Jtr -= J*residual
        "vmls.f32  q8, q10, q12\n\t"
        "vmls.f32  q9, q11, q12\n\t"
        // store result
        "vstmia    %0, {q8-q9}\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // registers
        : "q8", "q9", "q10", "q11", "q12"
    );
}
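A looped version along the lines of the last bullet might look like the sketch below. This is my own illustration, not code from the original answer: the function name, the even-n assumption, and the PLD offset are all hypothetical. It keeps the Jtr accumulator in q8-q9 for the whole loop, prefetches J at the top, and reads 64 bytes of J (two 8-float rows) per iteration:

void GaussNewtonAccumulateJtr_NEON(float Jtr[8], const float* J, const float* residuals, int n)
{
    // Sketch only: n is assumed to be a positive multiple of 2.
    __asm__ volatile (
        "vldmia    %3, {q8-q9}\n\t"           // Jtr accumulator stays in q8-q9
        "1:\n\t"
        "pld       [%0, #128]\n\t"            // prefetch J ahead of use
        "vld1.32   {d0[], d1[]}, [%1]!\n\t"   // broadcast residuals[i] into q0
        "vld1.32   {d2[], d3[]}, [%1]!\n\t"   // broadcast residuals[i+1] into q1
        "vldmia    %0!, {q10-q13}\n\t"        // two 8-float rows of J (64 bytes)
        "subs      %2, %2, #2\n\t"
        "vmls.f32  q8, q10, q0\n\t"           // Jtr -= J_i * r_i
        "vmls.f32  q9, q11, q0\n\t"
        "vmls.f32  q8, q12, q1\n\t"           // Jtr -= J_{i+1} * r_{i+1}
        "vmls.f32  q9, q13, q1\n\t"
        "bgt       1b\n\t"
        "vstmia    %3, {q8-q9}\n\t"           // write the accumulator back once
        : "+r"(J), "+r"(residuals), "+r"(n)
        : "r"(Jtr)
        : "q0", "q1", "q8", "q9", "q10", "q11", "q12", "q13", "memory", "cc"
    );
}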
+5
source

The compiler optimizes the assembly it generates from the C code; it does not simply translate one code into the other.

What you are trying to do is out-optimize the compiler (uh oh). Do you at least know what assembly code the compiler generates for the C code above? Well, you should, if you want your assembly to be better.
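With a GCC-style toolchain, for example, one way to look at that output is to ask for an assembly listing (an illustrative invocation; the file name is a placeholder and the exact target flags depend on your toolchain and float ABI):

    g++ -O3 -S -march=armv7-a -mfpu=neon -mfloat-abi=softfp gauss_newton.cpp -o gauss_newton.s

Then read the generated .s file and compare it against your hand-written version.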

EDIT:

There is an excellent discussion on this subject: Why is ARM NEON not faster than plain C++?

+3

You are switching between NEON and VFP instructions. There is a penalty for this on both the Cortex-A8 and A9. Get rid of that VFP vmov.f32 instruction, and also make sure this code does not get inlined into places where VFP code is in use, unless there is a long run of such code to justify the pipeline context switch.
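Concretely, this is the relevant pair of instructions (my own side-by-side illustration; the vdup line is the approach used in the accepted answer above):

    // VFP move from the question: writes a VFP register, so the core pays
    // the NEON<->VFP pipeline transition penalty
    "vmov.f32  s16, %2\n\t"

    // NEON-only alternative: duplicate the core register holding the float
    // bits into all four lanes of q12 (works under softfp, where a float
    // bound to an "r" constraint arrives in an ARM core register)
    "vdup.32   q12, %2\n\t"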

+3

Does your C++ version really use floats? I can't tell, because you only gave the template and did not show which instantiation you used. It would be very strange for NEON to be much slower than VFP on a Cortex-A8 for this code, but for u32s I could see it possibly working out that way.

I don't know the ABI, but there may be some overhead in how the residual is passed (that is, in what the compiler has to do to get it into that %2 register). Try passing a pointer instead and using vld1 of a single element - you can load just one float into NEON this way.
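A sketch of that suggestion (my own illustration - the pointer parameter and register choices are not from the original post): the residual is passed by address and broadcast-loaded with vld1, so it never touches a VFP register:

inline void GaussNewtonOperationJtr8x8_NEON(float Jtr[8], const float J[8], const float* residual)
{
    __asm__ volatile (
        "vld1.32   {d0[], d1[]}, [%2]\n\t"  // broadcast *residual into all lanes of q0
        "vldmia    %1, {q10-q11}\n\t"       // J
        "vldmia    %0, {q8-q9}\n\t"         // Jtr
        "vmls.f32  q8, q10, q0\n\t"         // Jtr -= J*residual
        "vmls.f32  q9, q11, q0\n\t"
        "vstmia    %0, {q8-q9}\n\t"
        :
        : "r"(Jtr), "r"(J), "r"(residual)
        : "q0", "q8", "q9", "q10", "q11", "memory"
    );
}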

You will get better performance from the arrays if you use 16-byte-aligned loads and stores, although you may have to play some games to get the inputs to work that way. Unfortunately, you will never get really great performance from this, because you do not avoid most of the latency of the vmls instruction, which is long (due to chaining the NEON multiply and add pipelines end to end). It is made worse because the dependent instruction is a store, which needs its input early in the NEON pipeline. Ideally you would perform several of these operations at once, interleaving several instances together - as many as you can fit into registers; a sketch follows below.
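A sketch combining those two suggestions (my own arrangement, not from the original post - the function name is hypothetical, the :128 alignment hints require the arrays to be 16-byte aligned, and a single shared residual is used for brevity): two independent Jtr blocks are interleaved so the second pair of vmls issues while the first is still in the multiply-add pipeline:

inline void GaussNewtonOperationJtr8x8x2_NEON(float Jtr0[8], float Jtr1[8],
                                              const float J0[8], const float J1[8],
                                              const float* residual)
{
    __asm__ volatile (
        "vld1.32   {d0[], d1[]}, [%4]\n\t"   // broadcast *residual into q0
        "vld1.32   {d20-d23}, [%2:128]\n\t"  // J0   (16-byte-aligned load)
        "vld1.32   {d24-d27}, [%3:128]\n\t"  // J1
        "vld1.32   {d16-d19}, [%0:128]\n\t"  // Jtr0
        "vld1.32   {d28-d31}, [%1:128]\n\t"  // Jtr1
        "vmls.f32  q8,  q10, q0\n\t"         // interleave the two blocks so
        "vmls.f32  q14, q12, q0\n\t"         // independent vmls instructions
        "vmls.f32  q9,  q11, q0\n\t"         // fill the multiply-add pipeline
        "vmls.f32  q15, q13, q0\n\t"         // while earlier ones are in flight
        "vst1.32   {d16-d19}, [%0:128]\n\t"
        "vst1.32   {d28-d31}, [%1:128]\n\t"
        :
        : "r"(Jtr0), "r"(Jtr1), "r"(J0), "r"(J1), "r"(residual)
        : "q0", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15", "memory"
    );
}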

+1
