Is Neon on Android limited by memory access?

I wrote Neon code to process single-precision floating-point arrays on Android, specifically on a Samsung S4, and found that my Neon routines seem to be limited by access to the array data. For reference, here is the snippet:

Neon

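/* load through vld1q_f32 instead of casting float* to float32x4_t* */
float32x4_t vey = vld1q_f32(&ey[i][j]);
float32x4_t vca = vld1q_f32(&caey[i][j]);
float32x4_t vcb = vld1q_f32(&cbey[i][j]);
m1 = vmulq_f32(vey, vca);                                      /* caey * ey */
m2 = vsubq_f32(vld1q_f32(&hz[i-1][j]), vld1q_f32(&hz[i][j])); /* hz[i-1] - hz[i] */
m3 = vmulq_f32(vcb, m2);                                       /* cbey * (hz[i-1] - hz[i]) */
m4 = vaddq_f32(m1, m3);                                        /* sum of both terms */
vst1q_f32(&ey[i][j], m4);                                      /* same indexing as the loads */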

Sequential

ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * ( hz[i-1][j] - hz[i][j] ); 
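For comparison, here is a self-contained sketch of the same row update. It assumes the arrays are flat row-major float buffers of width je (the layout that the original vst1q_f32(&ey[i*je+j], ...) call hints at), loads with vld1q_f32 rather than pointer casts, finishes any leftover elements with the scalar formula, and falls back to pure scalar code when compiled without Neon. The function name update_ey_row is hypothetical, and i must be at least 1 because row i-1 of hz is read.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: updates row i (i >= 1) of ey, all arrays flat,
   row-major, je floats per row. */
static void update_ey_row(float *ey, const float *caey, const float *cbey,
                          const float *hz, size_t i, size_t je)
{
    float *e = ey + i * je;
    const float *ca = caey + i * je;
    const float *cb = cbey + i * je;
    const float *hprev = hz + (i - 1) * je; /* hz[i-1][*] */
    const float *hcur = hz + i * je;        /* hz[i][*]   */
    size_t j = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 4 lanes per iteration; vld1q_f32 does not require 16-byte alignment */
    for (; j + 4 <= je; j += 4) {
        float32x4_t m1 = vmulq_f32(vld1q_f32(ca + j), vld1q_f32(e + j));
        float32x4_t m2 = vsubq_f32(vld1q_f32(hprev + j), vld1q_f32(hcur + j));
        float32x4_t m3 = vmulq_f32(vld1q_f32(cb + j), m2);
        vst1q_f32(e + j, vaddq_f32(m1, m3));
    }
#endif
    /* scalar tail (or the whole row when Neon is unavailable) */
    for (; j < je; ++j)
        e[j] = ca[j] * e[j] + cb[j] * (hprev[j] - hcur[j]);
}
```

Calling update_ey_row for i = 1 .. ie-1 reproduces the sequential loop; row 0 is left untouched, as in the original.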

I built it on the phone itself, using both C4droid (gcc) and AIDE-JNI. The Neon code above runs slightly slower than the sequential equivalent. When I replace the array data with dummy constant floats, the code runs almost 4 times faster than the sequential version working on the real arrays, although it of course produces meaningless results; this confirms that the performance problem is related to data access. My equivalent SSE and AVX code on other platforms gives good speedups.
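A rough count of memory traffic versus arithmetic per element of ey illustrates why a kernel like this tends to be memory-bound. It assumes every operand comes from main memory once per update and ignores cache reuse (for example, row i of hz being read again as hz[i-1] for the next row), so it is an upper-bound sketch rather than a measurement:

```latex
\underbrace{5 \times 4\,\text{B}}_{\texttt{ey, caey, cbey, hz[i-1], hz[i]}}
+ \underbrace{1 \times 4\,\text{B}}_{\text{store } \texttt{ey}}
= 24\,\text{B per element},
\qquad
\underbrace{2\,\text{mul} + 1\,\text{sub} + 1\,\text{add}}_{4\ \text{flops}}
\;\Rightarrow\;
\frac{4\ \text{flops}}{24\,\text{B}} = \frac{1}{6}\ \text{flop/B}
```

Neon cuts the arithmetic instruction count by roughly 4x but leaves the 24 bytes per element unchanged, so if the loop is already limited by memory bandwidth, vectorizing alone would not be expected to help much. Replacing the arrays with constants removes that traffic, which would be consistent with the near-4x difference described above.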

I also tried using 1D arrays and __builtin_prefetch, without improvement.
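For reference, a common way to apply __builtin_prefetch (GCC/Clang: address, rw flag, locality) to a streaming loop like this one is to request data a fixed distance ahead, once per cache line rather than once per element. PF_DIST and PF_STRIDE are hypothetical values (PF_STRIDE assumes 64-byte cache lines) and would need tuning on the actual device; update_ey_prefetch is likewise a hypothetical name.

```c
#include <stddef.h>

#define PF_DIST   64 /* hypothetical: how many floats ahead to prefetch */
#define PF_STRIDE 16 /* hypothetical: floats per 64-byte cache line */

/* Scalar ey update for one row of flat arrays of length n.
   hprev points at hz row i-1, hcur at hz row i. */
static void update_ey_prefetch(float *e, const float *ca, const float *cb,
                               const float *hprev, const float *hcur, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        /* issue prefetches once per cache line, and only inside the arrays */
        if (j % PF_STRIDE == 0 && j + PF_DIST < n) {
            __builtin_prefetch(e + j + PF_DIST, 1);  /* rw = 1: will be written */
            __builtin_prefetch(ca + j + PF_DIST);    /* default: read, locality 3 */
            __builtin_prefetch(cb + j + PF_DIST);
            __builtin_prefetch(hprev + j + PF_DIST);
            __builtin_prefetch(hcur + j + PF_DIST);
        }
        e[j] = ca[j] * e[j] + cb[j] * (hprev[j] - hcur[j]);
    }
}
```

Prefetching only hides latency; it cannot raise the available bandwidth, so if the loop is bandwidth-bound it may give little or no gain.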

Is there some limitation on memory access for Neon on Android?

