Do I get a performance bonus if I use assembler instructions for math operations instead of C?

Question

I have a loop in my application that performs mathematical calculations (multiplications and additions).

I know some facts:

Android devices use ARMv6 processors and higher
ARMv6 does not support NEON

Will application performance on ARMv6 and higher increase if I use assembler math instructions instead of C math operations?

UPDATE

I need the loop with the math operations to run faster. Is using assembler instead of C the right way to do this?

UPDATE

I have this calculation:

Ry0 = (b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2 - a1a0 * Ry1 - a2a0 * Ry2);

This is the transfer function of a biquad filter.

Can I do this calculation faster with asm?

UPDATE

Buffer size: 192000
Variables: type float

+4

c assembly android arm

testCoder Dec 21 '12 at 20:20

4 answers

Infinite impulse response (IIR) filters are difficult to implement with high performance because each output element depends closely on the immediately preceding output element. This forces a latency from one output to the next. This chain of dependencies defeats the common high-performance techniques (such as SIMD, strip mining, and superscalar execution).

Working in assembly is not a good approach to this at the start. At some point, working in assembly might help. However, you have a fundamental problem: you cannot produce a new output until you have completed the previous output, multiplied it by a coefficient, and added the results of additional arithmetic. Therefore, the best you can do with this formulation is to produce one output as often as the processor can do a multiplication and an addition from start to finish, even assuming that the other work can be done in parallel.

Mathematically, you can rewrite the IIR so that each output depends on outputs further back and on inputs, rather than on the immediately preceding output. This uses more arithmetic, but it allows more of the arithmetic to be done in parallel, thereby providing higher throughput.
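As an illustration of such a rewrite (one possible transformation, sometimes called scattered look-ahead, and not necessarily the one the answer had in mind): multiplying the numerator and denominator of the biquad by (1 - a1 z^-1 + a2 z^-2) gives a recurrence in which y[n] depends only on y[n-2] and y[n-4]. The function names and the zero initial state below are assumptions made for this sketch; the coefficients are taken as already divided by a0.

```c
#include <assert.h>
#include <stddef.h>

/* Direct form: y[n] needs y[n-1], so all outputs form one serial chain. */
void biquad_direct(const float *x, float *y, size_t n,
                   float b0, float b1, float b2, float a1, float a2)
{
    float x1 = 0.0f, x2 = 0.0f, y1 = 0.0f, y2 = 0.0f;

    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        float out = b0 * x[i] + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x[i];
        y2 = y1; y1 = out;
        y[i] = out;
    }
}

/* Look-ahead form: y[n] needs only y[n-2] and y[n-4], so even and odd
   outputs form two independent chains that can overlap in the pipeline. */
void biquad_lookahead(const float *x, float *y, size_t n,
                      float b0, float b1, float b2, float a1, float a2)
{
    /* Numerator (b0 + b1 z^-1 + b2 z^-2) * (1 - a1 z^-1 + a2 z^-2). */
    const float n0 = b0;
    const float n1 = b1 - a1 * b0;
    const float n2 = b2 - a1 * b1 + a2 * b0;
    const float n3 = a2 * b1 - a1 * b2;
    const float n4 = a2 * b2;
    /* Denominator 1 + (2*a2 - a1^2) z^-2 + a2^2 z^-4. */
    const float d2 = 2.0f * a2 - a1 * a1;
    const float d4 = a2 * a2;
    float x1 = 0.0f, x2 = 0.0f, x3 = 0.0f, x4 = 0.0f;
    float y1 = 0.0f, y2 = 0.0f, y3 = 0.0f, y4 = 0.0f;

    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        float out = n0 * x[i] + n1 * x1 + n2 * x2 + n3 * x3 + n4 * x4
                  - d2 * y2 - d4 * y4;
        x4 = x3; x3 = x2; x2 = x1; x1 = x[i];
        y4 = y3; y3 = y2; y2 = y1; y1 = out;
        y[i] = out;
    }
}
```

The look-ahead version costs seven multiplications per output instead of five, and its results can differ from the direct form in the last few bits because the rounding happens in a different order. Whether it is actually faster on a given processor has to be measured.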

On an iPhone or other iOS device, you can simply call vDSP_deq22 in the Accelerate framework. Accelerate is an Apple library, so it is not available on Android. However, perhaps someone has implemented something similar.

One approach is to measure how many processor cycles each output takes (compute many outputs, divide the elapsed time by the number of outputs, and multiply by the processor clock frequency) and compare that to the latency, in cycles, of a multiplication followed by an addition (from the documentation for the processor model you are using). If the measured time is close to that latency, then this arithmetic cannot be performed any faster on this processor, and you must either accept it or find an alternative solution with different mathematics.
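A rough sketch of that measurement, assuming the standard clock() timer is precise enough for a long run; the function names and the filter_fn callback type are hypothetical, and cpu_hz has to be supplied from the device's specifications rather than measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: any routine that filters n samples in place. */
typedef void (*filter_fn)(float *buffer, size_t n);

/* Elapsed time divided by the number of outputs, times the clock frequency. */
double cycles_per_output(double seconds, size_t outputs, double cpu_hz)
{
    assert(outputs > 0);
    return seconds / (double)outputs * cpu_hz;
}

/* Runs the filter `repeats` times over the buffer and reports the
   estimated number of processor cycles spent per output sample. */
double measure_cycles_per_output(filter_fn f, float *buffer, size_t n,
                                 int repeats, double cpu_hz)
{
    clock_t start, stop;
    double seconds;

    assert(f != NULL && buffer != NULL && n > 0 && repeats > 0);
    start = clock();
    for (int r = 0; r < repeats; ++r) {
        f(buffer, n);
    }
    stop = clock();
    seconds = (double)(stop - start) / (double)CLOCKS_PER_SEC;
    return cycles_per_output(seconds, n * (size_t)repeats, cpu_hz);
}
```

The result is then compared with the multiply-plus-add latency from the processor documentation. Use enough repeats that the run lasts well beyond the timer resolution, and keep in mind that frequency scaling on a phone can distort the figure.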

+8

Eric Postpischil Dec 21 '12 at 20:50

You may be able to get extra speed by looking at what your compiler is doing, but this should be the last thing you do. First look at your algorithm and variable types.

Since your target is ARMv6, the first thing I would do is move from floating-point to fixed-point arithmetic. ARMv6 devices usually have either no floating-point hardware or only slow floating-point support. ARMv7 is usually better, but on ARM fixed-point arithmetic is usually much faster than a floating-point implementation.
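A minimal sketch of what a fixed-point version of the biquad could look like, assuming 16-bit samples and coefficients stored in Q2.14 format (range roughly -2 to 2, already divided by a0). The type and function names, the choice of formats, and the 64-bit accumulator are assumptions for illustration, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COEFF_FRAC_BITS 14  /* coefficients are Q2.14: 16384 represents 1.0 */

/* Hypothetical coefficient set, each value scaled by 2^14. */
typedef struct {
    int16_t b0, b1, b2, a1, a2;
} BiquadQ14;

/* Clamps a wide intermediate value to the 16-bit sample range. */
int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Filters n 16-bit samples in place using integer arithmetic only. */
void biquad_q15(int16_t *buffer, size_t n, const BiquadQ14 *c)
{
    const int64_t one = (int64_t)1 << COEFF_FRAC_BITS;
    const int64_t half = one / 2;
    int16_t x1 = 0, x2 = 0, y1 = 0, y2 = 0;

    assert(buffer != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        int16_t x = buffer[i];
        /* Five products can exceed 32 bits when summed, so use 64 bits. */
        int64_t acc = (int64_t)c->b0 * x + (int64_t)c->b1 * x1
                    + (int64_t)c->b2 * x2
                    - (int64_t)c->a1 * y1 - (int64_t)c->a2 * y2;
        /* Round to nearest and drop the fraction bits. Division is used
           because right-shifting a negative value is implementation-defined. */
        acc = (acc + (acc >= 0 ? half : -half)) / one;
        int16_t y = saturate16(acc);
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
        buffer[i] = y;
    }
}
```

Rounding each output to 16 bits adds quantization noise that the float version does not have, which matters for filters with poles close to the unit circle, so the output quality should be checked as well as the speed.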

+3

Leo Dec 21 '12 at 21:26

Android supports ARMv5TE and ARMv7-A. Read the NDK documentation on supported CPU architectures and ABIs, available at $NDK/docs/CPU-ARCH-ABIS.html.

ARMv5TE is the default and does not give you any hardware floating-point support; see the Android NDK page about it. You must add ARMv7-A support to your application to get the most out of the hardware.

ARMv6 is somewhere in between, and if you want to target these devices, you have to do some Android.mk tricks.

Currently, if you are writing a modern application, you are probably targeting new devices with an ARMv7-A processor with VFPv3 and NEON. If you just want to support ARMv6, you can build for ARMv5TE to cover those devices. If you want to take advantage of the little extra that ARMv6 offers, you will lose ARMv5TE support completely.

I compiled your simple line of code with NDK r8c, and it can produce a binary like the one shown below. The best thing ARM VFP offers here is the multiply-accumulate instruction, fmac, and the compiler can easily make use of it.

    00000000 <f>:
       0:   ee607aa2    fmuls   s15, s1, s5
       4:   ed9f7a05    flds    s14, [pc, #20]
       8:   ee407a07    fmacs   s15, s0, s14
       c:   ee417a03    fmacs   s15, s2, s6
      10:   ee417ae3    fnmacs  s15, s3, s7
      14:   eeb00a67    fcpys   s0, s15
      18:   ee020a44    fnmacs  s0, s4, s8
      1c:   e12fff1e    bx      lr

It might be better to split your statement into several pieces to allow dual issue, but you can do that in C.
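One way to split the statement in C, shown here as a sketch: the feed-forward terms use only inputs, so they are computed first, and the newest output Ry1 is applied last so that only one multiply-subtract sits on the chain from one output to the next. The function name, the parameter list, and the zero initial state are assumptions; the variable names follow the question.

```c
#include <assert.h>
#include <stddef.h>

/* Filters buffer[0..n-1] in place with the statement split into pieces. */
void biquad_split(float *buffer, size_t n,
                  float b0a0, float b1a0, float b2a0,
                  float a1a0, float a2a0)
{
    float Rx1 = 0.0f, Rx2 = 0.0f, Ry1 = 0.0f, Ry2 = 0.0f;

    assert(buffer != NULL);
    for (size_t index = 0; index < n; ++index) {
        float x = buffer[index];
        /* Feed-forward part: depends on inputs only, never on an output. */
        float ff = b0a0 * x + b1a0 * Rx1 + b2a0 * Rx2;
        /* Ry2 was already finished one iteration ago, so fold it in early. */
        float partial = ff - a2a0 * Ry2;
        /* Only this step has to wait for the most recent output. */
        float Ry0 = partial - a1a0 * Ry1;
        Rx2 = Rx1; Rx1 = x;
        Ry2 = Ry1; Ry1 = Ry0;
        buffer[index] = Ry0;
    }
}
```

Because floating-point addition is not associative, the reordered sum can differ from the one-line statement in the last bits. Whether the compiler keeps this ordering and whether it helps on a particular core should be confirmed by inspecting the generated code, as done above.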

You cannot work miracles just by using assembly; however, the compiler can also produce terrible code. GCC on ARM is not as good as GCC on Intel, especially at vectorization using NEON. It is always worth checking what the compiler produces if you need high-performance routines.

+1

auselen Dec 21 '12 at 22:11

Mats petersson · Accepted Answer · 2012-12-21T20:22:50+0000

Compilers are very good at their job, so unless you know what your compiler is producing and know that you can do better, probably not.

Without knowing exactly what your code is doing, it is impossible to give a better answer.

Edit: to summarize this discussion: the first step in improving performance is not to start writing assembler. The first step is to find the most efficient algorithm. Once that is done, you can look at coding it in assembler.
