Simple ADD / ADC ARM assembly not working

I have the following C and ASM versions of (supposedly) the same code. It loads two 128-bit ints, each represented by two 64-bit ints, into registers (first the 4 lower 32-bit words, then the 4 higher 32-bit words) and ADDS/ADCS them. It is fairly simple code, and the ARM/ST manuals actually give the same example with 96 bits (3 ADD/ADCs).

Both versions work for simple calls (repeatedly adding (1 << x++) or 1..x). But for a longer test suite the ARM assembly fails (the board hangs). At the moment I have no trap/debugging capability and cannot use printf() or the like to find the failing test. That is beside the point anyway: there must be some basic error in the ASM version, since the C version works as expected.

I don't get it; the code is simple enough and very close to the assembly the C compiler produces (without the branching). I tried the "memory" clobber (it should not be needed), I tried keeping the carry between the lower and upper 64 bits in a register and adding it later, using ADD(C).W, alignment, using LDR/STR instead of LDRD/STRD, etc. I assume the board hangs because some addition goes wrong and leads to a division by 0 or something like that. The GCC-generated ASM is below and uses a similar basic technique, so I don't see the problem.

I'm really looking for the fastest way to do the addition, rather than a fix for this particular code. It's a shame that you have to use fixed register names, because there is no constraint for specifying rX and rX+1. It is also impossible to use as many registers as GCC does, because it runs out of them during compilation.

    #include <stdint.h>

    typedef struct I128 {
        int64_t high;
        uint64_t low;
    } I128;

    I128 I128add(I128 a, const I128 b)
    {
    #if defined(USEASM) && defined(ARMx)
        __asm(
            "LDRD %%r2, %%r3, %[alo]\n"
            "LDRD %%r4, %%r5, %[blo]\n"
            "ADDS %%r2, %%r2, %%r4\n"
            "ADCS %%r3, %%r3, %%r5\n"
            "STRD %%r2, %%r3, %[alo]\n"
            "LDRD %%r2, %%r3, %[ahi]\n"
            "LDRD %%r4, %%r5, %[bhi]\n"
            "ADCS %%r2, %%r2, %%r4\n"
            "ADC %%r3, %%r3, %%r5\n"
            "STRD %%r2, %%r3, %[ahi]\n"
            : [alo] "+m" (a.low), [ahi] "+m" (a.high)
            : [blo] "m" (b.low), [bhi] "m" (b.high)
            : "r2", "r3", "r4", "r5", "cc"
        );
        return a;
    #else
        // faster to use temp than saving low and adding to a directly
        I128 r = {a.high + b.high, a.low + b.low};
        // check for overflow of low 64 bits, add carry to high
        // avoid conditionals
        //r.high += r.low < a.low || r.low < b.low;
        // actually gcc produces faster code with conditionals
        if(r.low < a.low || r.low < b.low) ++r.high;
        return r;
    #endif
    }

GCC output for the C version, using "armv7m-none-eabi-gcc-4.7.2 -O3 -ggdb -fomit-frame-pointer -falign-functions=16 -std=gnu99 -march=armv7e-m":

    b082        sub     sp, #8
    e92d 0ff0   stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
    a908        add     r1, sp, #32
    e881 000c   stmia.w r1, {r2, r3}
    e9dd 890e   ldrd    r8, r9, [sp, #56]   ; 0x38
    e9dd 670a   ldrd    r6, r7, [sp, #40]   ; 0x28
    e9dd 2308   ldrd    r2, r3, [sp, #32]
    e9dd 450c   ldrd    r4, r5, [sp, #48]   ; 0x30
    eb16 0a08   adds.w  sl, r6, r8
    eb47 0b09   adc.w   fp, r7, r9
    1912        adds    r2, r2, r4
    eb43 0305   adc.w   r3, r3, r5
    45bb        cmp     fp, r7
    bf08        it      eq
    45b2        cmpeq   sl, r6
    d303        bcc.n   8012c9a <I128add+0x3a>
    45cb        cmp     fp, r9
    bf08        it      eq
    45c2        cmpeq   sl, r8
    d204        bcs.n   8012ca4 <I128add+0x44>
    2401        movs    r4, #1
    2500        movs    r5, #0
    1912        adds    r2, r2, r4
    eb43 0305   adc.w   r3, r3, r5
    e9c0 2300   strd    r2, r3, [r0]
    e9c0 ab02   strd    sl, fp, [r0, #8]
    e8bd 0ff0   ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
    b002        add     sp, #8
    4770        bx      lr
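The C fallback compares r.low against both a.low and b.low, which is where the two cmp/cmpeq pairs in the listing above come from. For an unsigned 64-bit addition one comparison is enough, because a sum that wrapped is smaller than either operand. A minimal sketch of that variant follows; the name I128add_c is made up for this example, the high halves are added as unsigned to sidestep signed overflow, and the conversion back to int64_t assumes GCC's two's-complement behaviour. Whether it actually generates faster code has not been measured.

```c
#include <assert.h>
#include <stdint.h>

typedef struct I128 { int64_t high; uint64_t low; } I128;

/* Hypothetical variant of the C fallback with a single carry test. */
static I128 I128add_c(I128 a, I128 b)
{
    I128 r;
    r.low = a.low + b.low;              /* unsigned add, wraps modulo 2^64 */
    uint64_t carry = (r.low < a.low);   /* 1 exactly when the low add wrapped */
    /* Add the high halves as unsigned, then convert back
     * (implementation-defined, modulo 2^64 with GCC). */
    r.high = (int64_t)((uint64_t)a.high + (uint64_t)b.high + carry);
    return r;
}

/* Small self-check of the carry out of the low half. */
static void I128add_c_selfcheck(void)
{
    I128 a = { 0, UINT64_MAX }, b = { 0, 1 };
    I128 r = I128add_c(a, b);
    assert(r.high == 1 && r.low == 0);
}
```

Writing the carry as a 0/1 value that is added unconditionally also leaves the compiler free to choose between a branch and a flag-based sequence.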

My failing ASM version:

    b082        sub     sp, #8
    b430        push    {r4, r5}
    a902        add     r1, sp, #8
    e881 000c   stmia.w r1, {r2, r3}
    e9dd 2304   ldrd    r2, r3, [sp, #16]
    e9dd 4508   ldrd    r4, r5, [sp, #32]
    1912        adds    r2, r2, r4
    416b        adcs    r3, r5
    e9cd 2304   strd    r2, r3, [sp, #16]
    e9dd 2302   ldrd    r2, r3, [sp, #8]
    e9dd 4506   ldrd    r4, r5, [sp, #24]
    4162        adcs    r2, r4
    eb43 0305   adc.w   r3, r3, r5
    e9cd 2302   strd    r2, r3, [sp, #8]
    4604        mov     r4, r0
    c90f        ldmia   r1, {r0, r1, r2, r3}
    e884 000f   stmia.w r4, {r0, r1, r2, r3}
    4620        mov     r0, r4
    bc30        pop     {r4, r5}
    b002        add     sp, #8
    4770        bx      lr
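On the point about fixed register names: one option is to let the compiler do the loads and stores and hand the eight 32-bit words to the asm as register operands, so the asm body is only the ADDS/ADCS/ADCS/ADC chain. The sketch below has not been run on hardware. The name I128add_regs and the host fallback are assumptions for illustration only. The early-clobber modifier (&) is there so GCC does not place an input in a register that an earlier instruction of the chain has already overwritten.

```c
#include <assert.h>
#include <stdint.h>

typedef struct I128 { int64_t high; uint64_t low; } I128;

/* Hypothetical register-operand variant; the compiler picks the registers. */
static I128 I128add_regs(I128 a, I128 b)
{
    /* Split both operands into 32-bit words, least significant first. */
    uint32_t a0 = (uint32_t)a.low, a1 = (uint32_t)(a.low >> 32);
    uint32_t a2 = (uint32_t)(uint64_t)a.high;
    uint32_t a3 = (uint32_t)((uint64_t)a.high >> 32);
    uint32_t b0 = (uint32_t)b.low, b1 = (uint32_t)(b.low >> 32);
    uint32_t b2 = (uint32_t)(uint64_t)b.high;
    uint32_t b3 = (uint32_t)((uint64_t)b.high >> 32);
#if defined(__arm__)
    /* 32-bit ARM / Thumb-2 only: one carry chain across all four words. */
    __asm__("adds %0, %0, %4\n\t"
            "adcs %1, %1, %5\n\t"
            "adcs %2, %2, %6\n\t"
            "adc  %3, %3, %7"
            : "+&r"(a0), "+&r"(a1), "+&r"(a2), "+&r"(a3)
            : "r"(b0), "r"(b1), "r"(b2), "r"(b3)
            : "cc");
#else
    /* Host fallback: the same chain, with the carry kept in bit 32 of t. */
    uint64_t t = (uint64_t)a0 + b0;
    a0 = (uint32_t)t;
    t = (uint64_t)a1 + b1 + (t >> 32);
    a1 = (uint32_t)t;
    t = (uint64_t)a2 + b2 + (t >> 32);
    a2 = (uint32_t)t;
    t = (uint64_t)a3 + b3 + (t >> 32);
    a3 = (uint32_t)t;
#endif
    I128 r;
    r.low  = ((uint64_t)a1 << 32) | a0;
    r.high = (int64_t)(((uint64_t)a3 << 32) | a2);
    return r;
}

/* Small self-check: the carry must ripple through all four words. */
static void I128add_regs_selfcheck(void)
{
    I128 a = { 0, UINT64_MAX }, b = { 0, 1 };
    I128 r = I128add_regs(a, b);
    assert(r.high == 1 && r.low == 0);
}
```

Whether this beats the compiler's own output would need measuring on the target; it mainly removes the hard-coded r2..r5 and the "m" operands.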
1 answer

I did not get a hang with your code, but it did not work correctly either, and I don't know why. It was very easy, though, to fix up the compiler-generated code to handle the carry:

    I128 I128add(I128 a, const I128 b)
    {
        I128 r = {a.high + b.high, a.low + b.low};
        return r;
    }

becomes

    000001e4 <I128add>:
     1e4: b082        sub     sp, #8
     1e6: b4f0        push    {r4, r5, r6, r7}
     1e8: e9dd 4506   ldrd    r4, r5, [sp, #24]
     1ec: a904        add     r1, sp, #16
     1ee: e881 000c   stmia.w r1, {r2, r3}
     1f2: e9dd 230a   ldrd    r2, r3, [sp, #40]   ; 0x28
     1f6: 1912        adds    r2, r2, r4
     1f8: eb43 0305   adc.w   r3, r3, r5
     1fc: e9dd 6704   ldrd    r6, r7, [sp, #16]
     200: e9dd 4508   ldrd    r4, r5, [sp, #32]
     204: 1936        adds    r6, r6, r4
     206: eb47 0705   adc.w   r7, r7, r5
     20a: e9c0 6700   strd    r6, r7, [r0]
     20e: e9c0 2302   strd    r2, r3, [r0, #8]
     212: bcf0        pop     {r4, r5, r6, r7}
     214: b002        add     sp, #8
     216: 4770        bx      lr

fixed add

    .thumb_func
    .globl test2
    test2:
        sub sp, #8
        push {r4, r5, r6, r7}
        ldrd r4, r5, [sp, #24]
        add r1, sp, #16
        stmia r1, {r2, r3}
        ldrd r2, r3, [sp, #40]
        add r2, r4          @ low half; assembles to ADDS, so the carry flag is set
        adc r3, r5
        ldrd r6, r7, [sp, #16]
        ldrd r4, r5, [sp, #32]
        adc r6, r4          @ high half; ADC instead of a fresh ADDS keeps the carry chain going
        adc r7, r5
        strd r6, r7, [r0]
        strd r2, r3, [r0, #8]
        pop {r4, r5, r6, r7}
        add sp, #8
        bx lr

final result

    00000024 <test2>:
      24: b082        sub     sp, #8
      26: b4f0        push    {r4, r5, r6, r7}
      28: e9dd 4506   ldrd    r4, r5, [sp, #24]
      2c: a904        add     r1, sp, #16
      2e: c10c        stmia   r1!, {r2, r3}
      30: e9dd 230a   ldrd    r2, r3, [sp, #40]   ; 0x28
      34: 1912        adds    r2, r2, r4
      36: 416b        adcs    r3, r5
      38: e9dd 6704   ldrd    r6, r7, [sp, #16]
      3c: e9dd 4508   ldrd    r4, r5, [sp, #32]
      40: 4166        adcs    r6, r4
      42: 416f        adcs    r7, r5
      44: e9c0 6700   strd    r6, r7, [r0]
      48: e9c0 2302   strd    r2, r3, [r0, #8]
      4c: bcf0        pop     {r4, r5, r6, r7}
      4e: b002        add     sp, #8
      50: 4770        bx      lr

Note that there are fewer 32-bit Thumb-2 encodings. Unless you are on a Cortex-A with Thumb-2 support, instruction fetches from flash (Cortex-M) are slow. I see that you are trying to save the push and pop of two more registers, but it costs you more than it saves. You can take the above and still reorder the loads and stores to save those two registers.

Minimal testing so far. The printfs show the upper words being added to, which I did not see with your code. I'm still trying to work out the calling convention (please document your code more for us); it looks like r0 is set up by the caller as the place to store the result, and the rest is on the stack. I am using a Stellaris LaunchPad (Cortex-M4).
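To go beyond a handful of printfs, a candidate add routine can be compared against a slow bit-by-bit oracle over many pseudo-random inputs, returning the index of the first mismatch so it can be inspected without printf. This is only a sketch. Every name in it (I128add_oracle, I128add_first_mismatch, I128add_nocarry, next_rand) is invented for this example, and the generator is a plain xorshift, not anything from the original code.

```c
#include <assert.h>
#include <stdint.h>

typedef struct I128 { int64_t high; uint64_t low; } I128;
typedef I128 (*I128addFn)(I128, I128);

/* Slow reference: ripple-carry addition, one bit at a time. */
static I128 I128add_oracle(I128 a, I128 b)
{
    uint64_t aw[2] = { a.low, (uint64_t)a.high };
    uint64_t bw[2] = { b.low, (uint64_t)b.high };
    uint64_t out[2] = { 0, 0 };
    unsigned carry = 0;
    for (int i = 0; i < 128; ++i) {
        unsigned x = (unsigned)((aw[i / 64] >> (i % 64)) & 1u);
        unsigned y = (unsigned)((bw[i / 64] >> (i % 64)) & 1u);
        unsigned s = x + y + carry;
        out[i / 64] |= (uint64_t)(s & 1u) << (i % 64);
        carry = s >> 1;
    }
    I128 r = { (int64_t)out[1], out[0] };
    return r;
}

/* Carry-less add, like the simplified C above; used as a known-bad case. */
static I128 I128add_nocarry(I128 a, I128 b)
{
    I128 r = { (int64_t)((uint64_t)a.high + (uint64_t)b.high), a.low + b.low };
    return r;
}

/* xorshift-style generator; the state must be non-zero and never becomes zero. */
static uint64_t next_rand(uint64_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Returns -1 if all n cases agree with the oracle, else the first failing index. */
static long I128add_first_mismatch(I128addFn candidate, long n)
{
    uint64_t s = UINT64_C(0x9E3779B97F4A7C15);
    for (long i = 0; i < n; ++i) {
        I128 a, b;
        a.high = (int64_t)next_rand(&s);
        /* Every fourth case forces a carry out of the low half. */
        a.low  = (i % 4 == 0) ? UINT64_MAX : next_rand(&s);
        b.high = (int64_t)next_rand(&s);
        b.low  = next_rand(&s);
        I128 want = I128add_oracle(a, b);
        I128 got  = candidate(a, b);
        if (got.high != want.high || got.low != want.low)
            return i;
    }
    return -1;
}

/* The harness accepts the oracle itself and rejects the carry-less add. */
static void I128add_harness_selfcheck(void)
{
    assert(I128add_first_mismatch(I128add_oracle, 100) == -1);
    assert(I128add_first_mismatch(I128add_nocarry, 100) == 0);
}
```

On the board, passing the real routine (for example I128add_first_mismatch(I128add, 100000)) and storing the result in a global gives a single value to inspect when no printf is available.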
