Test, test, test, disassemble, disassemble, disassemble.
Start simple, with a single native-size integer.
unsigned int fun_one(unsigned int a)
{
    return ((a & 7) + 1);
}
unsigned int fun_two(unsigned int *a)
{
    return ((*a & 7) + 1);
}
With no optimization, pass by reference costs you one extra instruction: the load that fetches the value at that address so you can do something with it.
00000000:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e24dd00c sub sp, sp, #12
c: e50b0008 str r0, [fp, #-8]
10: e51b3008 ldr r3, [fp, #-8]
14: e2033007 and r3, r3, #7
18: e2833001 add r3, r3, #1
1c: e1a00003 mov r0, r3
20: e28bd000 add sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
0000002c:
2c: e52db004 push {fp} ; (str fp, [sp, #-4]!)
30: e28db000 add fp, sp, #0
34: e24dd00c sub sp, sp, #12
38: e50b0008 str r0, [fp, #-8]
3c: e51b3008 ldr r3, [fp, #-8]
40: e5933000 ldr r3, [r3]
44: e2033007 and r3, r3, #7
48: e2833001 add r3, r3, #1
4c: e1a00003 mov r0, r3
50: e28bd000 add sp, fp, #0
54: e49db004 pop {fp} ; (ldr fp, [sp], #4)
58: e12fff1e bx lr
With optimization, -O1 through -O3 all gave the same result, and pass by reference still costs you the instruction that loads the value.
00000000:
0: e2000007 and r0, r0, #7
4: e2800001 add r0, r0, #1
8: e12fff1e bx lr
0000000c:
c: e5900000 ldr r0, [r0]
10: e2000007 and r0, r0, #7
14: e2800001 add r0, r0, #1
18: e12fff1e bx lr
And it will keep going like that for just about any single item of a size you can pass directly. With 64-bit integers you still burn the extra instruction and memory cycles loading from the reference into registers before you can operate on the value. An array you cannot really pass by value at all, can you? But a struct you can, and reaching into the struct, by reference or not, is likely to require some addressing.
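To check the 64-bit case yourself, something like the following could be compiled and disassembled the same way as the test above. The names fun_one64 and fun_two64 are mine, made up for illustration, and I have not shown the resulting listing here because it depends on your compiler and target.

```c
#include <stdint.h>

/* Hypothetical 64-bit variants of fun_one/fun_two, meant for disassembly. */

/* Pass by value: the argument arrives directly (in a register pair on
   32-bit ARM, as far as the usual calling convention goes). */
uint64_t fun_one64(uint64_t a)
{
    return ((a & 7) + 1);
}

/* Pass by reference: the value has to be loaded from memory first. */
uint64_t fun_two64(const uint64_t *a)
{
    return ((*a & 7) + 1);
}
```

Both functions compute the same thing, so any difference in the disassembly comes down to how the argument gets into registers.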
typedef struct
{
unsigned int a;
unsigned int b;
char c [4];
} ruct;
unsigned int fun_one(ruct a)
{
    return ((a.c[3] & 7) + 1);
}
unsigned int fun_two(ruct *a)
{
    return ((a->c[3] & 7) + 1);
}
With no optimization we start out with a tie at 12 instructions each. I would have to look at it more closely to decide whether one burns more clock cycles than the other.
00000000:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e24dd014 sub sp, sp, #20
c: e24b3010 sub r3, fp, #16
10: e8830007 stm r3, {r0, r1, r2}
14: e55b3005 ldrb r3, [fp, #-5]
18: e2033007 and r3, r3, #7
1c: e2833001 add r3, r3, #1
20: e1a00003 mov r0, r3
24: e28bd000 add sp, fp, #0
28: e49db004 pop {fp} ; (ldr fp, [sp], #4)
2c: e12fff1e bx lr
00000030:
30: e52db004 push {fp} ; (str fp, [sp, #-4]!)
34: e28db000 add fp, sp, #0
38: e24dd00c sub sp, sp, #12
3c: e50b0008 str r0, [fp, #-8]
40: e51b3008 ldr r3, [fp, #-8]
44: e5d3300b ldrb r3, [r3, #11]
48: e2033007 and r3, r3, #7
4c: e2833001 add r3, r3, #1
50: e1a00003 mov r0, r3
54: e28bd000 add sp, fp, #0
58: e49db004 pop {fp} ; (ldr fp, [sp], #4)
5c: e12fff1e bx lr
But look what happens with optimization. The struct was sized so that it fits in the argument registers when passed by value.
00000000:
0: e24dd010 sub sp, sp, #16
4: e28d3004 add r3, sp, #4
8: e8830007 stm r3, {r0, r1, r2}
c: e5dd100f ldrb r1, [sp, #15]
10: e2010007 and r0, r1, #7
14: e2800001 add r0, r0, #1
18: e28dd010 add sp, sp, #16
1c: e12fff1e bx lr
00000020:
20: e5d0100b ldrb r1, [r0, #11]
24: e2010007 and r0, r1, #7
28: e2800001 add r0, r0, #1
2c: e12fff1e bx lr
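Unfortunately, gcc didn't do very well with the optimization here. The struct already arrives in registers (the stm above stores r0, r1 and r2, so c is sitting in r2), and this could have been done with one shift-and-mask instruction on that register, an add and the bx lr: three instructions, beating pass by reference.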
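To make the missed optimization concrete, here is a rough C model of what the register-only version would compute. This is a sketch under assumptions: fun_one_regs and pack_le are hypothetical helpers of mine, unsigned int is assumed to be 32 bits, and the target is assumed to be little-endian ARM, so that c[3] lands in the top byte of the third argument register.

```c
typedef struct
{
    unsigned int a;
    unsigned int b;
    char c[4];
} ruct;

/* Straightforward version, same as the example above. */
unsigned int fun_one(ruct a)
{
    return ((a.c[3] & 7) + 1);
}

/* Hypothetical helper: builds the 32-bit word that would hold c[0..3]
   in a register on a little-endian target. */
unsigned int pack_le(const char c[4])
{
    return (unsigned int)(unsigned char)c[0]
         | ((unsigned int)(unsigned char)c[1] << 8)
         | ((unsigned int)(unsigned char)c[2] << 16)
         | ((unsigned int)(unsigned char)c[3] << 24);
}

/* Hypothetical model of the register-only computation: r2 stands in for
   the third argument register. No stack traffic is needed; a shift and
   mask pulls out the low three bits of c[3], then the add. */
unsigned int fun_one_regs(unsigned int r0, unsigned int r1, unsigned int r2)
{
    (void)r0; /* a is unused */
    (void)r1; /* b is unused */
    return (((r2 >> 24) & 7) + 1);
}
```

Whether a given compiler and architecture version can actually fold the shift and mask into a single instruction is something you would have to confirm by disassembling.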
You need to know the compiler and its calling interface. Does it pass arguments in registers, or always on the stack? If registers are used, what does it do when your arguments need more space than the reserved registers can hold: does it fill up the registers and then use the stack, or does it use only the stack and no registers? Or does it pass a pointer to memory holding the argument, pass-by-reference style, but in such a way that the passed-by-value object is still protected from the callee?
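One way to probe those questions is to compile a struct that clearly cannot fit in the argument registers and read the disassembly of both the callee and the caller. The sketch below uses a made-up 8-word struct (big, sum_by_value and sum_by_ref are hypothetical names); where the words actually travel is up to your compiler's ABI, but the C-level guarantee it illustrates holds everywhere: the by-value callee can only modify its own copy.

```c
/* Hypothetical struct of 8 words, deliberately larger than the four
   argument registers (r0-r3) used on ARM. */
typedef struct
{
    unsigned int w[8];
} big;

/* Pass by value: the callee works on a private copy, so this write
   is never seen by the caller. */
unsigned int sum_by_value(big s)
{
    unsigned int i, sum = 0;
    s.w[0] = 0;
    for (i = 0; i < 8; i++)
        sum += s.w[i];
    return sum;
}

/* Pass by reference: the same write lands in the caller's object. */
unsigned int sum_by_ref(big *s)
{
    unsigned int i, sum = 0;
    s->w[0] = 0;
    for (i = 0; i < 8; i++)
        sum += s->w[i];
    return sum;
}
```

Disassembling a caller of sum_by_value should show how your compiler splits the struct between registers and stack, or whether it hands over a pointer to a copy.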
You also have to look beyond the individual functions at how much memory and register work has to happen to prepare for the function call. Pass by reference for the struct example would be a single load or immediate to fill one register with the address of the struct. Pass by value of the struct, in the case of ARM, would be a single instruction to load the three registers with the structure, but that potentially takes three clock cycles (or 6 or 2, depending on the AMBA/AXI bus). Other processors might cost you three instructions plus a data clock cycle for each register. So even if gcc had done a better job optimizing the pass-by-value struct example, pass by reference might still edge it out by a clock cycle or two, but that depends heavily on what the code in the calling function looks like. To really know, you have to test it by timing the code, and disassemble it to understand why it gets faster or slower as you tune it.
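As a starting point for that kind of timing, here is a minimal sketch assuming a hosted system where the standard clock() is available; on bare metal you would read a hardware timer instead. The time_by_value and time_by_ref helpers are hypothetical names of mine. Note that with optimization on, the compiler may inline fun_one and fun_two into the loops, so for a meaningful measurement you would probably want them in a separate source file, and you should disassemble the loop to see what you are really timing.

```c
#include <time.h>

typedef struct
{
    unsigned int a;
    unsigned int b;
    char c[4];
} ruct;

unsigned int fun_one(ruct a)
{
    return ((a.c[3] & 7) + 1);
}

unsigned int fun_two(ruct *a)
{
    return ((a->c[3] & 7) + 1);
}

/* Calls fun_one n times, stores the accumulated result in *out and
   returns the elapsed clock() ticks. The volatile sink keeps the
   compiler from discarding the calls altogether. */
clock_t time_by_value(unsigned long n, unsigned int *out)
{
    ruct s = { 0, 0, { 0, 0, 0, 0 } };
    volatile unsigned int sink = 0;
    unsigned long i;
    clock_t start = clock();
    for (i = 0; i < n; i++) {
        s.c[3] = (char)(i & 7);   /* vary the input each pass */
        sink += fun_one(s);
    }
    *out = sink;
    return clock() - start;
}

/* Same loop, but passing the struct by reference. */
clock_t time_by_ref(unsigned long n, unsigned int *out)
{
    ruct s = { 0, 0, { 0, 0, 0, 0 } };
    volatile unsigned int sink = 0;
    unsigned long i;
    clock_t start = clock();
    for (i = 0; i < n; i++) {
        s.c[3] = (char)(i & 7);
        sink += fun_two(&s);
    }
    *out = sink;
    return clock() - start;
}
```

Run both with a large n, compare the returned tick counts, then change one thing at a time (optimization level, struct size, caller code) and disassemble again to see why the numbers moved.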