Will passing a variable of an integral type to functions by reference be more efficient than by value?

I know the usual advice: a variable of any integral type, such as int, double, long double, etc., should be passed to a function by value. But I'm curious whether, from the assembly point of view (in performance or in space), there is any situation where passing by reference is more efficient for a variable whose type is larger than a pointer, such as long double, which is 8 bytes on my platform while pointers are 4 bytes.

+6
c++ performance function assembly parameter-passing
4 answers

Passing a pointer/reference to an integer value larger than the native pointer size may well be locally optimal, but it's hard to say whether it will be globally optimal. It largely depends on how the callee uses the value. If it really is an integer and the callee treats it as one, the value will probably be loaded into one or more registers at some point anyway (for the program to perform arithmetic on it, for example), which adds overhead in the callee for dereferencing the pointer. If the callee is inlined by an optimizing compiler, it is possible that the compiler will simply pass the integer value split across two registers. If, however, the callee cannot be inlined (for example, if it belongs to a third-party API), then the compiler cannot perform this kind of inlining, and passing a pointer may indeed be more efficient. That said, you are unlikely to find a library whose functions accept an integer by reference unless the intent is for the callee to modify the caller's value, which introduces a different set of issues.

Most of the time, a modern optimizing compiler will come close to an optimal solution, taking all of these things into account, and it is usually best for the programmer not to try to pre-empt the compiler with premature optimization. In fact, doing so can lead to less efficient code.

The most sensible thing in most cases is to write your code in the way that best communicates your intent (pass by value for "value" types, unless the argument is, in C# terminology, semantically an "out" or "ref" parameter) and worry about efficiency only if there is a clear performance bottleneck.
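
As a rough illustration of letting the signature carry the intent, here is a minimal sketch. The names clamp_to_byte, try_halve and check_intent_example are made up for this example and do not come from the question; the only point is that a plain input is taken by value, while a parameter the callee is meant to write to is taken by reference.

```cpp
#include <cassert>

// Plain input: pass by value, the callee works on its own copy.
int clamp_to_byte(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

// Semantically an "out" parameter: the reference signals that the
// callee writes to the caller's variable.
bool try_halve(int v, int& out)
{
    if (v % 2 != 0) return false;  // odd input: `out` is left untouched
    out = v / 2;
    return true;
}

// Small usage check for the two hypothetical functions above.
void check_intent_example()
{
    int result = -1;
    assert(clamp_to_byte(300) == 255);
    assert(try_halve(10, result) && result == 5);
    assert(!try_halve(7, result) && result == 5);
}
```

A reader of the call site can tell from the signature alone which argument may change, which is the whole reason for choosing a reference here; performance does not enter into it.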

+5

Test, test, test, disassemble, disassemble, disassemble.

A simple case first: a native-sized integer.

 unsigned int fun_one(unsigned int a)
 {
     return ((a & 7) + 1);
 }

 unsigned int fun_two(unsigned int *a)
 {
     return ((*a & 7) + 1);
 }

Without optimization, passing by reference costs one additional instruction: a load of the value from that address so that something can be done with it.

 00000000 <fun_one>:
    0: e52db004 push {fp}  ; (str fp, [sp, #-4]!)
    4: e28db000 add fp, sp, #0
    8: e24dd00c sub sp, sp, #12
    c: e50b0008 str r0, [fp, #-8]
   10: e51b3008 ldr r3, [fp, #-8]
   14: e2033007 and r3, r3, #7
   18: e2833001 add r3, r3, #1
   1c: e1a00003 mov r0, r3
   20: e28bd000 add sp, fp, #0
   24: e49db004 pop {fp}  ; (ldr fp, [sp], #4)
   28: e12fff1e bx lr

 0000002c <fun_two>:
   2c: e52db004 push {fp}  ; (str fp, [sp, #-4]!)
   30: e28db000 add fp, sp, #0
   34: e24dd00c sub sp, sp, #12
   38: e50b0008 str r0, [fp, #-8]
   3c: e51b3008 ldr r3, [fp, #-8]
   40: e5933000 ldr r3, [r3]
   44: e2033007 and r3, r3, #7
   48: e2833001 add r3, r3, #1
   4c: e1a00003 mov r0, r3
   50: e28bd000 add sp, fp, #0
   54: e49db004 pop {fp}  ; (ldr fp, [sp], #4)
   58: e12fff1e bx lr

With optimization, -O1 through -O3 gave the same result, and you still pay for the extra instruction that loads the value.

  
 00000000 <fun_one>:
    0: e2000007 and r0, r0, #7
    4: e2800001 add r0, r0, #1
    8: e12fff1e bx lr

 0000000c <fun_two>:
    c: e5900000 ldr r0, [r0]
   10: e2000007 and r0, r0, #7
   14: e2800001 add r0, r0, #1
   18: e12fff1e bx lr

And this holds for anything of a size that can be passed in registers. With 64-bit integers you still burn an extra load instruction and the memory cycles needed to get the value from the reference into registers before you can operate on it. Arrays you cannot really pass by value anyway, can you? But a structure you can, and accessing a member of the structure, by reference or not, will probably require some addressing.

 typedef struct
 {
     unsigned int a;
     unsigned int b;
     char c[4];
 } ruct;

 unsigned int fun_one(ruct a)
 {
     return ((a.c[3] & 7) + 1);
 }

 unsigned int fun_two(ruct *a)
 {
     return ((a->c[3] & 7) + 1);
 }

Without optimization, both versions come to 12 instructions. I would have to look at it more closely to decide whether one burns more clock cycles than the other.

 00000000 <fun_one>:
    0: e52db004 push {fp}  ; (str fp, [sp, #-4]!)
    4: e28db000 add fp, sp, #0
    8: e24dd014 sub sp, sp, #20
    c: e24b3010 sub r3, fp, #16
   10: e8830007 stm r3, {r0, r1, r2}
   14: e55b3005 ldrb r3, [fp, #-5]
   18: e2033007 and r3, r3, #7
   1c: e2833001 add r3, r3, #1
   20: e1a00003 mov r0, r3
   24: e28bd000 add sp, fp, #0
   28: e49db004 pop {fp}  ; (ldr fp, [sp], #4)
   2c: e12fff1e bx lr

 00000030 <fun_two>:
   30: e52db004 push {fp}  ; (str fp, [sp, #-4]!)
   34: e28db000 add fp, sp, #0
   38: e24dd00c sub sp, sp, #12
   3c: e50b0008 str r0, [fp, #-8]
   40: e51b3008 ldr r3, [fp, #-8]
   44: e5d3300b ldrb r3, [r3, #11]
   48: e2033007 and r3, r3, #7
   4c: e2833001 add r3, r3, #1
   50: e1a00003 mov r0, r3
   54: e28bd000 add sp, fp, #0
   58: e49db004 pop {fp}  ; (ldr fp, [sp], #4)
   5c: e12fff1e bx lr

But look what happens with optimization. The structure was sized so that it fit in registers when passed by value.

  
 00000000 <fun_one>:
    0: e24dd010 sub sp, sp, #16
    4: e28d3004 add r3, sp, #4
    8: e8830007 stm r3, {r0, r1, r2}
    c: e5dd100f ldrb r1, [sp, #15]
   10: e2010007 and r0, r1, #7
   14: e2800001 add r0, r0, #1
   18: e28dd010 add sp, sp, #16
   1c: e12fff1e bx lr

 00000020 <fun_two>:
   20: e5d0100b ldrb r1, [r0, #11]
   24: e2010007 and r0, r1, #7
   28: e2800001 add r0, r0, #1
   2c: e12fff1e bx lr

Unfortunately, gcc didn't optimize this one very well. It could have done a shift and an AND directly on r2 (the register that arrives holding c), then an add and bx lr: three instructions, beating the pass by reference.

You need to know the compiler and the calling convention: does it pass arguments in registers, or always on the stack? If registers are used, what happens when your arguments need more space than the reserved registers can hold: does it fill the registers and then use the stack, or does it use only the stack and no registers? Or does it pass a pointer to memory holding the argument, in pass-by-reference style, but in such a way that the passed-in value is protected?
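
Whatever the convention does with the bytes, the observable behavior has to stay the same: a by-value argument is a private copy. Here is a small sketch of that difference, reusing the ruct layout from the example above. The names bump_copy, bump_in_place and check_value_protection are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Same layout as the `ruct` structure used in this answer.
struct ruct
{
    unsigned int a;
    unsigned int b;
    char c[4];
};

// By value: the callee modifies its private copy only.
unsigned int bump_copy(ruct r)
{
    r.a += 1;
    return r.a;
}

// By pointer: the callee modifies the caller's object.
unsigned int bump_in_place(ruct* r)
{
    r->a += 1;
    return r->a;
}

void check_value_protection()
{
    ruct s = {1, 2, {0, 0, 0, 0}};
    assert(bump_copy(s) == 2);
    assert(s.a == 1);              // caller's object is unchanged
    assert(bump_in_place(&s) == 2);
    assert(s.a == 2);              // caller's object was modified
}
```

Even when an ABI passes a large structure through a hidden pointer, it has to preserve the first behavior, which is why a copy is made somewhere along the way.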

You also need to look beyond the individual functions at how much memory and register work has to happen to prepare the function call. Passing by reference for the structure example would be a single load or immediate to fill one register with the address of the structure. Passing the structure by value, in the case of ARM, would be a single instruction to load three registers with the structure, but that potentially takes three clock cycles (or 6, or 2, depending on the AMBA/AXI bus). Other processors might cost you three instructions plus a data clock cycle for each register. So even if gcc had done a better job of optimizing the pass-by-value structure example, pass by reference might still edge it out by a clock cycle or two, but that depends heavily on what the code looks like in the calling function. To really know, you have to test it: time the code and disassemble it to understand why it gets faster or slower as you tune it.
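
For the timing half of that advice, a minimal sketch of a harness could look like the following. It reuses fun_one and fun_two from the first example; the NOINLINE macro, time_ns, Timing and compare_calls are names invented here, and the noinline attribute is an assumption that only holds for GCC and Clang. Treat the numbers as a hint for where to look in the disassembly, not as a verdict.

```cpp
#include <cassert>
#include <chrono>

// Assumption: GCC/Clang attribute; on other compilers the calls may be inlined.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

NOINLINE unsigned int fun_one(unsigned int a)  { return ((a & 7) + 1); }
NOINLINE unsigned int fun_two(unsigned int* a) { return ((*a & 7) + 1); }

// Runs `body` once and returns the elapsed wall-clock time in nanoseconds.
template <typename F>
double time_ns(F&& body)
{
    auto start = std::chrono::steady_clock::now();
    body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}

struct Timing
{
    double value_ns;
    double ref_ns;
    unsigned long long sum_value;  // results are summed so the calls are not dead code
    unsigned long long sum_ref;
};

Timing compare_calls(unsigned int iterations)
{
    Timing t{};
    t.value_ns = time_ns([&] {
        for (unsigned int i = 0; i < iterations; ++i)
            t.sum_value += fun_one(i);
    });
    t.ref_ns = time_ns([&] {
        for (unsigned int i = 0; i < iterations; ++i) {
            unsigned int v = i;        // fun_two needs an addressable object
            t.sum_ref += fun_two(&v);
        }
    });
    return t;
}
```

Calling compare_calls with a large iteration count and comparing value_ns against ref_ns gives a first impression; repeating the run several times and at different optimization levels is what makes the comparison meaningful.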

+4

In the general case, if the size of a machine word (and, as a rule, the size of a pointer) is smaller than the size of an integer, then passing by reference will be faster.

For example, on a 32-bit machine, passing a uint64_t by reference will be slightly faster than passing it by value, because passing by value means copying the integer, which takes two register loads. Passing by reference takes only one register load.
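
The two signatures being compared would look roughly like this. This is only a sketch: low_bits_by_value, low_bits_by_ref, kWiderThanPointer and check_same_result are names made up for the example, and whether the by-reference version actually wins depends on the target and on what the callee does with the value.

```cpp
#include <cassert>
#include <cstdint>

// True on platforms where a 64-bit integer is wider than a pointer,
// which is the case described above (e.g. a typical 32-bit target).
constexpr bool kWiderThanPointer = sizeof(std::uint64_t) > sizeof(void*);

// By value: the caller hands over the full 64-bit value.
std::uint64_t low_bits_by_value(std::uint64_t v)
{
    return (v & 7) + 1;
}

// By reference: the caller hands over an address and the callee loads through it.
std::uint64_t low_bits_by_ref(const std::uint64_t& v)
{
    return (v & 7) + 1;
}

void check_same_result()
{
    const std::uint64_t big = 0xFFFFFFFFFFFFFFFFull;
    assert(low_bits_by_value(big) == 8);
    assert(low_bits_by_ref(big) == 8);
}
```

Both functions return the same result for every input; only the way the argument reaches the callee differs, and kWiderThanPointer tells you whether your platform is in the case this answer is talking about.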

That said, for the most part this is unlikely to make a noticeable difference in performance unless you are calling the function millions of times in a tight loop, in which case the function should probably be inlined if at all possible.

+3

If you are passing a value that is only used a few function calls deep, then it might be more efficient to pass by reference(-to-const-T). If that is the case, though, you are exposing implementation details for the sake of premature "optimization."

I suspect that in most cases you will lose significant performance because of optimizations the compiler can no longer make (since the variable has had its address taken and the pointer has escaped):

  • The variable cannot live in a register.
  • The variable must persist until the end of the last function call in its scope (i.e., its storage cannot be reused for another variable).
  • The variable can change across function calls, which means the compiler must forget everything it might have known about it between calls (for example, that it is positive or zero).

For example (I use pointer syntax to make things more explicit, but the same is true for references):

 long long x = 0, y = 1;
 for (int i = 0; i < 10; i++) {
     x = f(&x);
     g(&x);
     y = f(&y);
     g(&y);
 }

Pretty standard, but f() and g() can be nasty:

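 #include <stdio.h>
 #include <stdlib.h>

 long long f(long long * x) {
     static long long * old;
     if (old) { (*old)++; *x += *old; }  /* may touch a previously seen variable */
     return ++*x;
 }

 long long g(long long * x) {
     static long long * old;
     if (old == x) { abort(); }          /* behavior depends on the address itself */
     printf("%lld\n", *x);
     return *x;
 }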

You can fix some of the problems by using long long const * (so the functions cannot modify the value, but they can still read from it ...).
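
To see why const only fixes part of the problem, here is a small sketch. The functions remember and read_remembered are hypothetical stand-ins for a callee that keeps the pointer it was given; they are not part of the example above.

```cpp
#include <cassert>

// Hypothetical callee state: the last const pointer it was handed.
static const long long* remembered = nullptr;

// Keeps the pointer. const stops writes through it, not keeping it around.
void remember(const long long* p) { remembered = p; }

// Reads through the stashed pointer on a later call.
long long read_remembered() { return remembered ? *remembered : 0; }

void check_const_pointer_still_observes()
{
    long long x = 1;
    remember(&x);                     // the address of x escapes here
    x = 42;                           // the caller changes x afterwards
    assert(read_remembered() == 42);  // the callee sees the new value
    remembered = nullptr;             // do not leave a dangling pointer behind
}
```

Because a later call can still read x through the stashed pointer, the compiler has to keep x in memory and up to date across calls, so the variable still cannot live purely in a register.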

You can work around them by putting the function call inside a block and passing a reference to a copy of the variable:

 { long long tmp = x; x = f(&tmp); } 
0