Why is gcc reordering local variables in a function?

I wrote a C program that just reads/writes a large array. I compiled the program with gcc -O0 program.c -o program. Out of curiosity, I disassembled the program using objdump -S.

The code and assembly of the read_array and write_array are attached at the end of this question.

I am trying to understand how gcc compiles a function. I used // to add my comments and questions.

Here is part of the beginning of the assembly code for the write_array() function:

    4008c1: 48 89 7d e8             mov    %rdi,-0x18(%rbp)   // this is the first parameter of the function
    4008c5: 48 89 75 e0             mov    %rsi,-0x20(%rbp)   // this is the second parameter of the function
    4008c9: c6 45 ff 01             movb   $0x1,-0x1(%rbp)    // comparing with the source code, I think this is the `char tmp` variable
    4008cd: c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)    // this should be the `int i` variable

I do not understand:

1) char tmp is clearly declared after int i in the write_array function. Why does gcc reorder the memory locations of these two local variables?

2) From the offsets, int i is at -0x8(%rbp) and char tmp is at -0x1(%rbp), which would mean that int i takes up 7 bytes? This is rather strange, because int i should be 4 bytes on an x86-64 machine, shouldn't it? My guess is that gcc is doing some kind of alignment?

3) I find gcc's choices here quite interesting. Are there any good docs/books that explain how gcc works? (The third question may be off topic; if you think so, just ignore it. I'm just trying to find out if there is some shortcut for learning the basic mechanisms gcc uses when compiling. :-))

The following is a snippet of the function code:

    #define CACHE_LINE_SIZE 64

    static inline void read_array(char* array, long size)
    {
        int i;
        char tmp;
        for ( i = 0; i < size; i += CACHE_LINE_SIZE ) {
            tmp = array[i];
        }
        return;
    }

    static inline void write_array(char* array, long size)
    {
        int i;
        char tmp = 1;
        for ( i = 0; i < size; i += CACHE_LINE_SIZE ) {
            array[i] = tmp;
        }
        return;
    }

The following is a fragment of the disassembled code for write_array , from gcc -O0:

    00000000004008bd <write_array>:
      4008bd: 55                      push   %rbp
      4008be: 48 89 e5                mov    %rsp,%rbp
      4008c1: 48 89 7d e8             mov    %rdi,-0x18(%rbp)
      4008c5: 48 89 75 e0             mov    %rsi,-0x20(%rbp)
      4008c9: c6 45 ff 01             movb   $0x1,-0x1(%rbp)
      4008cd: c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
      4008d4: eb 13                   jmp    4008e9 <write_array+0x2c>
      4008d6: 8b 45 f8                mov    -0x8(%rbp),%eax
      4008d9: 48 98                   cltq
      4008db: 48 03 45 e8             add    -0x18(%rbp),%rax
      4008df: 0f b6 55 ff             movzbl -0x1(%rbp),%edx
      4008e3: 88 10                   mov    %dl,(%rax)
      4008e5: 83 45 f8 40             addl   $0x40,-0x8(%rbp)
      4008e9: 8b 45 f8                mov    -0x8(%rbp),%eax
      4008ec: 48 98                   cltq
      4008ee: 48 3b 45 e0             cmp    -0x20(%rbp),%rax
      4008f2: 7c e2                   jl     4008d6 <write_array+0x19>
      4008f4: 5d                      pop    %rbp
      4008f5: c3                      retq
2 answers

Even with -O0, gcc does not emit definitions for static inline functions if there is no caller. When there is a caller, the function is not actually inlined at -O0: instead, gcc emits a stand-alone definition, which is presumably what your disassembly shows.
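A minimal sketch of that situation (my own example, not from the question): a static inline function plus a caller. At -O0 gcc emits a stand-alone definition of read_array and a call to it, and that stand-alone copy is what objdump shows. The buffer name and size are made up for illustration.

    #define CACHE_LINE_SIZE 64

    static inline void read_array(char *array, long size)
    {
        int i;
        char tmp;
        for (i = 0; i < size; i += CACHE_LINE_SIZE)
            tmp = array[i];      /* read one byte per cache line */
    }

    int main(void)
    {
        static char buf[1 << 20];        /* hypothetical 1 MiB buffer */
        read_array(buf, sizeof(buf));    /* caller forces a definition to be emitted */
        return 0;
    }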


Are you using a really old version of gcc? gcc 4.6.4 lays out the variables on the stack in this order, but 4.7.3 and later use a different order:

    movb    $1, -5(%rbp)    #, tmp
    movl    $0, -4(%rbp)    #, i

In your asm, they are stored in initialization order, not declaration order, but I assume that is just coincidence, since the order changed with gcc 4.7. Also, adding initializers such as int i=1; does not change the storage order, which completely torpedoes that theory.

Remember that gcc is designed around a series of transformations from source to asm, so -O0 does not mean "no optimization". Think of -O0 as doing only a small subset of what -O3 does. There is no option that tries to produce as literal a translation from source to asm as possible.

However gcc decides which order to allocate space in, the resulting layout makes sense:

  • char at rbp-1 : this is the first available location that can hold a char. If another char needed storage, it could go at rbp-2.

  • int at rbp-8 : since the 4 bytes from rbp-4 up to rbp-1 are not all free, the next available naturally aligned location is rbp-8.

Or with gcc 4.7 and later, -4 is the first available spot for the int, and -5 is the next byte below that.
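To put both layouts in one place, here is the question's write_array annotated with the offsets described above (a sketch; the exact offsets are not guaranteed and depend on the gcc version):

    #define CACHE_LINE_SIZE 64

    static inline void write_array(char *array, long size)
    {
        int  i;         /* gcc 4.6: -0x8(%rbp)    gcc 4.7+: -0x4(%rbp) */
        char tmp = 1;   /* gcc 4.6: -0x1(%rbp)    gcc 4.7+: -0x5(%rbp) */
        for (i = 0; i < size; i += CACHE_LINE_SIZE)
            array[i] = tmp;
    }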


RE: space saving:

It is true that putting the char at -5 makes the lowest touched address %rsp-5 instead of %rsp-8, but this does not save anything.

The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, %rsp+8 (the start of the stack arguments) is 16B-aligned on function entry, before you push anything.) The only way for %rbp-8 to touch a new page or cache line that %rbp-5 does not, is for the stack to be less than 4B-aligned. This is extremely unlikely, even in 32-bit code.

As for how much stack is "allocated" or "owned" by a function: in the AMD64 SysV ABI, the function "owns" the 128B red zone below %rsp (that size was chosen because a single-byte displacement can reach down to -128). Signal handlers and any other asynchronous users of the user-space stack leave the red zone untouched, so a function can write to memory below %rsp without moving %rsp. Therefore, from this point of view it does not matter how much of the red zone we use; the chance of a signal handler running out of stack is unaffected.
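A minimal sketch of what the red zone buys a leaf function (my example, assuming x86-64 SysV and locals that fit in 128 bytes): gcc can keep the spilled locals below the stack pointer without ever subtracting from %rsp, which is exactly why the write_array disassembly above stores to -0x18(%rbp) and friends with no "sub ..., %rsp".

    /* Leaf function: no calls, small locals -> they can live in the red zone. */
    static long sum_pair(long a, long b)
    {
        long t = a + b;    /* at -O0 this spill slot sits below %rsp, inside the red zone */
        return t;
    }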

In 32-bit code, where there is no red zone, gcc reserves stack space with sub $16, %esp for either ordering (try -m32 on Godbolt). So again, it does not matter whether we use 5 or 8 bytes, because the reservation is made in units of 16.

When there are many char and int variables, gcc packs the chars into 4B groups instead of losing space to fragmentation, even when the declarations are interleaved:

    void many_vars(void)
    {
        char tmp = 1;
        int i = 1;
        char t2 = 2;
        int i2 = 2;
        char t3 = 3;
        int i3 = 3;
        char t4 = 4;
    }

compiles with gcc 4.6.4 -O0 -fverbose-asm to the following (the verbose-asm comments mark which store is for which variable, so the compiler's asm output is nicer to read than a disassembly):

    pushq   %rbp    #
    movq    %rsp, %rbp      #,
    movb    $1, -4(%rbp)    #, tmp
    movl    $1, -16(%rbp)   #, i
    movb    $2, -3(%rbp)    #, t2
    movl    $2, -12(%rbp)   #, i2
    movb    $3, -2(%rbp)    #, t3
    movl    $3, -8(%rbp)    #, i3
    movb    $4, -1(%rbp)    #, t4
    popq    %rbp    #
    ret

I think the variables go in either forward or reverse order of declaration, depending on the gcc version, at -O0.


I made a version of your read_array function that still works with optimization enabled:

    // assumes that size is non-zero.  Use a while() instead of do{}while()
    // if you want extra code to check for that case.
    void read_array_good(const char* array, size_t size)
    {
        const volatile char *vp = array;
        do {
            (void) *vp;   // this counts as accessing the volatile memory, with gcc/clang at least
            vp += CACHE_LINE_SIZE/sizeof(vp[0]);
        } while (vp < array + size);
    }

This compiles (gcc 5.3 -O3 -march=haswell) to:

        addq    %rdi, %rsi      # array, D.2434
    .L11:
        movzbl  (%rdi), %eax    # MEM[(const char *)array_1], D.2433
        addq    $64, %rdi       #, array
        cmpq    %rsi, %rdi      # D.2434, array
        jb      .L11    #,
        ret

Casting an expression to void is the canonical way of telling the compiler that a value is intentionally unused; for example, to suppress unused-variable warnings you can write (void)my_unused_var;.

For gcc and clang, doing this through a volatile pointer dereference produces a memory access, without needing the tmp variable. The C standard is quite unspecific about what constitutes an access to a volatile object, so this is probably not completely portable. Another way is to XOR the values you read into an accumulator and then store it to a global. As long as you are not using whole-program optimization, the compiler does not know that nothing reads the global, so it cannot optimize the computation away.
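A hedged sketch of that second technique (my code, not the post's; the global's name "sink" is made up): XOR every byte read into an accumulator and store it to a global at the end.

    #include <stddef.h>

    #define CACHE_LINE_SIZE 64

    char sink;   /* global: without whole-program optimization the compiler
                    must assume something else reads this */

    void read_array_xor(const char *array, size_t size)
    {
        char acc = 0;
        size_t i;
        for (i = 0; i < size; i += CACHE_LINE_SIZE)
            acc ^= array[i];     /* every load feeds the accumulator */
        sink = acc;              /* the visible store keeps the loop alive */
    }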

See the vmtouch source code for an example of this second technique. (It actually uses a global variable for the accumulator, which makes for clunky code. That hardly matters here, though, since it touches whole pages, not just cache lines, so the cost of TLB misses and page faults dominates, even with a memory store/reload in the loop-carried dependency chain.)


I tried and failed to write something that gcc or clang would compile to a function with no prologue, like this hand-written version (which assumes that size is initially non-zero):

    read_array_handtuned:
    .L0:
        mov     al, byte [rdi]  ; maybe not ideal on AMD: partial-reg writes have a false dep on the old value.
                                ; Hopefully the load can still start, and just the merging is serialized?
        add     rdi, 64
        sub     rsi, 64
        jae     .L0             ; or ja, depending on what semantics you want
        ret

Godbolt compiler explorer with all my attempts and experiments.

I can get that if the loop-exit condition is a je, from a loop like do { ... } while( size -= CL_SIZE );. But I can't get the compilers to emit optimal code that checks for unsigned borrow from the subtraction: they want to subtract and then cmp $-64 / jb to detect the underflow. Apparently it is too hard to get compilers to use the carry flag produced by the add or sub directly :/
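For concreteness, here is a sketch of that do{}while loop shape in C (my code, assuming size is a non-zero multiple of CACHE_LINE_SIZE, otherwise the size_t counter wraps around):

    #include <stddef.h>

    #define CACHE_LINE_SIZE 64

    void read_array_je(const volatile char *vp, size_t size)
    {
        do {
            (void)*vp;                        /* volatile read of one byte per line */
            vp += CACHE_LINE_SIZE;
        } while (size -= CACHE_LINE_SIZE);    /* keep looping while the subtraction result is non-zero */
    }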

It's also easy to get compilers to emit a 4-instruction loop, but not without a prologue: for example, computing the end pointer (array + size) and incrementing the pointer until it is greater than or equal to it.


For local variables stored on the stack, the order of their addresses depends on the direction in which the stack grows. You can refer to the question "Does the stack grow up or down?" for more information.
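If you want to see what the compiler actually picked, a small experiment (my sketch; the output depends on the compiler, version and options, as discussed in the other answer, so it is an observation rather than a guarantee):

    #include <stdio.h>

    int main(void)
    {
        int  i   = 0;
        char tmp = 1;
        /* print the addresses of the two locals to see which one sits lower on the stack */
        printf("&i   = %p\n", (void *)&i);
        printf("&tmp = %p\n", (void *)&tmp);
        return 0;
    }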

This is rather strange, because int i should be 4 bytes on an x86-64 machine, shouldn't it?

If my memory serves me correctly, the size of int on an x86-64 machine is 8. You can confirm this by writing a test program that prints sizeof(int).
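The suggested test program is tiny; for reference, on the LP64 data model used by x86-64 Linux/SysV, int is 4 bytes and long is 8:

    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)  = %zu\n", sizeof(int));
        printf("sizeof(long) = %zu\n", sizeof(long));
        return 0;
    }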

