Even at -O0, gcc does not emit definitions for static inline functions unless there is a caller. In that case the function is not actually inlined: gcc emits a stand-alone definition and calls it. So I think that's what your disassembly shows.
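For example, a minimal sketch (the function names are mine, not from the question):

// At -O0, gcc emits nothing at all for a static inline function with no callers.
// With a caller, it emits a normal stand-alone definition plus a call to it,
// rather than inlining the body.
static inline int add_one(int x) { return x + 1; }

int caller(int v) {
    return add_one(v);   // at -O0 this becomes `call add_one`, not an inlined copy
}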
Are you using a really old version of gcc? gcc 4.6.4 stores the vars on the stack in this order, but 4.7.3 and later use a different order:
movb $1, -5(%rbp)
In your asm they are stored in order of initialization rather than declaration, but I think that's a coincidence, since the order changed with gcc 4.7. Also, adding initializers like int i=1; does not change the placement order, which pretty much torpedoes that theory.
Remember that gcc is designed around a series of transformations from source to asm through internal representations, so -O0 does not mean "no optimization": think of -O0 as doing less of what -O3 does, not as doing something fundamentally different. There is no option that tries to make as literal a translation as possible from the source to asm.
Once gcc decides on an order, it allocates space for the variables:
char at rbp-1: this is the first available location that can hold a char. If another char needed storage, it could go at rbp-2.
int at rbp-8: since the 4 bytes from rbp-1 to rbp-4 are not all free, the next available naturally aligned location is rbp-8.
Or, with gcc 4.7 and later, -4 is the first available spot for the int, and -5 is the next free byte below that for the char.
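For concreteness, here is a minimal sketch of the kind of function these offsets apply to (the names and initializer values are mine); the comments just restate the two layouts described above:

/* Allocation orders described above (offsets as in the text, not guaranteed):
   - char allocated first: char at -1(%rbp), int at the next aligned slot -8(%rbp)
   - int allocated first (gcc 4.7+): int at -4(%rbp), char at -5(%rbp) */
void layout_demo(void) {
    char c = 1;
    int  i = 2;
}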
RE: space saving:
It is true that placing the char at -5 makes the lowest touched address %rsp-5 instead of %rsp-8, but it does not save anything.
The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, %rsp+8 (the start of the stack args) is 16B-aligned on entry to the function, before you push anything.) The only way %rbp-8 could touch a new page or cache line that %rbp-5 doesn't is if the stack were less than 4B-aligned, which is extremely unlikely even in 32-bit code.
As for how much stack a function "allocates" or "owns": in the AMD64 SysV ABI, a function "owns" the 128B red zone below %rsp (that size was chosen because a one-byte displacement can reach -128). Signal handlers, and any other asynchronous users of the user-space stack, avoid clobbering the red zone, so a function can write to memory below %rsp without first decrementing %rsp. So from this point of view it does not matter how much of the red zone we use; the odds of a signal handler overflowing the stack are unaffected.
In 32-bit code, where there is no red zone, gcc reserves stack space with sub $16, %esp for either ordering (try -m32 on Godbolt). So again, it doesn't matter whether we use 5 or 8 bytes, because we reserve in units of 16.
When there are many char and int variables, gcc packs the chars into groups of 4B instead of losing space to fragmentation, even when the declarations are interleaved. For example, this function:
void many_vars(void) {
    char tmp = 1;  int i  = 1;
    char t2  = 2;  int i2 = 2;
    char t3  = 3;  int i3 = 3;
    char t4  = 4;
}
compiles (with gcc 4.6.4 -O0 -fverbose-asm, which helpfully labels which store belongs to which variable; this is one reason compiler asm output is preferable to a disassembly) to:
pushq   %rbp            #
movq    %rsp, %rbp      #,
movb    $1, -4(%rbp)    #, tmp
movl    $1, -16(%rbp)   #, i
movb    $2, -3(%rbp)    #, t2
movl    $2, -12(%rbp)   #, i2
movb    $3, -2(%rbp)    #, t3
movl    $3, -8(%rbp)    #, i3
movb    $4, -1(%rbp)    #, t4
popq    %rbp            #
ret
I think at -O0 the variables go in either forward or reverse order of declaration, depending on the gcc version.
I made a version of your read_array function that works even with optimization enabled:
// assumes that size is non-zero.  Use a while() instead of do{}while()
// if you want extra code to check for that case.
void read_array_good(const char* array, size_t size)
{
    const volatile char *vp = array;
    do {
        (void) *vp;   // this counts as accessing the volatile memory, with gcc/clang at least
        vp += CACHE_LINE_SIZE/sizeof(vp[0]);
    } while (vp < array+size);
}
This compiles (gcc 5.3 -O3 -march=haswell) to:

addq    %rdi, %rsi
Casting an expression to void is the canonical way of telling the compiler that a value is used; for example, to suppress warnings about an unused variable you can write (void)my_unused_var;.
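A tiny (hypothetical) complete example of that idiom; the parameter name is mine:

void example(int my_unused_var) {
    (void)my_unused_var;   /* generates no code; just marks the value as used,
                              silencing the unused-variable/-parameter warning */
}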
With gcc and clang, doing this through a dereference of a volatile pointer generates the memory access without needing a tmp variable. The C standard is very unspecific about what constitutes an access to something that is volatile, so this probably isn't perfectly portable. Another way is to XOR the values you read into an accumulator and then store the accumulator to a global variable. Unless you use whole-program optimization, the compiler can't know that nothing reads the global, so it can't optimize the computation away.
See the vmtouch source code for an example of this second technique. (It actually uses a global variable for the accumulator, which makes for clumsy code. Of course it doesn't make much difference there, since vmtouch touches whole pages rather than just cache lines, so the cost of TLB misses and page faults dwarfs the cost of having a memory store/reload in the loop-carried dependency chain.)
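A sketch of this second technique, assuming a 64-byte cache line; this is my own illustration, not vmtouch's actual code:

#include <stddef.h>

#define CACHE_LINE_SIZE 64      /* assumption: typical x86 line size */

char g_sink;                    /* some other file might read this, for all the compiler knows */

void read_array_xor(const char *array, size_t size) {
    char acc = 0;
    for (size_t i = 0; i < size; i += CACHE_LINE_SIZE)
        acc ^= array[i];        /* one load per cache line, XORed into the accumulator */
    g_sink = acc;               /* store to a global so the loads aren't dead code */
}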
I tried, and could not write something that gcc or clang will compile into a loop with no prologue (assuming size is initially non-zero), like:
read_array_handtuned:
.L0:
    mov     al, byte [rdi]   ; maybe not ideal on AMD: partial-reg writes have a false dep on the old value.
                             ; Hopefully the load can still start, and just the merging is serialized?
    add     rdi, 64
    sub     rsi, 64
    jae     .L0              ; or ja, depending on what semantics you want
I can get this if the loop-exit condition is a je inside the loop, with a loop like do { ... } while( size -= CL_SIZE );, but I can't get the compilers to emit the optimal code that checks for unsigned borrow from the subtraction. They want to subtract and then cmp $-64 / jb to detect the wraparound. It's apparently too hard to get compilers to just check the carry flag after the sub to detect the borrow. :/
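In C, the loop shape I mean is something like this (CL_SIZE and the names are mine; it assumes size is a non-zero multiple of CL_SIZE):

#include <stddef.h>

#define CL_SIZE 64

void read_array_sub(const char *array, size_t size) {
    const volatile char *vp = array;
    do {
        (void)*vp;              /* touch one byte per cache line */
        vp += CL_SIZE;
    } while (size -= CL_SIZE);  /* loop while the remaining size is non-zero */
}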
It's also easy to get compilers to make a 4-insn loop, but not without a prologue: e.g. compute an end pointer (array + size) and increment the pointer until it is greater than or equal to that.