This is a known compiler limitation; see my comments on the question. IDK why it exists; it may be difficult for compilers to decide what they can do without spilling when they have not finished saving regs.
Pulling the early-out check into a wrapper is often useful when it's small enough to inline.
It looks like modern gcc can actually sidestep this limitation sometimes.
Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).
Code on Godbolt; note that I added -std=gnu++14, which is required for clang to compile your code.
void square_inlinewrapper(int* a, int* b) {
square() itself compiles to the same thing, calling a private clone that has the bulk of the code. Recursive calls from within the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.
Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still turn one of the recursive calls into a loop, but the other one just calls the big function again.
Clang 3.9 and icc17 don't clone the function either, so you should write the inlinable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).
You might want to name the wrapper square and rename just the main body to a private name (for example, static void square_impl).
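A minimal sketch of that layout, not your actual code: the body of square_impl here is a hypothetical stand-in (it squares every element of [a, b) by recursing on both halves), and the a >= b test is an assumed early-out condition. Both functions are static only to keep the sketch in a single file. The point is just the shape: a tiny wrapper holding the check, and a private function holding the work.

```cpp
#include <cassert>

// Private main body (hypothetical stand-in for the real work).
static void square_impl(int* a, int* b);

// Wrapper: only the cheap early-out check lives here, so it is small
// enough to inline into every caller, including the recursive ones.
static inline void square(int* a, int* b) {
    if (a >= b)
        return;            // assumed early-out: empty range, nothing to do
    square_impl(a, b);     // otherwise pay for the call and the register saves
}

static void square_impl(int* a, int* b) {
    assert(a < b);         // the wrapper guarantees a non-empty range
    int* mid = a + (b - a) / 2;
    *mid *= *mid;          // square the middle element
    // Recurse through the wrapper so empty sub-ranges return
    // before any push/pop work happens.
    square(a, mid);        // elements before mid
    square(mid + 1, b);    // elements after mid
}
```

Callers only ever see square(); whether the compiler actually inlines the check still depends on the compiler and options, so check the asm.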
Peter Cordes