We can see exactly what GCC does under the hood by compiling both cases with -S:
    g++-4.6 -std=c++0x test.cc -S -fverbose-asm
And then using diff to compare the outputs:
    diff -rNu move.s ret.s |c++filt
    --- move.s  2015-05-21 14:00:49.097524035 +0100
    +++ ret.s   2015-05-21 14:00:40.021510019 +0100
    @@ -79,23 +79,13 @@
         .cfi_offset 5, -8
         movl    %esp, %ebp  #,
         .cfi_def_cfa_register 5
    -    subl    $2097176, %esp  #,
    -    leal    -2097160(%ebp), %eax    #, tmp60
    +    subl    $24, %esp   #,
    +    movl    8(%ebp), %eax   # .result_ptr, tmp59
         movl    $2097152, %edx  #, tmp61
         movl    %edx, 8(%esp)   # tmp61,
         movl    $0, 4(%esp) #,
         movl    %eax, (%esp)    # tmp60,
         call    memset  #
    -    leal    -2097160(%ebp), %eax    #, tmp64
    -    movl    %eax, (%esp)    # tmp64,
    -    call    std::remove_reference<std::bitset<16777215u>&>::type&& std::move<std::bitset<16777215u>&>(std::bitset<16777215u>&)   #
    -    movl    %eax, %edx  #, D.21547
    -    movl    8(%ebp), %eax   # .result_ptr, tmp65
    -    movl    $2097152, %ecx  #, tmp68
    -    movl    %ecx, 8(%esp)   # tmp68,
    -    movl    %edx, 4(%esp)   # tmp67,
    -    movl    %eax, (%esp)    # tmp66,
    -    call    memcpy  #
         movl    8(%ebp), %eax   # .result_ptr,
         leave
         .cfi_restore 5
(Lines marked with + exist only in the return-by-value case; lines marked with - exist only in the std::move case.)
The std::move case involves much more stack pointer manipulation, and some very large numbers: subl $2097176, %esp reserves just over 2 MiB of stack for the local bitset. Critically, it then ends with a call to memcpy, which copies the result back onto the stack through .result_ptr.
My analysis is that in the return-by-value case another optimization takes place: the otherwise unused temporary inside main is elided entirely for return by value, but not for the std::move case.
We can confirm this further by running the same analysis on the return-by-value case with -O0, which turns off all optimizations, and seeing what happens:
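The test.cc source is not reproduced here, so the following is only a sketch of what the two compared cases might look like, inferred from the std::move<std::bitset<16777215u>&> symbol in the diff. The function names and the template parameter N are assumptions made for illustration; the original presumably used a fixed size of 16777215 bits.

```cpp
#include <bitset>
#include <cstddef>
#include <utility>

// Hypothetical reconstruction of the "ret.s" case: return the local by value.
// The compiler may construct `b` directly in the caller-provided result slot,
// which matches the memset on .result_ptr seen in the "+" lines.
template <std::size_t N>
std::bitset<N> make_by_value() {
    std::bitset<N> b;  // default-constructed: all bits zero
    return b;
}

// Hypothetical reconstruction of the "move.s" case: std::move on the local.
// The return expression is no longer a plain local name, so `b` stays in this
// function's own stack frame and is then copied into the result slot
// (the memcpy seen in the "-" lines).
template <std::size_t N>
std::bitset<N> make_by_move() {
    std::bitset<N> b;
    return std::move(b);
}
```

With N = 16777215 the local in make_by_move occupies about 2 MiB of the function's own frame, which is where the large subl operand would come from.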
    diff -Nru noopt.s ret.s
    --- noopt.s 2015-05-21 14:06:14.798028762 +0100
    +++ ret.s   2015-05-21 14:00:40.021510019 +0100
    @@ -3,7 +3,7 @@
     # compiled by GNU C version 4.6.4, GMP version 5.1.3, MPFR version 3.1.2-p3, MPC version 1.0.1
     # GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
     # options passed:  -imultilib . -imultiarch i386-linux-gnu -D_GNU_SOURCE
    -# test.cc -mtune=generic -march=i686 -O0 -std=c++0x -fverbose-asm
    +# test.cc -mtune=generic -march=i686 -std=c++0x -fverbose-asm
     # -fstack-protector
     # options enabled:  -fasynchronous-unwind-tables -fauto-inc-dec
     # -fbranch-count-reg -fcommon -fdelete-null-pointer-checks -fdwarf2-cfi-asm
    @@ -79,23 +79,13 @@
         .cfi_offset 5, -8
         movl    %esp, %ebp  #,
         .cfi_def_cfa_register 5
    -    subl    $2097176, %esp  #,
    -    leal    -2097160(%ebp), %eax    #, tmp60
    +    subl    $24, %esp   #,
    +    movl    8(%ebp), %eax   # .result_ptr, tmp59
         movl    $2097152, %edx  #, tmp61
         movl    %edx, 8(%esp)   # tmp61,
         movl    $0, 4(%esp) #,
         movl    %eax, (%esp)    # tmp60,
         call    memset  #
    -    leal    -2097160(%ebp), %eax    #, tmp64
    -    movl    %eax, (%esp)    # tmp64,
    -    call    _ZSt4moveIRSt6bitsetILj16777215EEEONSt16remove_referenceIT_E4typeEOS4_    #
    -    movl    %eax, %edx  #, D.21547
    -    movl    8(%ebp), %eax   # .result_ptr, tmp65
    -    movl    $2097152, %ecx  #, tmp68
    -    movl    %ecx, 8(%esp)   # tmp68,
    -    movl    %edx, 4(%esp)   # tmp67,
    -    movl    %eax, (%esp)    # tmp66,
    -    call    memcpy  #
         movl    8(%ebp), %eax   # .result_ptr,
         leave
         .cfi_restore 5
Again, with optimizations disabled, the return-by-value case shows the same stack pointer manipulation and copying. So it looks like you have a stack overflow in both cases, but in the return-by-value case your test case is not sufficient to actually observe it, because of the other optimizations.
Solution: allocate on the heap, or get a larger stack using pthread_attr_setstacksize or clone on Linux.
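As a rough illustration of the first two options, here is a minimal sketch. The helper names (make_bits_on_heap, worker, run_with_big_stack) are made up for this example and the stack size is left to the caller; only the pthread calls and the bitset size come from the discussion above. It sticks to C++11 features, so there is no std::make_unique.

```cpp
#include <bitset>
#include <cstddef>
#include <memory>
#include <pthread.h>

constexpr std::size_t kBits = 16777215;  // same size as in the assembly above
typedef std::bitset<kBits> BigBitset;    // about 2 MiB per object

// Option 1: keep the bitset on the heap, so a stack frame only holds a pointer.
std::unique_ptr<BigBitset> make_bits_on_heap() {
    return std::unique_ptr<BigBitset>(new BigBitset());  // all bits zero
}

// Option 2: run the stack-hungry code on a thread with a larger stack.
// Hypothetical worker: keeps a BigBitset in its own frame and reports the
// number of set bits through `arg` (which must point to a std::size_t).
void* worker(void* arg) {
    BigBitset bits;  // lives on this thread's enlarged stack
    bits.set(0);
    bits.set(kBits - 1);
    *static_cast<std::size_t*>(arg) = bits.count();
    return nullptr;
}

// Starts `worker` on a thread whose stack is `stack_bytes` large and waits
// for it. Returns true on success; the worker's result is stored in `out`.
bool run_with_big_stack(std::size_t stack_bytes, std::size_t& out) {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return false;
    bool ok = pthread_attr_setstacksize(&attr, stack_bytes) == 0;
    pthread_t tid;
    ok = ok && pthread_create(&tid, &attr, worker, &out) == 0;
    pthread_attr_destroy(&attr);
    if (ok) ok = pthread_join(tid, nullptr) == 0;
    return ok;
}
```

For example, run_with_big_stack(16u * 1024 * 1024, n) asks for a 16 MiB stack, which leaves plenty of room for a couple of 2 MiB bitsets. Depending on the toolchain, you may need to compile and link with -pthread.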