I saw a lot of people complaining about the -O3 option:
GCC: the program does not work with the -O3 compilation option
Floating-point issue provided by David Hamman
I checked the GCC manual:
-O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions and -frename-registers options.
I also checked the GCC source to confirm that these two flags are the only extra optimizations enabled by -O3:
    if (optimize >= 3)
      {
        flag_inline_functions = 1;
        flag_rename_registers = 1;
      }
For these two optimizations:
-finline-functions is useful in some cases (mainly with C++), because it lets you control the maximum size of functions eligible for inlining (600 by default) via -finline-limit. The compiler may run out of memory if a very high limit is set.
-frename-registers tries to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization benefits most on processors with a large number of registers.
Regarding function inlining: although it can reduce the number of function calls, it can also produce much larger binaries, so -finline-functions may introduce severe instruction-cache penalties and end up even slower than -O2. I think the cache penalty depends not only on the program itself but also on the cache of the target machine.
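For instance, a quick way to see the size effect on a given target is to build a small test case at both levels and compare the object files. The file name inline_demo.cpp, the small_add function, and the exact commands below are an illustrative sketch, not anything from the question:

    // inline_demo.cpp -- small_add() is the kind of short function that
    // -finline-functions targets.
    #include <iostream>

    static int small_add(int a, int b)
    {
        return a + b;                     // tiny body: a prime candidate for inlining
    }

    int main()
    {
        long long sum = 0;
        for (int i = 0; i < 1000000; ++i)
            sum += small_add(i, i + 1);   // a call site that -O3 may inline away
        std::cout << sum << '\n';
        return 0;
    }

    // To gauge the code-size effect one could compare, for example:
    //   g++ -O2 -c inline_demo.cpp && size inline_demo.o
    //   g++ -O3 -c inline_demo.cpp && size inline_demo.o
    //   g++ -O3 -finline-limit=60 -c inline_demo.cpp && size inline_demo.o
    // Whether the text segment actually grows depends on the program and the target.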
Regarding register renaming: I don't think it will have much positive effect on a CISC architecture such as x86, which exposes only a few architectural registers.
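One way to check whether -frename-registers changes anything on a given target is to diff the generated assembly with and without the flag. The loop and the file name rename_demo.cpp below are illustrative assumptions:

    // rename_demo.cpp
    void scale(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * 2.0f + 1.0f;   // independent iterations the scheduler can reorder
    }

    // Compare the output with and without the flag:
    //   g++ -S -O2 -o plain.s rename_demo.cpp
    //   g++ -S -O2 -frename-registers -o renamed.s rename_demo.cpp
    //   diff plain.s renamed.s
    // On register-poor targets the diff is often empty or very small.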
My question consists of 2.5 parts:
[Answered] 1. Am I right in saying that whether a program runs faster with the -O3 option depends on the underlying platform/architecture?
EDIT: The first part is confirmed to be true. David Hamman also points out that we must be very careful about how optimizations and floating-point operations interact on machines with extended-precision floating point, such as those from Intel and AMD.
2. When can I confidently use the -O3 option? I believe these two optimizations, especially register renaming, may lead to behavior that differs from -O0/-O2. I have seen some programs compiled with -O3 break at runtime; is that failure deterministic? If I run the executable once without any failure, does that mean it is safe to use -O3?
EDIT: The determinism has nothing to do with optimization; it is a multithreading issue. However, for a multithreaded program it is not safe to conclude that -O3 is fine just because the executable ran once without errors. David Hamman shows that -O3's floating-point optimizations may violate the strict weak ordering required for comparisons. Is there any other concern we need to take care of when we want to use the -O3 option?
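For reference, the strict-weak-ordering problem typically shows up with a comparator of the following shape. This is a hedged sketch of the pattern, not David Hamman's exact code, and whether it actually misbehaves depends on the target's excess precision (e.g. x87) and the optimization flags:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct Point { double x, y; };

    // Comparing freshly computed values (instead of values stored once) is what
    // makes the result sensitive to excess precision under aggressive optimization:
    // one operand may be kept in an 80-bit register while the other is rounded to
    // 64 bits in memory, so the comparator can give inconsistent answers.
    static bool closer_to_origin(const Point &a, const Point &b)
    {
        return a.x * a.x + a.y * a.y < b.x * b.x + b.y * b.y;
    }

    int main()
    {
        std::vector<Point> pts(100);
        for (std::size_t i = 0; i < pts.size(); ++i)
            pts[i] = Point{ std::sin(double(i)), std::cos(double(i)) };

        // std::sort requires a strict weak ordering; if excess precision makes the
        // comparator inconsistent, the behavior is undefined.
        std::sort(pts.begin(), pts.end(), closer_to_origin);

        std::cout << pts.front().x << ' ' << pts.front().y << '\n';
        return 0;
    }

A common mitigation is to compute and store the sort key once per element before sorting, or to force consistent rounding (for example with -ffloat-store or SSE math), instead of comparing freshly computed values.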
[Answered] 3. If the answer to the first question is yes, then when changing the target platform, or in a distributed system with different machines, I may need to switch between -O3 and -O2. Are there any general ways to decide whether -O3 will improve performance? For example: more registers, short inlined functions, etc.
EDIT: Luen answered the third part: since "the variety of platforms makes it impossible for a general discussion about this problem," the way to evaluate the gain from -O3 is to build the code with both options and benchmark it to see which is faster.
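A minimal sketch of that "build both and measure" approach, assuming a hypothetical bench.cpp with a stand-in workload:

    #include <chrono>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main()
    {
        std::vector<double> v(10000000);
        std::iota(v.begin(), v.end(), 0.0);

        auto t0 = std::chrono::steady_clock::now();
        double s = std::accumulate(v.begin(), v.end(), 0.0);   // stand-in for the real hot path
        auto t1 = std::chrono::steady_clock::now();

        std::cout << s << ' '
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
        return 0;
    }

    // Build and compare on the actual target machine, e.g.:
    //   g++ -O2 -o bench_o2 bench.cpp && ./bench_o2
    //   g++ -O3 -o bench_o3 bench.cpp && ./bench_o3
    // The faster binary is the right choice for that machine; the answer can
    // differ across the machines of a distributed system.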