This does not answer point 2 ( return a == d || b == d || c == d; running at the same speed as return false ). That is still a potentially interesting question, since the former must compile to multiple uop-cache lines of instructions.
> The fact that the 32-aligned version is faster is strange to me, because [Intel's manual says to align at 16]
That optimization-guide advice is a very general guideline, and it definitely does not mean that larger alignment never helps. Usually it does not, and padding to 32 is more likely to hurt than help (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment usually does not matter at all.
16B is still not bad as a broad recommendation, but it does not tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branch targets and function entry points, but usually do not align other branch targets. The cost of executing NOPs (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
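If you want to experiment with the alignment of one specific function rather than rely on the compiler's defaults, GCC and Clang accept an aligned attribute on functions. The following is only a sketch under that assumption (it is a compiler extension, not standard C++); sum_to and entry_misalignment are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang extension: request that this function's first instruction
// be placed on a 32-byte boundary. noinline keeps a real entry point around.
__attribute__((aligned(32), noinline))
int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Returns the function's entry address modulo `boundary` (a power of two).
// A result of 0 means the entry point is aligned to that boundary.
inline std::uintptr_t entry_misalignment(int (*fn)(int), std::uintptr_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(fn) & (boundary - 1);
}
```

Calling entry_misalignment(&sum_to, 32) should return 0 when the attribute is honored. GCC also has whole-program options such as -falign-functions=N and -falign-loops=N, but as noted above, more padding is not automatically faster, so measure before keeping it.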
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family CPUs. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
- All micro-ops in a Way (uop cache line) represent instructions that are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means that an instruction extending past the boundary goes in the uop cache for the block containing its start, rather than its end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it is most useful to put it in a line for that block.)
- A multi micro-op instruction cannot be split across Ways.
- An instruction that turns on the MSROM consumes an entire Way.
- Up to two branches are allowed per Way.
- A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarchitecture guide. He adds:
- An unconditional jump or call always ends a μop cache line.
- Lots of other things that are probably not relevant here.
Also, if your code does not fit in the uop cache, it cannot run from the loop buffer.
The indirect effects of alignment include:
- Larger / smaller code size (L1I cache misses, TLB). Not relevant for your test.
- Which branches alias each other in the BTB (Branch Target Buffer).
> If I remove volatile from one.cpp, the code becomes slower. Why is that?
The larger instructions push the last instruction in the loop across a 32B boundary:
    59e: 83 eb 01    sub ebx,0x1
    5a1: 75 dd       jne 580 <main+0x20>
So, if you are not running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub / jne macro-fuse, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
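The boundary arithmetic for the two instructions in the disassembly above can be checked mechanically. This is a minimal sketch that only does the address math on the object-file offsets shown (0x59e and 0x5a1); the helper names block_start, crosses_boundary and same_block are hypothetical, and nothing here models the uop cache itself.

```cpp
#include <cassert>
#include <cstdint>

// Start address of the aligned block of size `block` (a power of two) containing addr.
constexpr std::uint64_t block_start(std::uint64_t addr, std::uint64_t block) {
    return addr & ~(block - 1);
}

// True if an instruction of `len` bytes starting at `addr` extends past the end of its block.
constexpr bool crosses_boundary(std::uint64_t addr, std::uint64_t len, std::uint64_t block) {
    return block_start(addr, block) != block_start(addr + len - 1, block);
}

// Runtime variant: do two instruction start addresses fall in the same aligned block?
inline bool same_block(std::uint64_t a, std::uint64_t b, std::uint64_t block) {
    assert(block != 0 && (block & (block - 1)) == 0);  // block size must be a power of two
    return block_start(a, block) == block_start(b, block);
}

// Offsets and lengths taken from the disassembly (not final linked addresses).
constexpr std::uint64_t sub_addr = 0x59e, sub_len = 3;  // 83 eb 01
constexpr std::uint64_t jne_addr = 0x5a1, jne_len = 2;  // 75 dd

static_assert(block_start(sub_addr, 32) == 0x580, "sub starts in the 32B block at 0x580");
static_assert(block_start(jne_addr, 32) == 0x5a0, "jne starts in the next 32B block");
static_assert(crosses_boundary(sub_addr, sub_len, 32), "sub spans the 32B boundary at 0x5a0");
static_assert(!crosses_boundary(sub_addr, sub_len, 64), "0x5a0 is not a 64B boundary");
static_assert(!crosses_boundary(jne_addr, jne_len, 32), "jne sits entirely in one 32B block");
```

At these offsets the sub starts in the 32-byte block at 0x580 while the jne starts in the block at 0x5a0, which matches the observation that the jne ends up on its own. The same pair stays within a single 64-byte block, so the 64B concern only arises if linking moves the code.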
Also, those are not real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
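To see why the final addresses matter, here is a small sketch that adds a section base address to the object-file offset before checking for a boundary crossing. The two base addresses in the usage note (0x400000 and 0x400020) are purely hypothetical values chosen for illustration, not something taken from your binary, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// True if the bytes [addr, addr + len) span more than one aligned block of size `block`.
inline bool crosses_boundary(std::uint64_t addr, std::uint64_t len, std::uint64_t block) {
    assert(len > 0 && block != 0 && (block & (block - 1)) == 0);
    return (addr & ~(block - 1)) != ((addr + len - 1) & ~(block - 1));
}

// Same check for an instruction at `offset` within a section that the linker
// placed at `section_base` (hypothetical values; read the real ones from the linked binary).
inline bool crosses_after_link(std::uint64_t section_base, std::uint64_t offset,
                               std::uint64_t len, std::uint64_t block) {
    return crosses_boundary(section_base + offset, len, block);
}
```

With a 64-byte-aligned base such as 0x400000, crosses_after_link(0x400000, 0x59e, 3, 64) is false: the sub stays inside one 64-byte block. With a base that is only 32-byte aligned, such as 0x400020, the same instruction lands at 0x4005be and crosses_after_link(0x400020, 0x59e, 3, 64) is true, because 0x4005c0 is a 64-byte boundary.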
Sorry, I have not actually tested this to say more about this specific case. The point is, when you bottleneck on the front end from things like a call / ret inside a tight loop, alignment becomes important and can get extremely complex. Whether or not each subsequent instruction crosses a boundary is affected. Do not expect it to be simple. If you have read my other answers, you will know that I am not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file.
In your case, make sure tiny functions inline. Use link-time optimization if your code base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
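As a rough illustration of the header approach, the sketch below puts the tiny comparison from point 2 into a header-style definition and calls it from a separate function. The file names contains.h and caller.cpp, and the names contains and count_hits, are hypothetical; both parts are shown in one listing only so the example is self-contained.

```cpp
// ---- contains.h (hypothetical header) ----
#ifndef CONTAINS_H
#define CONTAINS_H

// Defined in the header, so every .cpp file that includes it sees the body
// and the optimizer is able to inline it at the call site. The `inline`
// keyword permits the definition to appear in several translation units;
// whether the call is actually inlined is still the optimizer's decision.
inline bool contains(int a, int b, int c, int d) {
    return a == d || b == d || c == d;
}

#endif  // CONTAINS_H

// ---- caller.cpp (hypothetical caller; would normally #include "contains.h") ----
#include <cassert>
#include <cstddef>

// Counts how many elements of values[0..n) are equal to a, b, or c.
int count_hits(const int* values, std::size_t n, int a, int b, int c) {
    assert(values != nullptr || n == 0);
    int hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (contains(a, b, c, values[i])) {
            ++hits;
        }
    }
    return hits;
}
```

If the tiny function has to stay in its own source file, building and linking with -flto (GCC and Clang) gives the optimizer a chance to inline across files instead.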