The fact is that modern processors are complicated. All the instructions being executed interact with each other in complicated and interesting ways. Thanks to "that other guy" for posting the code.
Both OP and "that other guy" apparently found that the short loop takes 11 cycles per iteration, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time, even though it contains a lot of operations. For the short loop, there must be some stall caused by the loop being so short, and just adding a nop makes the loop long enough to avoid the stall.
One thing stands out if we look at the code:
    0x00000000004005af <+50>: addq $0x1,-0x20(%rbp)
    0x00000000004005b4 <+55>: cmpq $0x7fffffff,-0x20(%rbp)
    0x00000000004005bc <+63>: jb 0x4005af <main+50>
We read i and write it back (addq). We read it again immediately and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time the addq is executed, the processor is not really sure it is allowed to write to i (because the branch prediction could be wrong).
Then we compare against i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead, some bit of hardware will remember that we just wrote to i by adding to it, and instead of reading i from memory, the cmpq instruction gets its data from the store instruction. Unfortunately, at this point we are not sure whether the write to i actually happened or not! So that could introduce a stall here.
The problem here is that the conditional jump, the addq that leads to a conditional store, and the cmpq that is not sure where to get its data from are all very close together. They are unusually close together. It could be that they are so close together that the processor cannot figure out at this point whether to take i from the store instruction or to read it from memory. So it reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.
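To try this yourself, here is a minimal C sketch of the kind of loop that could compile to the addq/cmpq/jb sequence above. It is not OP's actual source: the function names `count_plain` and `count_with_nop` and the `limit` parameter are my own, `volatile` is used only to force a real load and store of i on every iteration, and the inline-asm syntax assumes GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical helper: counts from 0 up to `limit` through a volatile
   variable, so every iteration really loads, adds, stores and compares
   i in memory, even with compiler optimizations enabled. */
uint64_t count_plain(uint64_t limit)
{
    volatile uint64_t i = 0;
    while (i < limit) {
        i += 1;
    }
    return i;
}

/* Same loop with one extra no-op per iteration.
   The __asm__ syntax is a GCC/Clang extension (assumption). */
uint64_t count_with_nop(uint64_t limit)
{
    volatile uint64_t i = 0;
    while (i < limit) {
        __asm__ volatile("nop");
        i += 1;
    }
    return i;
}
```

Call both with a large limit such as 0x7fffffff and time them (for example with `clock()` or `perf stat`). Whether the nop version comes out faster depends on the processor and on the code the compiler actually generates, so check the disassembly before drawing conclusions.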
Usually you think that there is RAM, and there is cache. On a modern Intel processor, a memory read can be served from (slowest to fastest):
- Memory (RAM)
- L3 cache (optional)
- L2 cache
- L1 cache
- A previous store instruction that has not been written to L1 cache yet.
So what the processor does internally in the short, slow loop:
1. Read i from L1 cache
2. Add 1 to i
3. Write i to L1 cache
4. Wait until i is written to L1 cache
5. Read i from L1 cache
6. Compare i with INT_MAX
7. Go to (1) if it is less.
In the long, fast loop, the processor does:
1. Lots of stuff
2. Read i from L1 cache
3. Add 1 to i
4. Do a "store" instruction that will write i to L1 cache
5. Read i directly from the "store" instruction without touching L1 cache
6. Compare i with INT_MAX
7. Go to (1) if it is less.
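The difference between the two step lists can be sketched with a toy model in C. This is only an illustration of the idea of reading from a pending store instead of from L1. It is not how any real Intel core is implemented, and every name here (`ToyCore`, `toy_store`, `toy_load`, `toy_drain`) and both sizes are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SB_CAPACITY 4   /* arbitrary store-buffer size for the toy model */
#define MEM_WORDS   8   /* toy "L1 cache": a tiny word-addressed array   */

typedef struct {
    size_t   addr[SB_CAPACITY];   /* pending stores, oldest first */
    uint64_t value[SB_CAPACITY];
    size_t   count;               /* number of pending stores     */
    uint64_t l1[MEM_WORDS];       /* data already committed to L1 */
} ToyCore;

/* Commit all pending stores to the toy L1 in program order. */
void toy_drain(ToyCore *c)
{
    for (size_t k = 0; k < c->count; k++) {
        c->l1[c->addr[k]] = c->value[k];
    }
    c->count = 0;
}

/* Queue a store (addr must be < MEM_WORDS). If the buffer is full,
   this toy model simply drains it first. */
void toy_store(ToyCore *c, size_t addr, uint64_t value)
{
    if (c->count == SB_CAPACITY) {
        toy_drain(c);
    }
    c->addr[c->count] = addr;
    c->value[c->count] = value;
    c->count++;
}

/* Load a word. The newest pending store to the same address wins
   ("read i directly from the store instruction"); otherwise the
   value comes from the toy L1. *forwarded reports which path ran. */
uint64_t toy_load(const ToyCore *c, size_t addr, bool *forwarded)
{
    for (size_t k = c->count; k > 0; k--) {
        if (c->addr[k - 1] == addr) {
            *forwarded = true;
            return c->value[k - 1];
        }
    }
    *forwarded = false;
    return c->l1[addr];
}
```

In this model the fast loop corresponds to `toy_store` followed directly by `toy_load`, which is served from the pending store. The slow loop corresponds to calling `toy_drain` in between, so the load has to wait for the write and then read from L1.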