Think of an automobile assembly line. You hear things like X number of cars coming off the line per day. That does not mean the raw materials entered the start of the line and X cars completed the entire cycle in one day; it may well take several days for a single car to travel from one end to the other. That is the point of a conveyor. Now imagine, though, that for some reason production changes and you basically have to dump every car currently on the line, scrapping them or salvaging parts to build other cars later. It will take some time to refill the assembly line before you are back to X cars per day.
The pipeline in a processor works the same way. Some pipelines have many stages, but the concept is the same: to retire one or more instructions per clock cycle (X cars per day), you have to keep that pipeline flowing smoothly. So you have a prefetch stage that burns a memory cycle, which is normally slow, though layers of caching help. Decode takes another clock; execute can take many clocks, especially on a CISC machine like x86. When you take a branch, on most processors you have to discard the instructions being prefetched and decoded, basically two thirds of your pipeline if you picture a simplified generic one. Then you have to wait those clocks for fetch and decode before you are back to smooth execution. On top of that, fetching something which by definition is not the next instruction is a cache miss some percentage of the time, meaning a fetch from memory or a higher-level cache, which costs even more clocks than if you had executed linearly. One common solution is that some processors declare that the instruction after a branch, and sometimes the two instructions after it, is always executed (a branch delay slot). That way you execute something useful while the pipe is flushing, and a good compiler will schedule a useful instruction into the slot after each branch. The lazy way is to simply put a nop or two after each branch, taking another performance hit, but folks on that platform will be used to it. A third way is what ARM does: conditional execution. For short forward branches, which are not at all unusual, instead of branching over a few instructions on a condition, you mark those instructions as conditional; if the condition fails they still pass through decode and execute, but as nops, and the pipe keeps moving. ARM falls back on the traditional flush-and-refill for longer or backward branches.
Old x86 manuals (8088/86), equally old manuals for other processors, and microcontroller manuals (new and old) publish the clock cycles each instruction takes to execute, and for branch instructions they say something like "add X clocks if the branch is taken". Modern x86 processors, and even ARM and other processors designed to run Windows or Linux or other (big and slow) operating systems, do not bother; they often just say they execute one instruction per clock, or talk about MIPS per MHz or similar, and there is no table of clocks per instruction. You can only estimate, and remember it is like the one car per day: that figure is the final retirement stage of execution, not the other clocks spent in the pipe. Microcontroller folks in particular deal with more than one clock per instruction, and have to be far more aware of execution time than the average desktop application developer. Look at the specs for a few of them: the Microchip PICs (not the PIC32, which is MIPS based), the msp430, and especially the 8051, which is or has been made by many different companies whose timing specs vary widely.
Bottom line: for desktop applications, or even kernel drivers in an operating system, the compiler is inefficient enough and the operating system adds so much overhead that you are unlikely to notice the time saved by avoiding branches. Move to a microcontroller, though, and too many branches can make your code two or three times slower, even compared with other compiled code. And using a compiler at all (rather than writing assembler) can/will itself make your code two or three times slower, so you have to balance development, maintenance, and portability against performance.