Why is it useful to avoid branching commands where possible?

I often read that branching, at the assembly-instruction level, is bad for performance. But I have never really seen why this is so. So why?

+7
5 answers

Most modern processors prefetch instructions, and some even speculatively execute them, before the code stream actually reaches them. A branch means that there are suddenly two different instructions that could be the next instruction. There are at least three ways this can interact with prefetching:

  • Instructions after the branch are not prefetched. The pipeline runs empty, and the processor has to wait until the next instruction is fetched at the last moment, which hurts performance.
  • The processor can guess which way the branch will go (branch prediction) and prefetch and follow the corresponding instructions. If it guesses the wrong branch, it has to throw away the work it did and wait until the correct instruction is fetched.
  • The processor can fetch and execute both branches, and then discard the results from the branch that was not taken.

Depending on the processor and the specific code, a branch may or may not have a significant performance impact compared to equivalent branch-free code. If the processor executing the code uses branch prediction (most do) and mostly guesses right for a particular piece of code, the impact may not be significant. On the other hand, if it mostly guesses wrong, the branch can slow things down a lot.

It is hard to predict whether removing a branch will significantly speed up a particular piece of code. As with any micro-optimization, it is best to measure the performance of both approaches rather than guess.
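
To make the "measure both" advice concrete, here is a minimal C sketch (my own illustration, not part of the original answer; the names, sizes, and data are invented) that times a branchy and a branchless version of the same summation with a crude clock()-based harness. On any given processor and compiler either version may win, which is exactly why measuring beats guessing.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    /* Branchy version: likely to compile to a data-dependent conditional branch
       (or a conditional move, depending on the compiler and options). */
    static long sum_branchy(const int *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (v[i] > 0)
                sum += v[i];
        }
        return sum;
    }

    /* Branchless version: build a 0 or all-ones mask from the comparison
       and add either v[i] or 0, so there is nothing to predict. */
    static long sum_branchless(const int *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            int keep = -(v[i] > 0);
            sum += v[i] & keep;
        }
        return sum;
    }

    int main(void) {
        int *v = malloc(N * sizeof *v);
        if (!v) return 1;
        for (size_t i = 0; i < N; i++)
            v[i] = (rand() % 201) - 100;   /* random sign, so the branch is hard to predict */

        clock_t t0 = clock();
        long a = sum_branchy(v, N);
        clock_t t1 = clock();
        long b = sum_branchless(v, N);
        clock_t t2 = clock();

        printf("branchy:    %ld in %.3f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("branchless: %ld in %.3f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(v);
        return 0;
    }

Build it with the optimization level you actually use; compilers often turn the branchy loop into a conditional move on their own, which is part of why measuring matters.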

+12

It is bad because it interferes with instruction prefetching. Modern processors start loading the bytes of the next instructions while still processing the current one, in order to run faster. When a branch occurs, those already-fetched "next instructions" may have to be discarded, which wastes time. Inside a tight loop or the like, these missed prefetches can add up.
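
As a hedged illustration of how that adds up in a tight loop (my own sketch, not the answerer's code; the threshold and sizes are arbitrary), the loop below is timed over unsorted data, where the branch outcome is essentially random, and over the same data sorted, where the branch becomes highly predictable. The work is identical; only the predictability of the branch changes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 5000000

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* One data-dependent branch per iteration; every misprediction in a
       tight loop like this costs a pipeline refill. */
    static long count_big(const int *v, size_t n) {
        long count = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] >= 128)
                count++;
        return count;
    }

    int main(void) {
        int *unsorted = malloc(N * sizeof *unsorted);
        int *sorted   = malloc(N * sizeof *sorted);
        if (!unsorted || !sorted) return 1;
        for (size_t i = 0; i < N; i++)
            sorted[i] = unsorted[i] = rand() % 256;   /* branch outcome is roughly random */
        qsort(sorted, N, sizeof *sorted, cmp_int);    /* same values, predictable branch */

        clock_t t0 = clock();
        long a = count_big(unsorted, N);
        clock_t t1 = clock();
        long b = count_big(sorted, N);
        clock_t t2 = clock();

        printf("unsorted: %ld in %.3f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("sorted:   %ld in %.3f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(unsorted);
        free(sorted);
        return 0;
    }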

+2

Because the processor does not know which instructions it is supposed to prefetch for execution if you give it a choice. If the branch goes a different way than predicted, it has to flush the instruction pipeline, since the instructions it already loaded are now wrong, and that makes things a few cycles slower...

+1

In addition to the prefetching problems, while you are jumping you are not doing any other useful work...

+1

If you think about an automobile assembly line, you hear things like X number of cars coming off the line per day. That does not mean raw materials went in at the start of the line and X cars completed the entire journey within that one day. A single car may take days to travel from the start of the line to the end; that is the point of a conveyor. Imagine, though, that for some reason there was a change in production and you basically had to pull every car off the line, scrapping them or saving parts to build different cars at another time. It would take quite a while to refill that assembly line and get back to X cars per day.

The instruction pipeline in a processor works the same way. There may be many stages in the pipeline, but the concept is the same: to sustain one or more instructions per clock cycle (X cars per day) you have to keep that pipeline flowing smoothly. So you have a fetch, which burns a memory cycle (memory is usually slow, but layers of caching help), then a decode, which takes more clocks, and an execute, which can take many clocks, especially on a CISC machine like x86. When you take a branch, on most processors you have to discard the instructions sitting in the decode and prefetch stages, basically two thirds of your pipeline if you think of the generic simplified three-stage pipeline. Then you have to wait through those fetch and decode clocks before you are back to smooth execution. On top of that, fetching something that is by definition not the next sequential instruction misses the cache some percentage of the time, which means fetching from memory or a higher level of cache, costing even more clocks than if you were executing linearly.

One common solution is the branch delay slot: some processors declare that the instruction after a branch (sometimes the two instructions after it) is always executed, no matter what. That way you keep doing useful work while the pipe refills, and a good compiler arranges the code so that something useful sits after each branch. The cheap way out is simply to put a nop or two after each branch, which is yet another performance hit, but it is what a lot of code for such platforms ends up with. A third way is what ARM does with conditional execution. Short forward branches are not at all unusual, so instead of branching around a few instructions when a condition fails, you mark those instructions as conditional; if the condition is false they still flow through decode and execute, but as nops, and the pipe keeps moving. ARM falls back on the traditional flush-and-refill for longer or backward branches.
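
As a rough sketch of that short-forward-branch case (mine, not the answer author's; what actually gets emitted depends on the compiler, target, and optimization level), the two C functions below do the same thing. The first is the shape that may compile to a compare-and-branch over one instruction; the second is the shape that compilers commonly turn into a conditional select (csel on AArch64, a predicated instruction on classic ARM, cmov on x86), so nothing has to be predicted.

    /* The shape that may compile to a short forward branch:
       compare, branch past one instruction, otherwise fall through. */
    int clamp_branchy(int x, int limit) {
        if (x > limit)
            x = limit;
        return x;
    }

    /* The same computation as one expression; compilers commonly emit a
       conditional select here instead of a branch. */
    int clamp_select(int x, int limit) {
        return (x > limit) ? limit : x;
    }

At higher optimization levels a compiler may well generate identical code for both; the point is only that very short conditional regions are exactly what conditional execution is designed to absorb.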

In the old x86 manuals (8088/86), and in other processor manuals of the same vintage, as well as in microcontroller manuals (old and new), the clock cycles for each instruction are published, and for branch instructions they say something like "add X clocks if the branch is taken". Your modern x86 processors, and even ARM and the other processors built to run Windows or Linux or other (big and slow) operating systems, do not bother; they often just say they execute one instruction per clock, or quote MIPS per megahertz and the like, and do not necessarily publish a clock table per instruction. You can only guess, and remember that, like the one car per day, it is the final stage of execution being counted, not all the other clocks. Microcontroller folks in particular deal with more than one clock per instruction, and also have to care more about execution time than the average desktop application does. Look at the specifications for some of these: Microchip PIC (not the PIC32, which is MIPS), msp430, and specifically the 8051; although the 8051 is or has been made by many different companies, its timing specifications vary considerably from vendor to vendor.
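
To show how such a table gets used, here is a purely hypothetical budget (the cycle counts are invented for illustration, not taken from any real manual): suppose the loop body costs 4 clocks, the compare 1 clock, and the branch back to the top 1 clock when not taken and 3 clocks when taken. A loop that runs 1000 times takes the branch 999 times, so it costs roughly 999 * (4 + 1 + 3) + 1 * (4 + 1 + 1) = 7998 clocks, and the taken branches alone account for 999 * 3 = 2997 of those clocks, well over a third of the loop's total time. On a small microcontroller with no branch predictor this kind of arithmetic comes straight out of the datasheet; on a desktop processor you can really only measure.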

Bottom line: for desktop applications, or even kernel drivers in an operating system, the compiler is not that efficient and the operating system adds so much more overhead that you are unlikely to notice the time savings. Move to a microcontroller and use too many branches, and your code can run 2 or 3 times slower, even comparing compiled code against compiled code. Granted, using a compiler at all (instead of writing assembly) can and will already make your code 2 or 3 times slower; you have to balance development, maintenance, and portability against performance.

0
