Does a branch misprediction flush the entire pipeline, even for a very short if-statement body?

Everything I read seems to indicate that a branch misprediction always causes the entire pipeline to be flushed, which means a lot of wasted cycles. I have never heard anyone mention any exception for short if bodies.

It seems like in some cases this would be very wasteful. For example, suppose you have an if statement with a very simple body that compiles down to 1 CPU instruction. The if clause will be compiled into a conditional jump forward over that single instruction. If the CPU predicts the branch as not taken, it will start executing the if-body instruction and can immediately continue executing the following instructions. Now, when the evaluation of the if condition reaches the end of the pipeline, which may be, say, 12 cycles later, the CPU knows whether the prediction was right or wrong. If it mispredicted and the branch was actually taken, then the CPU really only needs to discard 1 instruction from the pipeline (the one inside the if body). However, if it flushes the entire pipeline, then all the work done on the following instructions is wasted as well and has to be repeated for no reason. That is a lot of wasted cycles on a deeply pipelined architecture.

So, do modern CPUs have any mechanism to discard only the few instructions inside a short if body? Or do they really flush the entire pipeline? If it is the latter, then I suppose a conditional move instruction would give better performance. As an aside, does anyone know whether modern compilers are good at converting short if statements into cmov instructions?

+7
performance branch-prediction branch x86 cpu-architecture
4 answers

Most general-purpose processors do flush the pipeline on a branch misprediction. The negative impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path is selected later) and dynamic predication (where instructions in the shadow of the branch are predicated), in addition to extensive research on branch prediction (as well as other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al., "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)

The IBM POWER7 appears to be the first mainstream processor to implement anything more sophisticated than prefetching the alternative path (i.e., eager fetch), and it only handles the single-instruction case. (POWER7 uses a branch-prediction confidence estimate to choose whether to predicate or to use the prediction.)

Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch-prediction confidence, speculation depth, and resource availability (information available to the front end), it can easily be more effective to speculate more deeply down a single path. Discovering the joining points of multiple paths and avoiding excessive redundant computation also adds complexity. (Ideally, control-independent operations would be executed only once and joining and data flow would be optimized, but such optimization adds complexity.)

For a deeply pipelined in-order processor, it might seem attractive to predict short forward branches as not taken and to squash only the instructions skipped over by the branch when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches use prediction), adding a single bit to each instruction could control whether it is converted into a nop or executed. (If only branches over a single instruction are handled, allowing multiple such branches in the pipeline might not be particularly complex.)

This would be similar to annulling delay slots. MIPS had Branch Likely instructions, which annul the delay-slot instruction if the branch is not taken, and these are marked as obsolete in Revision 2.62. While some of the justification for that is presumably to separate the implementation from the interface and a desire to recover instruction encoding space, this decision also hints that the concept has some problems.

If this were done for all short forward branches, it would squash instructions when the branch was correctly predicted as taken. (Note that this penalty could be smaller if taken branches always incur a fetch-redirect delay, which would be more likely with a multi-cycle instruction cache access in a deeply pipelined processor. In that case, fetching past the branch would have the same performance as a correctly predicted taken branch. However, one could argue for special-casing such short taken branches to minimize such fetch bubbles.)

As an example, consider a scalar pipeline (non-branch instructions per cycle equal to 1.0) with branch resolution at the end of the eighth stage, no fetch-redirect penalty for correctly predicted taken branches, and handling of branches over a single instruction. Assume 75% branch-prediction accuracy (unbiased in direction) for such short forward branches (2% of instructions, taken 30% of the time) and 93% accuracy for other branches (18% of instructions). Eight cycles would be saved for short branches that would otherwise be mispredicted as taken (17.5% of such branches, 0.35% of instructions), seven cycles when mispredicted as not taken (7.2%, 0.144%), and one cycle would be lost when correctly predicted as taken (22.5%, 0.45%). In total, 0.03358 cycles per instruction would be saved. Without this optimization, the cycles per instruction would be 1.2758.

(The numbers above are merely illustrative and probably far from reality, except for the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save energy in short loops), since instruction cache access probably accounts for three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the overhead for short branches predicted "strongly taken" might be worthwhile.)

In-order designs generally use narrower and shallower pipelines and favor simplicity (for lower design, power, and area costs). Since the instruction set is likely to support branchless code for many short-branch cases, the incentive to optimize this aspect is reduced further.

For out-of-order implementations, the potentially squashed instructions must be buffered, since the processor wants to be able to execute later, independent instructions. Predication introduces an additional data dependency that must be checked for scheduling. Instruction schedulers typically provide only two comparators per instruction, and even a conditional move is a simple instruction with three dataflow operands (old value, alternative value, and condition); a predicated register-register add would have four operands. (There are alternative ways to address this issue, but this answer is already long.)

An out-of-order implementation also will not stall when a branch condition is unavailable. This is a tradeoff between a control dependency and a data dependency. With accurate branch prediction, a control dependency is extremely cheap, but a data dependency can hold up forward progress waiting on data operands. (Of course, when the condition's data dependency is available early, predication becomes somewhat more attractive. Using predicate prediction may be desirable in some cases and would have an advantage over simple predication in being able to use dynamic cost and confidence estimates.)

(This perhaps hints at why ARM chose to drop most predication in 64-bit AArch64. While much of that decision is about instruction encoding, the benefit of predication for high-performance implementations appears to be relatively modest.)

Compiler issues

The performance of branchless versus branching code depends on the predictability of the branch and other factors (including, if taken, any fetch-redirect penalty), but it is difficult for a compiler to determine the predictability of a branch. Even profile data typically provides only branch frequencies, which can give a pessimistic view of predictability, since it does not account for the branch predictor using local or global history. A compiler is also unaware of data-availability timing and other dynamic aspects. If the condition becomes available later than the operands used in the computation, replacing a control dependency (branch prediction) with a data dependency (predication) can degrade performance. Branchless code can also introduce more live values, potentially leading to register spill and fill overhead.

Complicating this further, most instruction sets that provide only conditional move or select instructions do not provide a conditional store. While this can be worked around by using a conditional move to select a safe, ignored storage location, it seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an add plus a conditional move might take three cycles where a correctly predicted branch plus add would take zero (if the add is branched over) or one cycle.

Another complication is that predicated operations are generally ignored by the branch predictor. If a later, retained branch correlates with the condition of a removed branch, the misprediction rate for that later branch may increase. (Predicate prediction could be used to preserve the predictor effects of such removed branches.)

With the increased emphasis on vectorization, the use of branchless code becomes even more significant, since branch-based code constrains the ability to use operations on an entire vector.

+6

"If this is incorrectly predicted, and the industry was actually taken, then the CPU really only needs to drop 1 instruction from the pipeline (the one that is in the if-case).

It is not as easy as you make it sound. Instructions modify various pieces of architectural state that other instructions rely on (register results, condition flags, memory, etc.). By the time you realize you mispredicted, you could potentially have many instructions in the pipeline that have started executing based on state changed by that instruction and all the subsequent instructions in the pipeline... Not to mention instructions that can raise faults/exceptions.

A simple example:

b = 0;
if (a == 0) { b = 1; }
c = b * 10;
if (b == 0) printf("\nc = %d.", c);
foo(b);
etc..

Undoing that "one simple instruction" would take a lot of work.

For simple branches with poor predictability, predication/cmovs/etc. are preferred.

+3

At least on most processors, a mispredicted branch flushes the entire pipeline.

This is a big part of why many (most?) modern processors also provide predicated instructions.

On ARM, most instructions are predicated, meaning the instruction itself can include a condition that says, essentially, "do X, but only if the following condition is true."

Likewise, recent iterations of x86/x64 include some predicated instructions, such as CMOV (conditional move), which work the same way: the instruction takes effect only if the condition holds.

These do not flush the pipeline; the instruction itself always just flows through the pipeline. If the condition is not met, the instruction essentially has no effect. The downside is that the instruction consumes execution time even when it has no effect.

So, in the case you are talking about (an if with a tiny body that can be implemented in only a few instructions), you can implement it as predicated instructions.

If the body takes enough instructions (roughly, the depth of the instruction pipeline multiplied by some constant factor), it starts to make more sense to use a conditional jump instead.

+1

Modern high-performance out-of-order processors generally do not flush the entire pipeline 0 on a misprediction: they typically use something like a strategy of killing the branch instruction and all younger instructions. The front end is flushed, since it will be full of instructions from the mispredicted path, but modern large cores can have more than 100 instructions in flight, only some of which may be younger than the branch.

This means the cost of a branch depends at least partly on the surrounding instructions: if the branch condition can be checked early, the impact of a misprediction can be limited or even zero 1 . On the other hand, if the branch condition is handled late, after significant resources have already been spent on the wrong path, the cost can be large (for example, larger than the frequently quoted "published" misprediction penalties of 12-20 cycles).


0 The exact terminology is debatable here: the meaning of flushing the pipeline is not entirely clear for out-of-order processors. Here I mean that the CPU does not discard all in-flight but not-yet-completed instructions.

1 In particular, the limiting factor for some sequence of instructions may be a dependency chain whose current execution is far enough behind the leading edge of the instruction window that the misprediction does not discard any of those instructions and does not slow the code down at all.

+1
