Most general-purpose processors flush the pipeline on a branch misprediction. The negative impact of conditional branches has motivated proposals for eager execution (where both paths are executed and the correct path is selected later) and dynamic predication (where instructions in the shadow of the branch are predicated), in addition to extensive research on branch prediction (and other techniques). (Mark Smotherman's page on eager execution provides some details and references. I would add Hyesoon Kim et al., "Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution", 2005, as a significant paper.)
IBM's POWER7 appears to be the first mainstream processor to implement anything more complex than prefetching the alternate path (i.e., eager fetch), and even it only handles the single-instruction case. (POWER7 uses a branch-prediction confidence estimate to choose whether to predicate or to rely on the prediction.)
Eager execution has the obvious problem of exploding resource use. Even with selective eagerness based on branch-prediction confidence, depth of speculation, and resource availability (information available to the front end), it can easily be more effective to speculate deeper along a single path. Discovering the join points of multiple paths and avoiding excessive redundant computation also add complexity. (Ideally, control-independent operations would be executed only once and the joining and data flow would be optimized, but such optimization adds complexity.)
For a deeply pipelined in-order processor, it might seem attractive to predict short forward branches as not taken and to nullify the instructions already in the pipeline between the branch and its target only when the branch is actually taken. If only one such branch is allowed in the pipeline at a time (other branches use prediction), adding a single bit to each instruction could control whether it is converted into a nop or executed. (If only branches skipping a single instruction were handled, allowing multiple such branches in the pipeline might not be particularly complicated.)
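A minimal sketch of that single-bit mechanism, as a toy C model rather than any real pipeline (the struct and function names are invented for illustration):

```c
/* Toy model of nullification for a short forward branch over one
 * instruction: the branch is fetched as not taken, the shadow
 * instruction carries a "nullify-if-taken" bit, and when the branch
 * resolves taken that instruction becomes a nop instead of triggering
 * a pipeline flush. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  op;               /* opcode (0 = nop in this toy model)          */
    bool nullify_if_taken; /* set for the instruction in the branch shadow */
} Instr;

/* Called when the guarding branch resolves; returns the instruction that
 * should continue down the pipeline. */
static Instr resolve_shadow(Instr shadow, bool branch_taken)
{
    if (shadow.nullify_if_taken && branch_taken)
        shadow.op = 0;     /* convert to nop: no flush, one lost slot */
    return shadow;
}

int main(void)
{
    Instr add = { .op = 42, .nullify_if_taken = true };
    printf("taken:     op=%d\n", resolve_shadow(add, true).op);  /* nop */
    printf("not taken: op=%d\n", resolve_shadow(add, false).op); /* add */
    return 0;
}
```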
Such nullification would be similar to the annulment used for delay slots. MIPS had Branch Likely instructions, which annul the delay-slot instruction if the branch is not taken, and these are marked as obsolete in Revision 2.62. Although some of the rationale for that seems to be decoupling the implementation from the interface and a desire to reclaim instruction-encoding space, this decision also hints that the concept has some problems.
If this were done for all short forward branches, it would throw away instructions whenever a branch was correctly predicted as taken. (Note that this penalty could be smaller if taken branches always incur a fetch-redirect delay, which would be more likely with a multi-cycle instruction-cache access in a deeply pipelined processor. In that case, the fetch-through-and-nullify handling would perform the same as a correctly predicted taken branch. However, one could argue that the processor should special-case such short taken branches to minimize such fetch bubbles.)
As an example, consider a scalar pipeline (sustaining 1.0 instructions per cycle on non-branch code) with branch resolution at the end of the eighth stage, no fetch-redirect penalty for correctly predicted taken branches, and handling of branches that skip a single instruction. Suppose such short forward branches (handled as predicted not-taken) are 2% of instructions and are taken 30% of the time, with 75% branch-prediction accuracy, while other branches are 18% of instructions with 93% prediction accuracy. Eight cycles would be saved for the short branches that would otherwise have been mispredicted as taken (17.5% of such branches, 0.35% of instructions), seven cycles for those that would have been mispredicted as not taken (7.2%, 0.144%), and one cycle would be lost for those correctly predicted as taken (22.5%, 0.45%). In total, 0.03358 cycles per instruction would be saved. Without this optimization, the cycles per instruction would be 1.2758.
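For clarity, the arithmetic can be spelled out; this small C program simply reproduces the per-instruction savings from the fractions assumed above:

```c
/* Worked version of the savings estimate above (illustrative only). */
#include <stdio.h>

int main(void)
{
    double mispred_taken     = 0.0035;   /* 17.5% of the 2% short branches */
    double mispred_not_taken = 0.00144;  /*  7.2% of the 2% short branches */
    double correct_taken     = 0.0045;   /* 22.5% of the 2% short branches */

    double saved = 8.0 * mispred_taken     /* flush avoided entirely       */
                 + 7.0 * mispred_not_taken /* flush replaced by a single nop */
                 - 1.0 * correct_taken;    /* a nop where a correctly
                                              predicted taken branch would
                                              have had no redirect penalty */

    printf("cycles saved per instruction: %.5f\n", saved);  /* 0.03358 */
    return 0;
}
```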
(The numbers above are purely illustrative and probably far from reality, with the possible exception of the 1.0 IPC for non-branch instructions. Providing a small loop cache would reduce the misprediction penalty (and save energy in short loops), since instruction-cache access likely accounts for three of the eight cycles. Adding the effect of cache misses would further reduce the percentage improvement from this branch optimization. Avoiding the one-cycle overhead for short branches predicted as "strongly taken" might be worthwhile.)
In-order designs also tend to use narrower and shallower pipelines and to favor simplicity (for lower design, power, and area costs). Since the instruction set most likely already supports branchless code for many short-branch cases, the incentive to optimize this aspect is reduced further.
For an out-of-order implementation, potentially predicated instructions introduce complications, since the processor wants to execute later, independent instructions before the condition resolves. Predication adds a data dependency that must be checked during scheduling. Instruction schedulers typically provide only two comparators per instruction and would split a conditional move (a simple instruction with three data-flow operands: the old value, the alternative value, and the condition) into two operations; a predicated add would have four operands. (There are alternative ways to address this, but this answer is already long.)
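To make the operand-count issue concrete, here is an illustrative C view of the dataflow inputs involved (no particular ISA or scheduler design is implied):

```c
/* Dataflow view of why predication stresses a wakeup/select scheduler
 * that compares only two source tags per entry (illustrative C; the
 * machine operations are described in comments). */
int cmov(int old_value, int alt_value, int cond)
{
    /* Conditional move: already three data-flow inputs, so a two-tag
     * scheduler must either be widened or split it into two operations. */
    return cond ? alt_value : old_value;
}

int predicated_add(int old_value, int a, int b, int cond)
{
    /* A predicated add is worse: four inputs, because the old
     * destination value must be preserved when the predicate is false. */
    return cond ? (a + b) : old_value;
}
```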
Non-predicated (branching) code also does not stall when the branch condition is unavailable. The choice is a tradeoff between a control dependency and a data dependency. With accurate branch prediction the control dependency is extremely inexpensive, whereas the data dependency can delay forward progress while waiting for the condition to become available as a data operand. (Of course, when the branch outcome is data-dependent and hard to predict, predication becomes somewhat more attractive. Predicate prediction may be desirable in some cases, and it would have the advantage over plain predication of exploiting dynamic estimates of cost and confidence.)
(Perhaps this hints at why ARM chose to drop pervasive predication in 64-bit AArch64. Although much of that decision was about instruction encoding, the benefit of predication for high-performance implementations appears to be relatively small.)
Compiler issues
The relative performance of branchless and branching code depends on the predictability of the branch and other factors (including any fetch-redirect penalty if taken), but it is difficult for a compiler to determine the predictability of a branch. Even profile data typically provides only branch frequencies, which can give a pessimistic picture of predictability, since it does not account for a branch predictor using local or global history. The compiler is also unaware of the timing of data availability and other dynamic aspects. If the condition becomes available later than the operands used for the computation, then replacing a control dependency (branch prediction) with a data dependency (predication) can degrade performance. Branchless code may also introduce more live values, potentially leading to register spills and additional overhead.
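As a concrete illustration of the choice the compiler faces (illustrative C; whether the branchless form wins depends on how predictable `cond` is and how late it arrives, which the compiler usually cannot know):

```c
/* Two source-level forms of the same computation. */
int branchy(int cond, int x, int y)
{
    if (cond)          /* control dependency: cheap if well predicted */
        x = x + y;
    return x;
}

int branchless(int cond, int x, int y)
{
    int t = x + y;         /* computed unconditionally                   */
    return cond ? t : x;   /* data dependency on cond (e.g., a cmov);
                              the result waits if cond arrives late      */
}
```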
Complicating matters further, most instruction sets that provide only conditional move or select instructions do not provide a conditional store. Although this can be worked around by using a conditional move to select a safe, ignored storage location, that seems an unattractive complication. In addition, conditional move instructions are often more expensive than simple arithmetic instructions; an add plus a conditional move might take three cycles where a correctly predicted branch plus add would take zero cycles (if the add is branched over) or one cycle.
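The workaround mentioned above might look like this in C (a sketch; `sink` is a scratch variable introduced just for the example):

```c
/* Conditional store emulated with a select of the address: the store
 * itself always executes, and a conditional move/select chooses between
 * the real target and a safe, ignored location. */
static int sink;                      /* safe, ignored storage location */

void conditional_store(int cond, int *target, int value)
{
    int *p = cond ? target : &sink;   /* expected to become a cmov/select */
    *p = value;                       /* unconditional store              */
}
```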
Another complication is that predicated operations are generally ignored by the branch predictor. If a later, remaining branch correlates with the condition of a removed branch, the misprediction rate for that later branch may increase. (Predicate prediction could be used to retain the predictor-training effects of such removed branches.)
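A small illustration of how that correlation can be lost (hypothetical code; the first condition is assumed to be if-converted by the compiler):

```c
/* If the first conditional is converted to branchless code, the branch
 * predictor never records its outcome, so any global-history correlation
 * that would have helped predict the second conditional is lost. */
int correlated(int x, int a, int b)
{
    int r = (x < 0) ? -a : a;   /* if-converted: no branch recorded       */

    if (x < 0)                  /* correlated with the removed condition;
                                   may now be predicted less accurately   */
        r += b;
    return r;
}
```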
With the increasing emphasis on vectorization, the use of branchless (predicated) code becomes even more significant, since branch-based code constrains the ability to use operations on an entire vector.
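For example, a loop like the following is amenable to whole-vector execution once the conditional update is expressed as a select rather than a branch (illustrative scalar C; an auto-vectorizer or explicit masked SIMD would apply the condition as a per-element mask/blend):

```c
/* Conditional update written in a branchless, vectorization-friendly form. */
void scale_positive(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        float doubled = b[i] * 2.0f;
        /* select instead of: if (b[i] > 0) a[i] = doubled; */
        a[i] = (b[i] > 0.0f) ? doubled : a[i];
    }
}
```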