Usually you want to convert a loop that tests at the top into one that tests at the bottom. To do this, you generally jump (more or less) into the middle of the loop body for the first iteration. In pseudo-code, what you have now is basically:
    initialize sum
beginning:
    load a word
    if (done) goto end
    add to sum
    increment pointer
    goto beginning
end:
To optimize this, we want to change the structure to something like this:
    initialize sum
    goto start_loop
beginning:
    add to sum
    increment pointer
start_loop:
    load a word
    if (!done) goto beginning
Thus, there is only one jump per iteration instead of two (and it is a short backward jump, so the target will almost always be in the cache).
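To make the transformation concrete, here is a minimal C sketch of both loop shapes for summing a zero-terminated array. The function names and the zero terminator are assumptions for illustration, not from the original; a real compiler performs this rotation for you, the `goto` form only makes the control flow explicit.

```c
/* Top-tested form: conceptually two jumps per iteration
   (the conditional exit test plus the jump back to the top). */
int sum_top_test(const int *p) {
    int sum = 0;
    while (*p != 0) {   /* test at the top of the loop */
        sum += *p;
        p++;
    }
    return sum;
}

/* Rotated, bottom-tested form: jump into the middle of the body
   for the first iteration, then one conditional jump per iteration. */
int sum_bottom_test(const int *p) {
    int sum = 0;
    goto start_loop;
beginning:
    sum += *p;
    p++;
start_loop:
    if (*p != 0)        /* test at the bottom of the loop */
        goto beginning;
    return sum;
}
```

Both functions return the same result; only the placement of the test (and therefore the number of jumps per iteration) differs.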
However, I should add that this optimization is now largely obsolete: with decent branch prediction, a correctly predicted unconditional jump is usually close to free.
Edit: since loop unrolling has been mentioned, I'll add my two cents on that as well. Branch prediction generally makes unrolling a loop pointless unless you can use the unrolling to execute additional instructions in parallel. That doesn't apply here, but it is often useful in real life. For example, assuming a processor with two separate adders, we can unroll two iterations of the loop and accumulate their results into two separate registers, so both adders are used and two values are added at the same time. When the loop ends, we combine the two registers to get the final value. In C-like pseudo-code, it would look something like this:
odds = 0;
evens = 0;
do {
    evens += pointer[0];
    odds  += pointer[1];
    pointer += 2;
} while (pointer[0] && pointer[1]);
total = odds + evens;
As written, this adds a couple of extra requirements: the sequence must be terminated by two consecutive zeros instead of one, and the array must contain at least two elements. Note, however, that it is not the loop inversion itself that gives the main advantage here, but the parallel execution.
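Turned into compilable C, the unrolled pseudo-code above might look like the sketch below. The function name and the `int` element type are assumptions for illustration; as noted, the input must hold at least two elements and end with two consecutive zeros.

```c
/* Sum an array using two independent accumulators, so a processor with
   two adders can perform both additions of an iteration in parallel.
   Assumes at least two elements and a terminator of two consecutive zeros. */
int sum_unrolled(const int *pointer) {
    int odds = 0, evens = 0;
    do {
        evens += pointer[0];   /* even-indexed elements -> first accumulator  */
        odds  += pointer[1];   /* odd-indexed elements  -> second accumulator */
        pointer += 2;
    } while (pointer[0] && pointer[1]);
    return odds + evens;       /* combine the two partial sums at the end */
}
```

The two `+=` statements have no data dependency on each other, which is what lets the hardware overlap them; the single dependency chain of the original loop is split into two shorter ones.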
Beyond that, unrolling a loop only really pays off if a not-taken branch is cheaper than a taken branch (even when both are predicted correctly). On some older processors (older Intels, for example), a taken branch incurred a penalty that a not-taken branch did not. At the same time, the unrolled loop occupies more cache space, so on a modern processor it is often a net loss (unless, as above, we can use the unrolling to gain parallelism).