Usually you want to convert a loop that tests at the top into one that tests at the bottom. To do this, you generally jump (more or less) into the middle of the loop body for the first iteration. In pseudo-code, what you have now is basically:
    initialize sum
beginning:
    load a word
    if (done) goto end
    add to sum
    increment pointer
    goto beginning
end:
To optimize this, we want to change the structure to something like this:
    initialize sum
    goto start_loop
beginning:
    add to sum
    increment pointer
start_loop:
    load a word
    if (!done) goto beginning
Thus, there is only one jump per iteration instead of two (and it is a short backward jump, so the target will almost always be in the cache).
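To make the transformation concrete, here is a minimal C sketch of both loop shapes for summing a zero-terminated array. The function names and the zero terminator are assumptions for illustration, not from the original; a real compiler performs this rotation for you, the `goto` form only makes the control flow explicit.

```c
/* Top-tested form: conceptually two jumps per iteration
   (the conditional exit test plus the jump back to the top). */
int sum_top_test(const int *p) {
    int sum = 0;
    while (*p != 0) {   /* test at the top of the loop */
        sum += *p;
        p++;
    }
    return sum;
}

/* Rotated, bottom-tested form: jump into the middle of the body
   for the first iteration, then one conditional jump per iteration. */
int sum_bottom_test(const int *p) {
    int sum = 0;
    goto start_loop;
beginning:
    sum += *p;
    p++;
start_loop:
    if (*p != 0)        /* test at the bottom of the loop */
        goto beginning;
    return sum;
}
```

Both functions return the same result; only the placement of the test (and therefore the number of jumps per iteration) differs.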
However, I should add that this optimization is now largely obsolete: with decent branch prediction, a correctly predicted unconditional jump is usually close to free.
Edit: since loop unrolling has been mentioned, I'll add my two cents on that as well. Branch prediction generally makes unrolling a loop pointless unless you can use the unrolling to execute additional instructions in parallel. That doesn't apply here, but it is often useful in real life. For example, assuming a processor with two separate adders, we can unroll two iterations of the loop and accumulate their results into two separate registers, so both adders are used and two values are added at the same time. When the loop ends, we combine the two registers to get the final value. In C-like pseudo-code, it would look something like this:
odds = 0;
evens = 0;
do {
    evens += pointer[0];
    odds  += pointer[1];
    pointer += 2;
} while (pointer[0] && pointer[1]);
total = odds + evens;
As written, this adds a couple of extra requirements: the sequence must be terminated by two consecutive zeros instead of one, and the array must contain at least two elements. Note, however, that it is not the loop inversion itself that gives the main advantage here, but the parallel execution.
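Turned into compilable C, the unrolled pseudo-code above might look like the sketch below. The function name and the `int` element type are assumptions for illustration; as noted, the input must hold at least two elements and end with two consecutive zeros.

```c
/* Sum an array using two independent accumulators, so a processor with
   two adders can perform both additions of an iteration in parallel.
   Assumes at least two elements and a terminator of two consecutive zeros. */
int sum_unrolled(const int *pointer) {
    int odds = 0, evens = 0;
    do {
        evens += pointer[0];   /* even-indexed elements -> first accumulator  */
        odds  += pointer[1];   /* odd-indexed elements  -> second accumulator */
        pointer += 2;
    } while (pointer[0] && pointer[1]);
    return odds + evens;       /* combine the two partial sums at the end */
}
```

The two `+=` statements have no data dependency on each other, which is what lets the hardware overlap them; the single dependency chain of the original loop is split into two shorter ones.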
Beyond that, unrolling a loop only really pays off if a not-taken branch is cheaper than a taken branch (even when both are predicted correctly). On some older processors (older Intels, for example), a taken branch incurred a penalty that a not-taken branch did not. At the same time, the unrolled loop occupies more cache space, so on a modern processor it is often a net loss (unless, as above, we can use the unrolling to gain parallelism).