Why does an increase in pipeline depth not always mean an increase in throughput?

Perhaps this is more of a conceptual question than a practical one, but I thought Stack Overflow might be the right place to ask. I am learning the concept of instruction pipelining. I was taught that a pipeline's throughput increases as the number of pipeline stages increases, but in some cases the throughput may not change. Under what conditions does this happen? I think stalls and branching might be the answer, but I wonder if I am missing something important.

+7
assembly intel pipelining
5 answers

Anywhere you can stall on other instructions, waiting for their results, or on cache misses. Pipelining alone does not guarantee that operations are completely independent. Here's a great presentation on the intricacies of the x86 Intel/AMD architecture: http://www.infoq.com/presentations/click-crash-course-modern-hardware

It explains all of this in detail and covers some of the techniques used to further increase throughput and hide latency. JustJeff mentioned out-of-order execution for one; there are also shadow registers not exposed to the programmer (more than the 8 architectural registers on x86), and branch prediction.
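
As a minimal sketch of the "waiting for results or cache misses" case (my own example in C, not part of the original answer): a chain of dependent loads. The address of each load is the result of the previous one, so a deeper pipeline cannot overlap the work, and a cache miss on any step stalls everything behind it.

    #include <stddef.h>

    /* Dependent loads: the next address is the result of the previous
     * load, so iterations cannot overlap in the pipeline, and a cache
     * miss on any step stalls all of the work behind it. */
    struct node {
        struct node *next;
        int value;
    };

    int sum_list(const struct node *head)
    {
        int sum = 0;
        while (head != NULL) {
            sum += head->value;     /* cheap */
            head = head->next;      /* serial dependency: wait for this load */
        }
        return sum;
    }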

+4

Agreed. The biggest problems are stalls (waiting for the results of previous instructions) and incorrect branch prediction. If your pipeline has 20 stages and you stall waiting for the result of a condition or an operation, you will wait longer than if your pipeline had only 5 stages. If you mispredict a branch, you have to flush 20 instructions from the pipeline, as opposed to 5.
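
To make the misprediction point concrete, here is a rough sketch (my example, illustrative only): a loop whose branch depends on the data. On random input the predictor fails often and each miss flushes the pipeline, which costs more cycles the deeper the pipeline is; on sorted input the same code runs much faster. (A compiler may turn this into a branchless conditional move, so treat it purely as an illustration.)

    #include <stddef.h>

    /* Counts elements above a threshold. The if() is a data-dependent
     * branch: unpredictable on random data (frequent pipeline flushes),
     * nearly free on sorted data (the predictor learns the pattern). */
    long count_big(const int *data, size_t n, int threshold)
    {
        long count = 0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] > threshold)
                count++;
        }
        return count;
    }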

I suppose you could also have a deep pipeline where several stages try to access the same hardware (an ALU, etc.), which would hurt performance, although hopefully you would add enough extra units to support each stage.

+2

Instruction-level parallelism has diminishing returns. In particular, data dependencies between instructions determine how much parallelism is possible.

Consider the read-after-write case (known as RAW in textbooks).

In a syntax where the first operand receives the result, consider this example:

    10: add r1, r2, r3
    20: add r1, r1, r1

The result of line 10 must be known by the time the calculation of line 20 begins. Data forwarding mitigates this problem, but... only until the data becomes known.
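
The same RAW idea, sketched in C (my example, illustrative): in the first loop every addition needs the previous value of sum, so the additions form one serial chain through the pipeline. Splitting the work across two independent accumulators breaks the chain and lets more additions be in flight at once.

    #include <stddef.h>

    /* One serial dependency chain: each += must wait for the previous sum. */
    double sum_serial(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];            /* RAW on sum every iteration */
        return sum;
    }

    /* Two independent chains: the adds in one chain can overlap with the
     * other, so the pipeline stays busier. */
    double sum_two_accumulators(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)
            s0 += a[i];
        return s0 + s1;
    }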

+1

I also think that deepening the pipeline beyond the time the longest instruction in the series takes to execute will not lead to an increase in performance. I do think that stalls and branching are the fundamental problems.

0

Definitely, stalls/bubbles in long pipelines cause a huge loss of throughput. And, of course, the longer the pipeline, the more clock cycles are lost.

I tried hard to think of other scenarios where longer pipelines could cause a loss of performance, but it all comes back to stalls. (And the number of execution units and the issue scheme, but those don't have much to do with pipeline length.)
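
A back-of-envelope model of that "more cycles lost" effect (my numbers, purely illustrative): if some fraction of instructions are mispredicted branches and each flush costs roughly one pipeline depth in cycles, the effective CPI grows with depth.

    #include <stdio.h>

    /* Illustrative model only: ideal CPI of 1, plus a flush of roughly
     * `depth` cycles for every mispredicted branch. The 2% misprediction
     * rate is an assumption made up for the sake of the example. */
    int main(void)
    {
        const double mispredict_rate = 0.02;
        const int depths[] = {5, 10, 20, 30};

        for (int i = 0; i < 4; i++) {
            double cpi = 1.0 + mispredict_rate * depths[i];
            printf("depth %2d: effective CPI %.2f, IPC %.2f\n",
                   depths[i], cpi, 1.0 / cpi);
        }
        return 0;
    }

This ignores the higher clock frequency a deeper pipeline allows, which is exactly why the trade-off exists.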

0
