Are the Kepler CC3.0 GPU processors not only a pipelined architecture, but also superscalar?

The documentation for CUDA 6.5 says: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb

5.2.3. Multiprocessor level

...

  • 8L for devices of compute capability 3.x, since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x.

Does this mean that Kepler CC3.0 GPUs are not only a pipelined architecture, but also superscalar?

  • Pipelining - these two sequences execute in parallel (different operations at any given time):

    • LOAD [addr1] β†’ ADD β†’ STORE [addr1] β†’ NOP
    • NOP β†’ LOAD [addr2] β†’ ADD β†’ STORE [addr2]
  • Superscalar - these two sequences execute in parallel (the same operations at any given time):

    • LOAD [reg1] β†’ ADD β†’ STORE [reg1]
    • LOAD [reg2] β†’ ADD β†’ STORE [reg2]
gpgpu cuda nvidia gpu-programming kepler
1 answer

Yes, the warp schedulers in Kepler can issue two instructions per cycle, provided that:

  • the instructions are independent,
  • the instructions come from the same warp, and
  • the SM has sufficient execution resources for both instructions.

If this meets your definition of superscalar, then it is superscalar.

As for pipelining, I view it differently: the various execution units in a Kepler SM are pipelined. Let's take a floating-point multiply as an example.

At a given clock cycle, the Kepler warp scheduler can issue a floating-point multiply operation to a floating-point unit. The result of that operation may not emerge until a number of cycles later (i.e. it is not available in the next cycle), but in the next cycle a new floating-point operation can be scheduled on the same floating-point functional units, because the hardware (in this case, the floating-point units) is pipelined.

    clock    operation    pipeline stage    result
      0       MPY1     ->   PS1
      1                     PS2
     ...                    ...
     N-1                    PSN          ->  result1

In the cycle after clock 0, a new multiply instruction can be scheduled on the same hardware, and its result will appear in the cycle after result1.

Not sure if this is what you meant by "different operations at a time".

