Are the Kepler CC3.0 GPU processors not only a pipelined architecture, but also superscalar?

The documentation for CUDA 6.5 says: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb

5.2.3. Multiprocessor level

...

  • 8L for devices of compute capability 3.x, since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x.

Does this mean that Kepler CC3.0 GPUs are not only a pipelined architecture, but also superscalar?

  • Pipelining - these two sequences execute in parallel (different operations at any given time):

    • LOAD [addr1] β†’ ADD β†’ STORE [addr1] β†’ NOP
    • NOP β†’ LOAD [addr2] β†’ ADD β†’ STORE [addr2]
  • Superscalar - these two sequences execute in parallel (the same operations at any given time):

    • LOAD [reg1] β†’ ADD β†’ STORE [reg1]
    • LOAD [reg2] β†’ ADD β†’ STORE [reg2]
gpgpu cuda nvidia gpu-programming kepler
1 answer

Yes, the warp schedulers in Kepler can issue two instructions per cycle, provided that:

  • the instructions are independent,
  • the instructions come from the same warp, and
  • the SM has sufficient execution resources for both instructions.

If this meets your definition of superscalar, then it is superscalar.

As for pipelining, I view it differently: the various execution units in a Kepler SM are pipelined. Let's take a floating-point multiply as an example.

At a given clock cycle, the Kepler warp scheduler can issue a floating-point multiply operation to a floating-point unit. The result of that operation may not emerge until a number of cycles later (i.e. it is not available in the next cycle), but in the next cycle a new floating-point operation can be scheduled on the same floating-point functional units, because the hardware (in this case, the floating-point units) is pipelined.

    clock    operation    pipeline stage    result
      0       MPY1     ->   PS1
      1                     PS2
     ...                    ...
     N-1                    PSN          ->  result1

In the cycle after clock 0, a new multiply instruction can be scheduled on the same hardware, and its result will appear in the cycle after result1.

Not sure if this is what you meant by "different operations at a time".

