Are GPU/CUDA cores SIMD ones?

Take the nVidia Fermi Compute Architecture. It says:

The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating-point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each.

[...]

Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and a floating point unit (FPU).

[...]

In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.

From what I know, GPUs execute threads in so-called warps, where each warp consists of ~32 threads, and each warp is assigned to only one core (is this true?). Does this mean that each of the 32 cores of a single SM is a SIMD processor, where a single instruction processes 32 pieces of data? If so, why do we say there are 32 threads in a warp, and not one SIMD stream? Why are the cores sometimes called scalar processors rather than vector processors?
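For concreteness, here is the kind of trivial kernel I have in mind (just a sketch of my own, not taken from the whitepaper): every thread runs the same instruction stream, which is why a warp looks like a 32-lane SIMD unit to me.

    __global__ void scale(float *data, float factor, int n)
    {
        // Each thread handles one element, but the hardware issues
        // each instruction for a whole warp of 32 threads at once.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }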

+9
simd gpu gpgpu cuda
2 answers

Each warp is assigned to only one core (is this true?).

No, that's not true. A warp is a logical grouping of 32 threads of execution. In order to execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).

The cores are effectively scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction across 32 threads, which is where the moniker "SIMT" comes from.
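As a small illustration (my own sketch, not from NVIDIA's documentation), a thread can compute which warp it belongs to and which lane it occupies within that warp; all 32 threads of a warp share the same warp index, while the lane index is the closest thing to a SIMD lane number:

    __global__ void who_am_i(int *warp_of, int *lane_of)
    {
        // warpSize is a built-in device variable, equal to 32 on current hardware.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        warp_of[tid] = tid / warpSize;  // shared by all 32 threads of a warp
        lane_of[tid] = tid % warpSize;  // 0..31, analogous to a SIMD lane index
    }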

+14

CUDA "cores" can be thought of as SIMD bands.

First, let's recall that the term "CUDA core" is nVIDIA marketing-speak. These are not cores the same way a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know on CPUs.

The equivalent of a CPU core on the GPU is a "streaming multiprocessor" (SM): it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory, and so on. It is CUDA thread blocks, not warps, that are assigned to a GPU core, i.e. to a streaming multiprocessor.

Within an SM, warps get selected to have instructions scheduled, for the entire warp. From a CUDA perspective, those are 32 separate threads which are instruction-locked; but that's really no different from saying that a warp is like a single thread which only executes 32-lane SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. Something you don't quite have on CPU SIMD lanes is masking of which lanes are actively executing, where inactive lanes will not have the effect of active lanes' setting of register values, writes to memory, etc.
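To make the masking point concrete, here is a sketch of my own (not part of the original discussion): when the threads of one warp take different branches, the hardware executes both paths one after the other, each time masking off the lanes that did not take the current path.

    __global__ void divergent(float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % warpSize;  // lane index within the warp
        if (lane < 16)
            out[tid] = 1.0f;  // lanes 0-15 active, lanes 16-31 masked off
        else
            out[tid] = 2.0f;  // lanes 16-31 active, lanes 0-15 masked off
    }

Both assignments are issued warp-wide; the mask simply suppresses the side effects of the inactive lanes, which is exactly the SIMD-lane picture above.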

I hope this makes intuitive sense to you (or perhaps you've figured it out yourself over the past 2 years).

+4
