Differences between Virtual and Real CUDA Architectures

I am trying to understand the differences between the virtual and real CUDA architectures and how the various configurations affect program performance, for example:

-gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -gencode arch=compute_21,code=sm_21 ... 

The following explanation is given in the NVCC manual:

GPU compilation is performed via an intermediate representation, PTX ([...]), which can be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics processor, such a virtual GPU is defined entirely by the set of capabilities, or features, that it provides to the application. In particular, a virtual GPU architecture provides a (largely) generic instruction set, and binary instruction encoding is a non-issue because PTX programs are always represented in text format. Hence, an nvcc compilation command always uses two architectures: a compute architecture to specify the virtual intermediate architecture, plus a real GPU architecture to specify the intended processor to execute on. For such an nvcc command to be valid, the real architecture must be an implementation (one way or another) of the virtual architecture. This is further explained below. The chosen virtual architecture is more of a statement on the GPU capabilities that the application requires: using the smallest virtual architecture still allows the widest range of actual architectures for the second nvcc stage. Conversely, specifying a virtual architecture that provides features unused by the application unnecessarily restricts the set of possible GPUs that can be specified in the second nvcc stage.

But it is still not entirely clear to me how the different configurations affect performance (or do they only affect which physical GPU devices the code can run on?). In particular, this statement confuses me:

In particular, a virtual GPU architecture provides a (largely) generic instruction set, and binary instruction encoding is a non-issue because PTX programs are always represented in text format.

+7
4 answers

The GPU Compilation section of the NVIDIA CUDA Compiler Driver NVCC User Guide gives a very detailed description of the virtual and physical architectures and how the concepts are used in the build process.

The virtual architecture specifies the feature set that the code targets. The list below shows the evolution of the virtual architectures. When compiling, you should specify the lowest virtual architecture that provides a sufficient feature set, so that the program can run on the widest range of physical architectures (see the example after the list).

List of virtual architecture features (from the user guide):

compute_10   Basic features
compute_11   + atomic memory operations on global memory
compute_12   + atomic memory operations on shared memory
             + vote instructions
compute_13   + double precision floating point support
compute_20   + Fermi support
compute_30   + Kepler support
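
For example (a minimal sketch; the file name is made up), a kernel that uses double precision needs at least the compute_13 feature set, so the lowest virtual architecture it can be compiled against is compute_13:

nvcc kernel.cu -gencode arch=compute_13,code=sm_13

Picking compute_13 rather than, say, compute_20 keeps the widest range of real architectures available for the second stage, since sm_13, sm_20, sm_30 and so on all implement compute_13.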

The physical architecture specifies the implementation of the GPU. It provides the compiler with the instruction set, instruction latencies, instruction throughput, resource sizes, etc., so that the compiler can optimally translate the virtual architecture into binary code.

You can specify multiple pairs of virtual and physical architectures to the compiler and have it embed the final PTX and binaries in a single fat binary. At run time, the CUDA driver will choose the best representation for the installed physical device. If no suitable binary code is provided in the fatbinary, the driver can JIT-compile the best available PTX.
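
As a hedged sketch of such a build (the file names are made up), the following command embeds two device binaries plus the compute_30 PTX in one fat binary; a Fermi card picks the sm_20 binary, a Kepler card picks the sm_30 binary, and a newer device falls back to JIT-compiling the embedded PTX:

nvcc app.cu -o app \
    -gencode arch=compute_20,code=sm_20 \
    -gencode arch=compute_30,code=sm_30 \
    -gencode arch=compute_30,code=compute_30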

+6

The "Virtual Architecture" code will be compiled by the "exactly at the point in time" compiler before downloading to the device. AFAIK, this is the same compiler as one NVCC when creating the "physical architecture" code offline, so I don’t know if there will be any differences in application performance.

Basically, every generation of CUDA hardware is binary-incompatible with the previous generation - imagine the next generation of Intel processors sporting an ARM instruction set. Because of this, virtual architectures provide an intermediate representation of the CUDA application that can be compiled for compatible hardware. Every hardware generation introduces new features (such as atomics or CUDA Dynamic Parallelism) that require new instructions - and that is why new virtual architectures are needed.

Basically, if you want to use CDP, you should compile for SM 3.5. You can compile it to device binary code, which will contain assembly for a specific generation of CUDA devices, or you can compile it to PTX code, which can be compiled into device assembly for any device generation that provides these features.
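
For instance (a hedged sketch; the file name is made up), a kernel that launches other kernels via CUDA Dynamic Parallelism is typically built with relocatable device code and linked against the device runtime, targeting the compute_35 virtual architecture that introduces the feature:

nvcc cdp_app.cu -gencode arch=compute_35,code=sm_35 -rdc=true -lcudadevrt -o cdp_app

Here compute_35 is what exposes CDP; the sm_35 half of the pair just selects one real implementation of that feature set.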

+3

The virtual architecture determines what capabilities the graphics processor has, and the real architecture determines how it implements them.

I can't come up with any concrete examples off the top of my head. A (possibly incorrect) example: a virtual GPU might specify the number of cores a card has, so code is generated targeting that number of cores, while the real card may have slightly more for redundancy (or slightly fewer due to manufacturing errors) plus some core-mapping mechanism layered on top of the more general code generated in the first step.

You can think of PTX as something like assembly code that targets a certain architecture, which can then be compiled into machine code for a specific processor. Targeting the assembly code at the right kind of processor will, in general, produce better machine code.
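
If you want to look at both levels yourself (a hedged sketch; file names are made up), nvcc can emit the PTX text and cuobjdump can disassemble the machine code of a compiled binary:

nvcc -ptx kernel.cu -o kernel.ptx
nvcc -cubin -arch=sm_20 kernel.cu -o kernel.cubin
cuobjdump --dump-sass kernel.cubin

The PTX is the generic virtual-architecture "assembly"; the SASS that cuobjdump prints is the machine code tied to one real hardware generation.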

+1

As a rule, whatever NVIDIA writes as documentation makes people (including me) more confused! (maybe just me!)

You are interested in performance; basically, the documentation says there should not be a difference (maybe), but you should test it yourself. GPU architecture development is a bit like nature: they run something, something happens, then they try to explain it, and then they serve it to you.

In the end, you should probably run some tests and see which configuration gives the best result.

The virtual architecture is the one that lets you think freely. You should respect it, but you can use almost as many threads as you want and assign almost anything as the number of threads and blocks; it does not matter, it will be translated to PTX and the device will run it.

The only problem: if you assign more than 1024 threads to a single block, you will get 0s as a result, because the device (the real architecture) does not support it, so the kernel launch simply fails and the output is never written.
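
A minimal sketch of how to see that failure at run time (the kernel and sizes are made up for illustration): check cudaGetLastError() after the launch instead of silently reading back zeros.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *out) {
    out[threadIdx.x] = 1;
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 2048 * sizeof(int));

    fill<<<1, 2048>>>(d_out);               // 2048 > 1024 threads per block
    cudaError_t err = cudaGetLastError();   // typically cudaErrorInvalidConfiguration
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}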

Or, for example, your device supports compute capability 1.2: you can declare double-precision variables in your code, but again you will get 0s as a result, because the device simply cannot run it (double precision requires compute capability 1.3).

You should also know that the 32 threads of a warp should access memory in a coalesced pattern; otherwise their accesses will be serialized, and so on.

So, I hope you get the point by now. This is a relatively new field, the GPU is a really complex hardware architecture, and everybody is trying to get the most out of it, but it is a game of testing and of knowing a little about the actual architecture behind CUDA. I suggest reading up on GPU architecture and seeing how virtual threads and thread blocks are actually implemented.

0
