As a rule, what NVIDIA writes as documentation tends to leave people (including me) even more confused! (Or maybe it's just me.)
If you are interested in performance, the documentation basically says you shouldn't have to care about the underlying hardware (maybe), but in practice you should. GPU architecture has grown in a fairly organic way: they build something, observe what happens, then try to explain it, and only then serve that explanation to you.
In the end you should probably run some tests and see which configuration gives the best result.
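As a rough illustration of "run some tests", here is a minimal sketch (not from the original answer) that times the same kernel with several block sizes using CUDA events; the kernel, sizes, and candidate block counts are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // trivial work, just so there is something to time
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    int candidates[] = {128, 256, 512, 1024};
    for (int threads : candidates) {
        int blocks = (n + threads - 1) / threads;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        scale<<<blocks, threads>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.3f ms\n", threads, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d);
    return 0;
}
```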
The virtual architecture is what allows you to think freely: use as many threads as you want, assign almost anything as the number of threads and blocks, it does not matter; it will be translated into PTX and the device will try to run it.
The only problem is that if you assign more than 1024 threads to a single block, the kernel will not launch and you will simply get zeros as the result, because the device (the real architecture) does not support that.
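A minimal sketch (my own, not from the answer) of why you "just get zeros": an oversized block makes the launch fail, and unless you check for the error you never notice. The 2048-thread configuration is a deliberately invalid launch for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;
}

int main() {
    const int n = 2048;
    int *d_out, h_out[2048] = {0};
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(int));

    fill<<<1, 2048>>>(d_out, n);          // exceeds the 1024 threads-per-block limit
    cudaError_t err = cudaGetLastError(); // the launch itself reports the failure
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d (still zero, the kernel never ran)\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}
```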
Or, for example, if your device only supports compute capability 1.2, you can still write double-precision variables in your code, but again you will get zeros as the result, because the device simply cannot run it.
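A minimal sketch of checking the device's compute capability at runtime before relying on a feature; double precision needs compute capability 1.3 or newer on those old parts, and that threshold is used here purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device %s: compute capability %d.%d, max %d threads per block\n",
           prop.name, prop.major, prop.minor, prop.maxThreadsPerBlock);

    // Double precision first appeared with compute capability 1.3.
    bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
    if (!hasDouble)
        printf("double precision is not supported on this device\n");
    return 0;
}
```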
You should also know that the 32 threads of a warp should access memory in a coalesced pattern (e.g., consecutive addresses within one aligned segment); otherwise the accesses will be serialized into multiple transactions, and so on.
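Here is a minimal sketch (again my own, under the assumption above) contrasting a coalesced access pattern, where adjacent threads in a warp read adjacent addresses, with a strided one whose reads get split into many memory transactions; the kernel names and the stride value are illustrative only.

```cuda
#include <cuda_runtime.h>

__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                  // a warp touches one contiguous segment
}

__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride]; // neighbouring threads read far-apart addresses
}

int main() {
    const int n = 1 << 18, stride = 32;
    float *in, *out;
    cudaMalloc(&in, (size_t)n * stride * sizeof(float)); // strided kernel needs n*stride elements
    cudaMalloc(&out, n * sizeof(float));

    copyCoalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copyStrided<<<(n + 255) / 256, 256>>>(in, out, n, stride);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```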
So, I hope you get the point by now. This is a relatively new field, and the GPU is a really complicated piece of hardware; everyone is trying to get the most out of it, but it is a game of testing plus a little knowledge of the actual architecture behind CUDA. I suggest reading up on the GPU architecture and seeing how the virtual threads and thread blocks are actually mapped onto the hardware.
Soroosh bateni