I have a CUDA kernel that is called from a for loop, something like this:
for (int i = 0; i < 10; i++) { myKernel<<<1000, 256>>>(A, i); }
Suppose now that I have an NVIDIA card with 15 streaming multiprocessors (SMs). Also suppose, for simplicity, that only one block can be mapped to each SM, which means that most of the time 15 blocks will be executing on the device at once. Since kernel launches are asynchronous, the call with i = 1 will be queued for execution immediately after the first kernel (with i = 0) is launched.
My question is this: at some point, as the first kernel (i = 0) winds down, only 14 SMs will be occupied, then only 13, then 12, then 11, and so on.
Will the kernel with i = 1 begin executing on the device as soon as one SM becomes available, or will the launch of this second kernel wait until all SMs have finished working on the first kernel (with i = 0)?
Suppose also that all launches are issued in the same CUDA stream.
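For context, here is a minimal self-contained version of the setup. The kernel name `myKernel`, the launch configuration, and the array `A` come from the snippet above; the kernel body and the allocation size are placeholder assumptions, since the question does not show them:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel body (assumption): each thread adds the
// loop index to one element of A, just to have some work per block.
__global__ void myKernel(float *A, int i) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] += i;
}

int main() {
    const int nBlocks = 1000, nThreads = 256;
    float *A;
    cudaMalloc(&A, nBlocks * nThreads * sizeof(float));
    cudaMemset(A, 0, nBlocks * nThreads * sizeof(float));

    // All ten launches go into the default stream; each <<<...>>> call
    // returns immediately on the host, so the kernels are queued
    // back-to-back on the device side.
    for (int i = 0; i < 10; i++) {
        myKernel<<<nBlocks, nThreads>>>(A, i);
    }

    cudaDeviceSynchronize();  // wait for all queued kernels to finish
    cudaFree(A);
    return 0;
}
```
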