CUDA kernel call from for loop

I have a CUDA kernel that is called from a for loop, something like this:

for (int i = 0; i < 10; i++) { myKernel<<<1000, 256>>>(A, i); }

Suppose now that I have an NVIDIA card with 15 streaming multiprocessors (SMs). Also suppose, for simplicity, that only one block can be resident on an SM at a time, which basically means that most of the time 15 blocks will be executing on the device. Since kernel launches are asynchronous with respect to the host, the call with i = 1 will be queued for execution immediately after the first kernel (with i = 0) is launched.
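
To make the launch pattern concrete, here is a minimal self-contained sketch (the kernel body and array size are hypothetical placeholders, not from the question): the host issues all ten launches immediately and only blocks at an explicit synchronization point.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the real myKernel's body is not
// shown in the question.
__global__ void myKernel(float *A, int i) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] += i;
}

int main() {
    float *A;
    cudaMalloc(&A, 1000 * 256 * sizeof(float));
    cudaMemset(A, 0, 1000 * 256 * sizeof(float));

    for (int i = 0; i < 10; i++) {
        // Returns immediately; the launch is queued on the default stream.
        myKernel<<<1000, 256>>>(A, i);
    }

    // The host reaches this point while the kernels may still be running.
    cudaDeviceSynchronize();  // block until all queued launches finish
    printf("all launches complete\n");

    cudaFree(A);
    return 0;
}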

My question is this: toward the end of the first kernel's execution (with i = 0), its remaining blocks will occupy only 14 SMs, then only 13, then 12, then 11, and so on.

Will the kernel with i = 1 be sent for execution on the device as soon as an SM becomes available, or will the launch of this second kernel wait until all SMs have finished working on the first kernel (with i = 0)?

Assume also that all launches are issued into the same CUDA stream.

1 answer

Kernel launches issued into the same stream are serialized: the kernel with i = 1 will not begin executing until the kernel with i = 0 has completely finished, even if SMs become idle in the meantime. Kernels launched from different streams may overlap if there are sufficient resources (SMs, shared memory, registers, etc.).
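
As a sketch of the multi-stream alternative (the non-default streams and placeholder kernel below are an illustration, not part of the original answer), issuing each launch into its own stream removes the implied ordering, so the scheduler may overlap the kernels when SMs free up.

#include <cuda_runtime.h>

// Same hypothetical placeholder kernel as in the question's sketch.
__global__ void myKernel(float *A, int i) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] += i;
}

int main() {
    const int N = 10;
    float *A;
    cudaMalloc(&A, 1000 * 256 * sizeof(float));
    cudaMemset(A, 0, 1000 * 256 * sizeof(float));

    cudaStream_t streams[N];
    for (int i = 0; i < N; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < N; i++)
        // The fourth launch parameter selects the stream; launches in
        // different streams have no implied ordering between them.
        myKernel<<<1000, 256, 0, streams[i]>>>(A, i);

    cudaDeviceSynchronize();  // wait for work in all streams

    for (int i = 0; i < N; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(A);
    return 0;
}

Note that with 1000 blocks per launch, each kernel on its own saturates a 15-SM device for most of its run, so in this particular case the overlap would only fill the idle SMs in each kernel's tail.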
