I have a CUDA kernel that is called from a for loop, something like this:
for (int i = 0; i < 10; i++) { myKernel<<<1000, 256>>>(A, i); }
Suppose now that I have an NVIDIA card with 15 streaming multiprocessors (SMs). Also suppose, for simplicity, that only one block can be mapped to each SM, which means that most of the time 15 blocks will be executing on the device at once. Since kernel launches are asynchronous, the call with i = 1 will be queued for execution immediately after the first kernel (with i = 0) is launched.
My question is this: at some point, as the first kernel (i = 0) winds down, only 14 SMs will be occupied, then only 13, then 12, then 11, and so on.
Will the kernel with i = 1 begin executing on the device as soon as one SM becomes available, or will the launch of this second kernel wait until all SMs have finished working on the first kernel (with i = 0)?
Suppose also that all launches are issued in the same CUDA stream.
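For context, here is a minimal self-contained version of the setup. The kernel name `myKernel`, the launch configuration, and the array `A` come from the snippet above; the kernel body and the allocation size are placeholder assumptions, since the question does not show them:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel body (assumption): each thread adds the
// loop index to one element of A, just to have some work per block.
__global__ void myKernel(float *A, int i) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    A[idx] += i;
}

int main() {
    const int nBlocks = 1000, nThreads = 256;
    float *A;
    cudaMalloc(&A, nBlocks * nThreads * sizeof(float));
    cudaMemset(A, 0, nBlocks * nThreads * sizeof(float));

    // All ten launches go into the default stream; each <<<...>>> call
    // returns immediately on the host, so the kernels are queued
    // back-to-back on the device side.
    for (int i = 0; i < 10; i++) {
        myKernel<<<nBlocks, nThreads>>>(A, i);
    }

    cudaDeviceSynchronize();  // wait for all queued kernels to finish
    cudaFree(A);
    return 0;
}
```
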