cudaMemcpy & blocking

I am confused by some comments I have seen about blocking and cudaMemcpy. As I understand it, Fermi hardware can execute kernels and perform a cudaMemcpy at the same time.

I read that the library function cudaMemcpy() is a blocking function. Does this mean that the function will block further execution until the copy has fully completed? Or does it mean that the copy will not start until the previous kernels have finished?

E.g., does this code provide the same blocking operation?

SomeCudaCall<<<25,34>>>(someData); cudaThreadSynchronize(); 

vs

 SomeCudaCall<<<25,34>>>(someParam); cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice); 
2 answers

Your examples are equivalent. If you want asynchronous execution, you can use streams or contexts and cudaMemcpyAsync so that you can overlap the execution with the copy.
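As a rough illustration of the stream approach, the sketch below launches a kernel in one stream and issues a cudaMemcpyAsync in another, so the hardware may overlap them. The kernel addOne, the helper runOverlapDemo, the CHECK macro, and all buffer names are hypothetical and chosen only for this example. It assumes pinned host memory (cudaMallocHost), which is generally needed for the copy to actually overlap; whether overlap happens also depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with 1 on any CUDA error (sketch only:
// resources are not released on the error path).
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical kernel, for illustration only: adds 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 if every call succeeded and the kernel result is as expected.
int runOverlapDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *hostA, *hostB, *devA, *devB;
    cudaStream_t computeStream, copyStream;

    // Pinned (page-locked) host buffers, so the async copy can overlap.
    CHECK(cudaMallocHost((void **)&hostA, bytes));
    CHECK(cudaMallocHost((void **)&hostB, bytes));
    CHECK(cudaMalloc((void **)&devA, bytes));
    CHECK(cudaMalloc((void **)&devB, bytes));
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaStreamCreate(&copyStream));

    for (int i = 0; i < n; ++i) { hostA[i] = 0; hostB[i] = i; }
    // Setup copy: plain cudaMemcpy, so the input is in place before the launch.
    CHECK(cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice));

    // Kernel in one stream, unrelated copy in another: these may run concurrently.
    addOne<<<(n + 255) / 256, 256, 0, computeStream>>>(devA, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(devB, hostB, bytes, cudaMemcpyHostToDevice, copyStream));

    // The host only waits here, when it explicitly synchronizes each stream.
    CHECK(cudaStreamSynchronize(computeStream));
    CHECK(cudaStreamSynchronize(copyStream));

    CHECK(cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost));
    int ok = (hostA[0] == 1 && hostA[n - 1] == 1);

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostA);
    cudaFreeHost(hostB);
    return ok ? 0 : 1;
}

int main() {
    int rc = runOverlapDemo();
    printf(rc == 0 ? "ok\n" : "failed\n");
    return rc;
}
```

The copy targets a buffer the kernel does not touch; if the kernel needed the copied data, both operations would have to go in the same stream (or be ordered with an event) instead.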


According to the NVIDIA Programming Guide:

In order to facilitate concurrent execution between host and device, some function calls are asynchronous: control is returned to the host thread before the device has completed the requested task. These are:

  • Kernel launches;
  • Memory copies between two addresses in the same device memory;
  • Memory copies from host to device of a memory block of 64 KB or less;
  • Memory copies performed by functions that are suffixed with Async;
  • Memory set function calls.

As long as your transfer size is greater than 64 KB, your examples are equivalent.
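If the host must not continue until both the kernel and the copy have finished, whatever the transfer size, one conservative option is to add an explicit synchronization after the copy. The sketch below is only an illustration under assumptions: the kernel someKernel, the helper runSyncDemo, and the buffer names are hypothetical, and cudaDeviceSynchronize is used as the newer replacement for the deprecated cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: writes each thread's global index.
__global__ void someKernel(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

// Returns 0 if every call succeeded and the copied value reads back correctly.
int runSyncDemo() {
    const int blocks = 25, threads = 34;   // launch shape from the question
    int *devData = nullptr, *devValue = nullptr;
    int hostValue = 42, readBack = 0;

    if (cudaMalloc((void **)&devData, blocks * threads * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&devValue, sizeof(int)) != cudaSuccess) return 1;

    // Kernel launch: returns to the host immediately.
    someKernel<<<blocks, threads>>>(devData);

    // Small host-to-device copy (far below 64 KB) in the default stream.
    // It is ordered after the kernel on the device.
    if (cudaMemcpy(devValue, &hostValue, sizeof(int), cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    // Explicit barrier: returns only once all previously issued device work is done.
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    if (cudaMemcpy(&readBack, devValue, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    cudaFree(devData);
    cudaFree(devValue);
    return readBack == hostValue ? 0 : 1;
}

int main() {
    int rc = runSyncDemo();
    printf(rc == 0 ? "ok\n" : "failed\n");
    return rc;
}
```

The extra synchronization costs little when the device is already idle and removes any dependence on the 64 KB threshold.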
