Is cudaMemcpy from host to device running in parallel?

I am wondering whether cudaMemcpy runs on the CPU or on the GPU when we copy from the host to the device.

In other words, is the copy a sequential process, or does it run in parallel?

Let me explain why I ask: I have an array of 5 million elements. I want to copy 2 sets of 50,000 elements from different parts of that array to the device. So I wondered which would be faster: first packing all the elements I want to copy into one large contiguous array on the host and making a single large transfer, or simply calling cudaMemcpy twice, once for each set.

If cudaMemcpy runs in parallel, then I think the second approach would be faster, since you would not need to first copy the 100,000 elements sequentially into a contiguous host buffer.
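For concreteness, the two approaches being compared can be sketched as follows (a minimal sketch with illustrative names; `h_src`, `off1`, `off2`, and `d_dst` are assumptions, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <cstring>

const size_t N = 50000;  // elements per set

// Approach 1: pack both ranges into one contiguous host buffer,
// then make a single large host-to-device transfer.
void approach_one_large(const float* h_src, size_t off1, size_t off2,
                        float* d_dst) {
    float* h_packed = new float[2 * N];
    std::memcpy(h_packed,     h_src + off1, N * sizeof(float));
    std::memcpy(h_packed + N, h_src + off2, N * sizeof(float));
    cudaMemcpy(d_dst, h_packed, 2 * N * sizeof(float),
               cudaMemcpyHostToDevice);
    delete[] h_packed;
}

// Approach 2: two separate transfers straight from the original array,
// skipping the host-side packing step.
void approach_two_copies(const float* h_src, size_t off1, size_t off2,
                         float* d_dst) {
    cudaMemcpy(d_dst,     h_src + off1, N * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_dst + N, h_src + off2, N * sizeof(float),
               cudaMemcpyHostToDevice);
}
```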


"I am wondering whether cudaMemcpy runs on the CPU or on the GPU when we copy from the host to the device?"

In the case of a synchronous API call with ordinary pageable, user-allocated memory, the answer is: it runs on both. The driver must first copy the data from the source memory into a DMA-mapped staging buffer on the host, then signal the GPU that data is waiting to be transferred. The GPU then executes the transfer. This process is repeated as many times as necessary to complete the copy from the source memory to the GPU.

The throughput of this process can be improved by using pinned (page-locked) host memory, which the driver can DMA to or from directly, without the intermediate staging copy.
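A minimal sketch of allocating pinned host memory with `cudaMallocHost` so the copy can DMA directly from the user buffer (the buffer size and names here are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 100000;
    float *h_pinned, *d_buf;

    // Page-locked host allocation: the driver can DMA from it directly,
    // with no intermediate staging buffer.
    cudaMallocHost(&h_pinned, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    // ... fill h_pinned with data ...

    // This copy transfers straight from h_pinned to the device.
    cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```

Note that pinned memory is a limited resource; page-locking very large allocations can degrade overall system performance.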

As for the two approaches: each cudaMemcpy call carries a fixed setup latency, so a single larger contiguous transfer is usually at least as fast as two smaller ones. Whether the packed-buffer approach wins overall depends on whether the cost of assembling the contiguous host buffer outweighs the per-call overhead saved; for transfers this small, the difference is likely minor, so it is worth measuring both.
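One way to settle the question empirically is to time each variant with CUDA events (a sketch under the assumption that `d` and `h` are valid device and host pointers of at least `bytes` bytes; `time_copy` is an illustrative helper, not a library function):

```cuda
#include <cuda_runtime.h>

// Time `calls` equal-sized host-to-device copies totalling `bytes` bytes,
// returning the elapsed time in milliseconds.
float time_copy(float* d, const float* h, size_t bytes, int calls) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < calls; ++i)
        cudaMemcpy(d, h, bytes / calls, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the copies have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Comparing `time_copy(d, h, total, 1)` against `time_copy(d, h, total, 2)` (plus the host-side packing time for the first approach) shows which strategy wins for your sizes.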


