I am wondering whether cudaMemcpy is executed on the CPU or on the GPU when we copy from the host to the device.
In other words, is the copy a sequential process, or does it run in parallel?
Let me explain why I ask: I have an array of 5 million elements on the host, and I want to copy 2 sets of 50,000 elements from different parts of that array to the device. I see two options: first gather all the elements I want to copy into one contiguous array on the CPU and then make a single large transfer, or simply call cudaMemcpy twice, once for each set.
If cudaMemcpy runs in parallel, then I think the second approach will be faster, since there is no need to first copy 100,000 elements serially on the CPU.
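To make the two options concrete, here is a rough sketch of what I have in mind. The element type (float), the two offsets, and the function names copyPacked and copyTwoCalls are placeholders I made up for illustration; only cudaMemcpy itself is the real API call. It is not a benchmark, just the two code paths I am comparing.

```cuda
#include <cstddef>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes and offsets, chosen only for illustration.
constexpr size_t kTotal   = 5000000;  // elements in the full host array
constexpr size_t kChunk   = 50000;    // elements in each set to copy
constexpr size_t kOffsetA = 100000;   // assumed start of the first set
constexpr size_t kOffsetB = 3000000;  // assumed start of the second set

// Option 1: gather both sets into one contiguous host buffer,
// then issue a single host-to-device transfer.
// d_dst must have room for 2 * kChunk floats.
cudaError_t copyPacked(const float* h_src, float* d_dst) {
    std::vector<float> staging(2 * kChunk);
    std::memcpy(staging.data(), h_src + kOffsetA, kChunk * sizeof(float));
    std::memcpy(staging.data() + kChunk, h_src + kOffsetB, kChunk * sizeof(float));
    return cudaMemcpy(d_dst, staging.data(), 2 * kChunk * sizeof(float),
                      cudaMemcpyHostToDevice);
}

// Option 2: issue two host-to-device transfers directly from the
// original array, one per set, with no staging copy on the CPU.
// d_dst must have room for 2 * kChunk floats.
cudaError_t copyTwoCalls(const float* h_src, float* d_dst) {
    cudaError_t err = cudaMemcpy(d_dst, h_src + kOffsetA, kChunk * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    return cudaMemcpy(d_dst + kChunk, h_src + kOffsetB, kChunk * sizeof(float),
                      cudaMemcpyHostToDevice);
}
```

Both functions leave the same 100,000 elements in the device buffer; the only difference is whether the gathering happens on the CPU before a single transfer or is replaced by a second transfer.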