You just need to replace this:
float *d_B = d_M + size;
with this:
float *d_B = d_M + numElements;
This is pointer arithmetic. If you have an array of floats R = [1.0, 1.2, 3.3, 3.4], you can print the first element with printf("%f", *R);. And if you want to print the second one?
You just need printf("%f\n", *(R + 1));, that is, the address of R[0] plus 1. You do not add sizeof(float) to the address, as you did.
If you write R + sizeof(float), you refer to the element at position R[4] (already past the end of this four-element array), since sizeof(float) is 4 on typical platforms.
When you write float *d_B = d_M + numElements;, the compiler knows that d_M points to floats stored contiguously in memory, each one the size of a float. Therefore you do not give the distance in bytes, only in elements, and the compiler does the math for you. This makes the code easier to write, because pointer arithmetic is expressed more intuitively in elements than in bytes.
You said that the second cudaMemcpy returned an error with an "invalid argument" message.
If you print the number corresponding to this error, you get 11, and if you check the CUDA API documentation you will see that this error corresponds to:
cudaErrorInvalidValue
This means that one or more of the parameters passed to the API call is not within an acceptable range of values.
In your example, this means that the pointer computed by float *d_B = d_M + size; is out of range.
You allocated space for 100000 floats. d_A covers elements 0 to 49999, but according to your code d_B starts at element numElements * sizeof(float) = 50000 * 4 = 200000. Since 200000 > 100000, you get an invalid argument error.