Allocating two arrays with a single cudaMalloc call

Memory allocation is one of the most time-consuming operations on the GPU, so I wanted to allocate two arrays by calling cudaMalloc only once, using the following code:

 int numElements = 50000;
 size_t size = numElements * sizeof(float);
 // declarations-initializations
 float *d_M = NULL;
 err = cudaMalloc((void **)&d_M, 2*size);
 // error checking

 // Allocate the device input vector A
 float *d_A = d_M;
 // Allocate the device input vector B
 float *d_B = d_M + size;

 err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
 // error checking
 err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
 // error checking

The source code is vectorAdd.cu from the samples folder of the CUDA toolkit, so you can assume that h_A and h_B are initialized correctly and that the code works without the change I made.
As a result, the second cudaMemcpy returned an error with the message invalid argument.

It seems that the operation d_M + size does not return what one would expect, perhaps because device memory behaves differently, but I do not know how to fix it.

Is it possible to make my approach work (calling cudaMalloc once to allocate memory for two arrays)? Any comments/answers on whether this is a good approach are also welcome.

UPDATE
As Robert and dreamcrash suggested in their answers, I added the number of elements (numElements) to the d_M pointer instead of the size, which is the number of bytes. For reference, there was no observable speedup.
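
For reference, a minimal sketch of the adjusted allocation the update describes (error checking omitted for brevity; variable names follow the original snippet):

 int numElements = 50000;
 size_t size = numElements * sizeof(float);

 float *d_M = NULL;
 err = cudaMalloc((void **)&d_M, 2 * size);   // one allocation holding both vectors

 float *d_A = d_M;                // first half of the block
 float *d_B = d_M + numElements;  // second half: offset in elements, not bytes

 err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
 err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);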

1 answer

You just need to replace this:

 float *d_B = d_M + size; 

with this:

 float *d_B = d_M + numElements; 

This is pointer arithmetic. If you have an array of floats R = [1.0, 1.2, 3.3, 3.4], you can print the first position by doing printf("%f", *R);. And if you want to print the second position?

You just need to do printf("%f\n", *(R + 1));, that is, R + 1. You do not do R + sizeof(float), as you did.

When you do R + sizeof(float), you access the element at position R[4], since sizeof(float) = 4.
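
A small self-contained host-side sketch of the same idea (the array R and its values are only illustrative):

 #include <stdio.h>

 int main(void)
 {
     float R[] = {1.0f, 1.2f, 3.3f, 3.4f};

     printf("%f\n", *R);        // first element, R[0]
     printf("%f\n", *(R + 1));  // second element, R[1]: the offset is in elements
     // R + sizeof(float) would point at R[4], one past the last element
     return 0;
 }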

When you do float *d_B = d_M + numElements;, the compiler assumes the array is laid out contiguously in memory and that each element has the size of a float. So you do not need to specify the distance in bytes; you just specify it in terms of elements, and the compiler does the math for you. This approach is also more human-friendly, since it is more intuitive to express pointer arithmetic in terms of elements rather than bytes.


You said that the second cudaMemcpy returned an error with the message invalid argument:

If you print the number corresponding to this error, it prints 11, and if you check the CUDA API you will see that this error corresponds to:

cudaErrorInvalidValue

This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.

In your example it means that float *d_B = d_M + size; goes out of range.

You allocated space for 100000 floats: d_A goes from element 0 to 50000, but according to your code d_B starts at numElements * sizeof(float) = 50000 * 4 = 200000, and since 200000 > 100000 you get the invalid argument error.
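
To make the difference concrete, here is a hedged fragment continuing the variable names from the question's snippet (d_M, numElements, size, h_B, err); cudaGetErrorString is the standard CUDA runtime call for turning the numeric error code into a readable message:

 float *d_B_wrong = d_M + size;        // 200000 floats past d_M: beyond the 100000-float allocation
 float *d_B_right = d_M + numElements; // 50000 floats past d_M: the start of the second half

 err = cudaMemcpy(d_B_right, h_B, size, cudaMemcpyHostToDevice);
 if (err != cudaSuccess)
     fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));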
