CUDA: allocating an array of structures within a structure

I have these structures:

    typedef struct neuron {
        float *weights;
        int    n_weights;
    } Neuron;

    typedef struct neurallayer {
        Neuron *neurons;
        int     n_neurons;
        int     act_function;
    } NLayer;

An NLayer structure may contain an arbitrary number of Neuron structures.

I tried to allocate an NLayer structure with 5 Neurons from the host like this:

    NLayer *nL;
    int i;
    int tmp = 9;
    cudaMalloc((void**)&nL, sizeof(NLayer));
    cudaMalloc((void**)&nL->neurons, 6*sizeof(Neuron));
    for (i = 0; i < 5; i++)
        cudaMemcpy(&nL->neurons[i].n_weights, &tmp, sizeof(int), cudaMemcpyHostToDevice);

... Then I tried to change the value of nL->neurons[0].n_weights with this kernel:

    __global__ void test(NLayer *n)
    {
        n->neurons[0].n_weights = 121;
    }

but at compile time, nvcc emits this warning for that single kernel line:

 Warning: Cannot tell what pointer points to, assuming global memory space 

and after the kernel finishes its work, the structure becomes inaccessible.

It is very likely that I am doing something wrong during the allocation. Can someone help me? Thank you very much, and sorry for my English! :)

UPDATE:

Thanks to aland, I changed my code by creating this function, which should allocate an instance of the NLayer structure:

    NLayer* setNLayer(int numNeurons, int weightsPerNeuron, int act_fun)
    {
        int i;
        NLayer h_layer;
        NLayer *d_layer;
        float *d_weights;

        // SET THE LAYER VARIABLES OF THE HOST NLAYER
        h_layer.act_function = act_fun;
        h_layer.n_neurons = numNeurons;

        // ALLOCATE THE DEVICE NLAYER
        if (cudaMalloc((void**)&d_layer, sizeof(NLayer)) != cudaSuccess)
            puts("ERROR: Unable to allocate the Layer");

        // ALLOCATE THE NEURONS ON THE DEVICE
        if (cudaMalloc((void**)&h_layer.neurons, numNeurons*sizeof(Neuron)) != cudaSuccess)
            puts("ERROR: Unable to allocate the Neurons of the Layer");

        // COPY THE HOST NLAYER TO THE DEVICE
        if (cudaMemcpy(d_layer, &h_layer, sizeof(NLayer), cudaMemcpyHostToDevice) != cudaSuccess)
            puts("ERROR: Unable to copy the layer data onto the device");

        for (i = 0; i < numNeurons; i++) {
            // ALLOCATE THE WEIGHTS ARRAY ON THE DEVICE
            cudaMalloc((void**)&d_weights, weightsPerNeuron*sizeof(float));

            // COPY ITS POINTER INTO THE i-TH NEURON STRUCT
            if (cudaMemcpy(&d_layer->neurons[i].weights, &d_weights, sizeof(float*), cudaMemcpyHostToDevice) != cudaSuccess)
                puts("Error: unable to copy weights' pointer to the device");
        }

        // RETURN THE DEVICE POINTER
        return d_layer;
    }

and I call this function from main like this (the kernel test is the one shown earlier):

    int main()
    {
        NLayer *nL;
        int h_tmp1;
        float h_tmp2;

        nL = setNLayer(10, 12, 13);
        test<<<1,1>>>(nL);
        if (cudaMemcpy(&h_tmp1, &nL->neurons[0].n_weights, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess);
            puts("ERROR!!");
        printf("RESULT:%d", h_tmp1);
    }

When I compile this code, the compiler shows me a warning, and when I run the program, it displays:

    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    Error: unable to copy weights' pointer to the device
    ERROR!!
    RESULT:1

The last error does not appear if I comment out the kernel call.

Where am I going wrong? I don't know how to proceed. Thanks for your help!

2 answers

The problem is here:

    cudaMalloc((void**)&nL, sizeof(NLayer));
    cudaMalloc((void**)&nL->neurons, 6*sizeof(Neuron));

After the first line, nL points to a structure in global memory on the device. Therefore, in the second line, the first argument to cudaMalloc is an address that lives on the GPU, and dereferencing it on the host is undefined behavior (on my test system it segfaults, but in your case the failure is something more subtle).

The right way to do what you need is to first create the structure in host memory, fill it with data, and then copy it to the device, for example:

    NLayer *nL;
    NLayer h_nL;
    int i;
    int tmp = 9;

    // Allocate data on the device
    cudaMalloc((void**)&nL, sizeof(NLayer));
    cudaMalloc((void**)&h_nL.neurons, 6*sizeof(Neuron));

    // Copy the NLayer, with its device pointers, to the device
    cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice);

Also, be sure to always check for errors in CUDA procedures.
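To make that checking less repetitive, one common pattern is a small wrapper macro. This is a hedged sketch: the name CUDA_CHECK and the message format are my own choices, not from the original post.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wraps any CUDA runtime call, reports the file/line and the
// human-readable error string, and aborts on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage with the snippet above:
//   CUDA_CHECK(cudaMalloc((void**)&nL, sizeof(NLayer)));
//   CUDA_CHECK(cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice));
```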

UPDATE

In the second version of your code:

cudaMemcpy(&d_layer->neurons[i].weights,&d_weights,...) --- again, you are dereferencing a device pointer ( d_layer ) on the host. You should use instead

    cudaMemcpy(&h_layer.neurons[i].weights, &d_weights, sizeof(float*), cudaMemcpyHostToDevice);

Here you take h_layer (a host structure) and read its member ( h_layer.neurons ), which is a pointer to device memory. Then you do some pointer arithmetic on it ( &h_layer.neurons[i].weights ). Computing this address requires no access to device memory at all.
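Putting the fix into the copy loop of setNLayer, it would look roughly like this (a sketch reusing the post's own variable names; error handling abbreviated):

```cuda
for (i = 0; i < numNeurons; i++) {
    // d_weights is a host variable that receives a device address.
    cudaMalloc((void**)&d_weights, weightsPerNeuron*sizeof(float));

    // h_layer.neurons is a device address stored in a host structure, so
    // &h_layer.neurons[i].weights is plain pointer arithmetic done on the
    // host: it computes a device address without dereferencing device memory.
    cudaMemcpy(&h_layer.neurons[i].weights, &d_weights,
               sizeof(float*), cudaMemcpyHostToDevice);
}
```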


It all depends on the GPU you are using. Fermi cards use unified addressing for the shared and global memory spaces, while pre-Fermi cards do not.

On pre-Fermi hardware, an address alone does not tell you whether it refers to shared or global memory. The compiler can usually figure it out, but there are cases where it cannot. When a pointer into shared memory is needed, you typically take the address of a __shared__ variable, and the compiler can recognize that. The "assuming global" warning appears when the pointer's memory space cannot be determined.
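For illustration, this is the kind of pattern that can trigger the warning when targeting pre-Fermi hardware: a __device__ function receives a raw pointer that may refer to either shared or global memory, so the compiler has to guess. This is a hypothetical example, not code from the question.

```cuda
__device__ float first(float *p)   // p could point to shared OR global memory
{
    return p[0];                   // pre-Fermi: compiler "assumes global"
}

__global__ void kernel(float *g_data)
{
    __shared__ float s_data[32];
    s_data[threadIdx.x] = g_data[threadIdx.x];
    __syncthreads();

    // The same function is called with two different address spaces,
    // so the compiler cannot resolve p's memory space statically:
    g_data[threadIdx.x] = first(s_data) + first(g_data);
}
```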

If you are using a GPU with compute capability 2.x or higher, it should work when compiled with the flag -arch=sm_20.
