Pointers in structures passed to CUDA

I've been struggling with this for a while, but I don't seem to understand it. I am trying to copy objects containing arrays into the memory of a CUDA device (and vice versa, but I'll cross that bridge when I get to it):

    struct MyData {
        float *data;
        int dataLen;
    };

    __global__ static void MyFunc(MyData *data);

    void copyToGPU() {
        // Create dummy objects to copy
        int N = 10;
        MyData *h_items = new MyData[N];
        for (int i = 0; i < N; i++) {
            h_items[i].dataLen = 100;
            h_items[i].data = new float[100];
        }

        // Copy objects to GPU
        MyData *d_items;
        int memSize = N * sizeof(MyData);
        cudaMalloc((void**)&d_items, memSize);
        cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);

        // Run the kernel
        MyFunc<<<100,100>>>(d_items);
    }

    __global__ static void MyFunc(MyData *data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = 0; i < data[idx].dataLen; i++) {
            // Do something with data[idx].data[i]
        }
    }

When I call MyFunc(d_items), I can access data[idx].dataLen just fine. However, the data pointed to by data[idx].data has not been copied.

I cannot use d_items[i].data in copyToGPU as the destination for the cudaMalloc / cudaMemcpy calls, because the host code cannot dereference a device pointer.

What to do?

+4
2 answers
  • Allocate the device data for all of the structures as a single contiguous array.
  • Copy the contiguous data from the host to the GPU.
  • Fix up the device pointers.

Example:

    float *d_data;
    cudaMalloc((void**)&d_data, N * 100 * sizeof(float));
    for (int i = 0; i < N; i++) {
        h_items[i].data = d_data + i * 100;
    }
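Putting the three steps together, a fuller sketch might look like the following. It reuses MyData, N, and h_items from the question; the flattened h_flat staging buffer, the helper name copyToGPUContiguous, and the fixed length of 100 are assumptions made for illustration.

    #include <cstring>            // for memcpy
    #include <cuda_runtime.h>

    // Assumes MyData { float *data; int dataLen; } as defined in the question.
    void copyToGPUContiguous(MyData *h_items, int N) {
        const int LEN = 100;      // every item holds 100 floats in the question

        // 1. One contiguous device allocation for all per-item data.
        float *d_data;
        cudaMalloc((void**)&d_data, N * LEN * sizeof(float));

        // 2. Flatten the host data and copy it in a single transfer.
        float *h_flat = new float[N * LEN];
        for (int i = 0; i < N; i++)
            memcpy(h_flat + i * LEN, h_items[i].data, LEN * sizeof(float));
        cudaMemcpy(d_data, h_flat, N * LEN * sizeof(float), cudaMemcpyHostToDevice);
        delete[] h_flat;

        // 3. Fix up the pointers so each struct points into the device buffer,
        //    then copy the array of structs itself.
        for (int i = 0; i < N; i++)
            h_items[i].data = d_data + i * LEN;

        MyData *d_items;
        cudaMalloc((void**)&d_items, N * sizeof(MyData));
        cudaMemcpy(d_items, h_items, N * sizeof(MyData), cudaMemcpyHostToDevice);
        // h_items now holds device pointers; keep the original host pointers
        // elsewhere if you still need to read or free them on the host.
    }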
+3

The code you provided copies only the MyData structures themselves: a host address and an integer. To be overly clear, you are copying the pointer, not the data it points to - you need to copy the data explicitly.

If the data always has the same LENGTH, then you probably just want one large array:

    float *d_data;
    memSize = N * LENGTH * sizeof(float);
    cudaMalloc((void**) &d_data, memSize);
    // and a single copy
    cudaMemcpy(d_data, h_data, memSize, cudaMemcpyHostToDevice);
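For reference, a kernel for this flat layout indexes by a fixed stride. A minimal sketch, where the kernel name, the bounds-check parameter N, and the launch configuration are placeholders, and LENGTH is assumed to be a compile-time constant:

    __global__ void ProcessFlat(float *data, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N) return;                 // guard against surplus threads
        for (int i = 0; i < LENGTH; i++) {
            // item idx occupies elements [idx*LENGTH, (idx+1)*LENGTH)
            // ... do something with data[idx * LENGTH + i] ...
        }
    }

A launch such as ProcessFlat<<<(N + 127) / 128, 128>>>(d_data, N); then covers all N items.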

If it should be in a structure with other data, then:

    struct MyData {
        float data[LENGTH];
        int other_data;
    };

    MyData *d_items;
    memSize = N * sizeof(MyData);
    cudaMalloc((void**) &d_items, memSize);
    // and again a single copy
    cudaMemcpy(d_items, h_items, memSize, cudaMemcpyHostToDevice);
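Because the array is now stored inline in the struct rather than behind a pointer, that single cudaMemcpy really does move everything, and no pointer fix-up is needed on the device side. A sketch of the kernel under that assumption (the kernel name and the guard on N are additions for illustration):

    __global__ void MyFuncInline(MyData *items, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N) return;                 // guard against surplus threads
        for (int i = 0; i < LENGTH; i++) {
            // items[idx].data is embedded in the struct, so the single
            // cudaMemcpy above already placed it in device memory
            // ... do something with items[idx].data[i] ...
        }
    }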

But I assume your data comes in many different lengths. One option is to set LENGTH to the maximum (and just waste some space) and then proceed as described above. This may be the easiest way to get started, and you can optimize later.

If you cannot afford the wasted memory and transfer time, then I would use three arrays: one holding all the data, one holding the offsets, and one holding the lengths, on both the host and the device:

    // host memory
    float *h_data;
    int h_offsets[N], h_lengths[N]; // or allocate these dynamically if necessary
    int totalLength;

    // device memory
    float *d_data;
    int *d_offsets, *d_lengths;

    /* calculate totalLength, allocate h_data, and fill the three arrays */

    // allocate device memory
    cudaMalloc((void**) &d_data, totalLength * sizeof(float));
    cudaMalloc((void**) &d_offsets, N * sizeof(int));
    cudaMalloc((void**) &d_lengths, N * sizeof(int));

    // and now three copies
    cudaMemcpy(d_data, h_data, totalLength * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lengths, h_lengths, N * sizeof(int), cudaMemcpyHostToDevice);

Now in thread i you can find that thread's data starting at d_data[d_offsets[i]], with length d_lengths[i].
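To make that concrete, a kernel for this ragged layout might look like the sketch below. The kernel name and launch configuration are placeholders, and N is passed in so surplus threads can bail out:

    __global__ void MyFuncRagged(float *data, int *offsets, int *lengths, int N) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N) return;                    // guard against surplus threads
        float *myData = data + offsets[idx];     // this thread's slice of d_data
        int myLen = lengths[idx];
        for (int i = 0; i < myLen; i++) {
            // ... do something with myData[i] ...
        }
    }

A launch such as MyFuncRagged<<<(N + 127) / 128, 128>>>(d_data, d_offsets, d_lengths, N); would then cover all N items.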

+2

Source: https://habr.com/ru/post/1316275/

