CUDA Kernel - Nested Loop

Question


Hi, I am trying to write a CUDA kernel that executes the following code snippet:

    for (n = 0; n < (total-1); n++) {
        a = values[n];
        for (i = n+1; i < total; i++) {
            b = values[i] - a;
            c = b*b;
            if (c < 10)
                newvalues[i] = c;
        }
    }

This is what I have now, but it does not seem to give the right results. Does anyone know what I'm doing wrong? Greetings

    __global__ void calc(int total, float *values, float *newvalues)
    {
        float a, b, c;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x) {
            a = values[n];
            for (int i = n+1; i < total; i++) {
                b = values[i] - a;
                c = b*b;
                if (c < 10)
                    newvalues[i] = c;
            }
        }
    }

+6

cuda

Roger Mar 15 '11 at 0:11


3 answers

I'm probably mistaken, but the n < (total-1) check in

 for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)

seems to be different from the original version.

0

ypercubeᵀᴹ Mar 15 '11 at 0:25


Why don't you just delete the outer loop and launch the kernel with as many threads as that loop has iterations? It's a little unusual to have a loop whose start and stride depend on the block index; usually you try to avoid such loops. Secondly, it seems to me that newvalues[i] can be overwritten by different threads, so the value that survives depends on which thread happens to write last.
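To illustrate the overwrite concern, here is a rough sketch of one possible way around it. Note that it is not the "one thread per n" layout suggested above: it assigns one thread per output index i instead, so each newvalues[i] has a single writer. The names calc_per_output and run_calc_per_output are made up for this example, and the host helper assumes plain host arrays of length total.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per output index i.
// Each thread scans n = 0 .. i-1 itself, so newvalues[i] has exactly one writer.
__global__ void calc_per_output(int total, const float *values, float *newvalues)
{
    // Grid-stride loop; i starts at 1 because the serial code never writes newvalues[0].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
         i < total;
         i += blockDim.x * gridDim.x) {
        float vi = values[i];
        float last = 0.0f;
        bool found = false;
        for (int n = 0; n < i; n++) {
            float b = vi - values[n];
            float c = b * b;
            if (c < 10.0f) {
                last = c;      // a later n replaces an earlier one, as in the serial loop
                found = true;
            }
        }
        if (found)
            newvalues[i] = last;
    }
}

// Hypothetical host helper: copy in, launch, copy back.
// h_newvalues is uploaded first so entries the kernel never writes keep their contents.
cudaError_t run_calc_per_output(int total, const float *h_values, float *h_newvalues)
{
    if (total <= 0) return cudaSuccess;
    size_t bytes = total * sizeof(float);
    float *d_values = nullptr, *d_newvalues = nullptr;

    cudaError_t err = cudaMalloc(&d_values, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_newvalues, bytes);
    if (err != cudaSuccess) { cudaFree(d_values); return err; }

    cudaMemcpy(d_values, h_values, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_newvalues, h_newvalues, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    calc_per_output<<<blocks, threads>>>(total, d_values, d_newvalues);

    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_newvalues, d_newvalues, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_values);
    cudaFree(d_newvalues);
    return err;
}
```

Because every thread walks n in increasing order, the value it stores is the one the serial loop would have left behind. The trade-off is uneven work: threads with a large i do far more iterations than threads with a small i.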

0

moggi Mar 15 '11 at 1:07


jwdmsd · Accepted Answer · 2011-03-15T07:07:58+0000

Treat the problem as 2D and launch your kernel with two-dimensional blocks. The total number of threads in each of the x and y dimensions should be at least total. The kernel code should look like this:

    __global__ void calc(float *values, float *newvalues, int total)
    {
        float a, b, c;
        int n = blockIdx.y * blockDim.y + threadIdx.y;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= total || i >= total) return;
        a = values[n];
        b = values[i] - a;
        c = b*b;
        if (c < 10)
            newvalues[i] = c;
        // I don't know your problem statement, but I think it should be:
        // newvalues[n*total+i] = c;
    }

Update:

This is how you should call the kernel:

    dim3 block(16, 16);
    dim3 grid((total+15)/16, (total+15)/16);
    // val and newval are device pointers, T is the number of elements (total)
    calc<<<grid, block>>>(val, newval, T);

Also make sure you add this line to the kernel (see updated kernel)

 if (n>=total || i>=total) return;
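For completeness, here is a minimal end-to-end sketch of the 2D approach, under a few assumptions of mine: it uses the newvalues[n*total+i] indexing floated in the kernel comment (so the output is a total x total array), and it adds an i <= n check because the loop in the question starts at i = n+1. The names calc_pairs and run_calc_pairs are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical 2D kernel: thread (i, n) handles one pair and writes its own
// slot n*total + i, so no two threads share an output element.
__global__ void calc_pairs(const float *values, float *pairs, int total)
{
    int n = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Skip padding threads and pairs the serial loop never visits (i <= n).
    if (n >= total || i >= total || i <= n) return;

    float b = values[i] - values[n];
    float c = b * b;
    if (c < 10.0f)
        pairs[(size_t)n * total + i] = c;   // size_t avoids int overflow for large total
}

// Hypothetical host helper. h_pairs must hold total*total floats; it is uploaded
// first so slots the kernel never writes keep their contents.
cudaError_t run_calc_pairs(const float *h_values, float *h_pairs, int total)
{
    if (total <= 0) return cudaSuccess;
    size_t in_bytes  = total * sizeof(float);
    size_t out_bytes = (size_t)total * total * sizeof(float);
    float *d_values = nullptr, *d_pairs = nullptr;

    cudaError_t err = cudaMalloc(&d_values, in_bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_pairs, out_bytes);
    if (err != cudaSuccess) { cudaFree(d_values); return err; }

    cudaMemcpy(d_values, h_values, in_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pairs, h_pairs, out_bytes, cudaMemcpyHostToDevice);

    // Round the grid up so at least total threads exist in x and in y.
    dim3 block(16, 16);
    dim3 grid((total + 15) / 16, (total + 15) / 16);
    calc_pairs<<<grid, block>>>(d_values, d_pairs, total);

    err = cudaGetLastError();               // catches launch-configuration errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_pairs, d_pairs, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_values);
    cudaFree(d_pairs);
    return err;
}
```

The price of this layout is memory: the output grows as total squared, and roughly half of the launched threads exit immediately at the i <= n check. Whether that is acceptable depends on what the result is used for afterwards.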
