CUDA parallel loop parallelization

Question

I am new to CUDA and am trying to parallelize the following code. Right now it runs inside a kernel but does not use threads at all, so it is slow. I tried to follow this answer, but I haven't understood it yet.

The kernel should generate the first n prime numbers and put them in the device_primes array, which is later read from the host. The code is correct and works fine in the production version, but I need to speed it up, possibly by using shared memory.

    //CUDA kernel code
    __global__ void generatePrimes(int* device_primes, int n)
    {
        //int i = blockIdx.x * blockDim.x + threadIdx.x;
        //int j = blockIdx.y * blockDim.y + threadIdx.y;
        int counter = 0;
        int c = 0;
        for (int num = 2; counter < n; num++)
        {
            for (c = 2; c <= num - 1; c++)
            {
                if (num % c == 0) //not prime
                {
                    break;
                }
            }
            if (c == num) //prime
            {
                device_primes[counter] = num;
                counter++;
            }
        }
    }

My current, preliminary, and definitely incorrect attempt to parallelize this is as follows:

    //CUDA kernel code
    __global__ void generatePrimes(int* device_primes, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int num = i + 2;
        int c = j + 2;
        int counter = 0;
        if ((counter >= n) || (c > num - 1))
        {
            return;
        }
        if (num % c == 0) //not prime
        {
        }
        if (c == num) //prime
        {
            device_primes[counter] = num;
            counter++;
        }
        num++;
        c++;
    }

But this code fills the array with data that doesn't make sense, and many of the values are zeros. Thanks in advance; any help is appreciated.

c++ for-loop nested gpu cuda

Nikita K Nov 04 '12 at 2:53

1 answer

dreamcrash · Accepted Answer · 2012-11-04T19:35:49+0000

You have some problems in your code, for example:

 int num = i + 2;

This gives thread 0 iteration 2, thread 1 iteration 3, and so on. The problem is that the next iteration each thread computes is based on num++;. This means that thread 0 will next compute iteration 3, which thread 1 has already done, so you will have redundant computation. Moreover, I think that for this problem it is easier to use only one dimension instead of two (x, y). Based on this assumption, you need to change num++ to:

 num += blockDim.x * gridDim.x;
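The effect of the stride can be checked on the host without a GPU. The following is a minimal C++ sketch, not part of the original answer: countVisits is a hypothetical helper, totalThreads stands in for blockDim.x * gridDim.x, and each simulated thread starts at threadId + 2 as in the question's attempt.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper: simulates which candidates each thread tests.
// totalThreads plays the role of blockDim.x * gridDim.x in the kernel.
// Returns, for every number in [0, limit), how many threads tested it.
std::vector<int> countVisits(int totalThreads, int stride, int limit) {
    assert(totalThreads > 0 && stride > 0 && limit >= 0);
    std::vector<int> visits(limit, 0);
    for (int threadId = 0; threadId < totalThreads; ++threadId) {
        // Each thread starts at threadId + 2, as in "int num = i + 2;".
        for (int num = threadId + 2; num < limit; num += stride) {
            ++visits[num];
        }
    }
    return visits;
}
```

With 4 threads and a limit of 10, a stride of 1 (the num++ version) tests the number 5 four times, while a stride of 4 tests every number from 2 to 9 exactly once.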

Another problem is that you did not take into account that the variable counter must be shared between threads. Otherwise, each thread will try to find n primes on its own, and all of them will fill the entire array. Therefore you need to change int counter = 0; to a shared or global variable. We will use a global variable so that it is visible to all threads in all blocks. We can use position zero of the device_primes array to store the counter.

In addition, you must initialize this value, and you should give this task to only one thread. Let's give it to the thread with id 0:

 if (thread_id == 0) device_primes[0] = 1;

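But since this variable is global and will be written by all threads, you must ensure that every thread, before writing to it, sees that the counter is 1 (position zero of device_primes holds the counter, and the primes start at position one). So you should also add a barrier afterwards (note that __syncthreads() only synchronizes the threads of a single block, so with several blocks it is safer to initialize the counter from the host before launching the kernel):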

    if (thread_id == 0) device_primes[0] = 1;
    __syncthreads();

So this is a possible solution (inefficient):

    __global__ void getPrimes(int *device_primes, int n)
    {
        int c = 0;
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
        int num = thread_id;

        if (thread_id == 0) device_primes[0] = 1;
        __syncthreads();

        while (device_primes[0] < n)
        {
            for (c = 2; c <= num - 1; c++)
            {
                if (num % c == 0) //not prime
                {
                    break;
                }
            }
            if (c == num) //prime
            {
                int pos = atomicAdd(&device_primes[0], 1);
                if (pos < n) // other threads may have filled the array meanwhile
                    device_primes[pos] = num;
            }
            num += blockDim.x * gridDim.x; // Next number for this thread
        }
    }

The line atomicAdd(&device_primes[0], 1); basically does device_primes[0]++; and returns the value the counter had before the increment. Since the counter is global, you must guarantee mutual exclusion, which is why I used this atomic operation. Note that you may need to compile with the -arch sm_20 flag.
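What the kernel relies on is that the atomic increment hands back the value the counter had before the increment, so every caller gets a distinct slot. Here is a minimal host-side sketch of that pattern, using std::atomic from standard C++ as a stand-in for CUDA's atomicAdd. appendValue is a hypothetical helper, and the bounds check is an added assumption that keeps late writers from running past the end of the array.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical helper: reserve the next free slot in `out` and store `value`.
// Returns the slot index, or -1 if the array was already full.
int appendValue(std::vector<int>& out, std::atomic<int>& counter, int value) {
    assert(counter.load() >= 0);    // the counter is only ever incremented
    // fetch_add returns the value held *before* the increment, the same
    // contract as atomicAdd(&device_primes[0], 1) in the kernel.
    int pos = counter.fetch_add(1);
    if (pos >= static_cast<int>(out.size())) {
        return -1;                  // full: skip the write instead of overrunning
    }
    out[pos] = value;
    return pos;
}
```

Starting the counter at 1 mirrors the layout used in the answer, where position zero is reserved for the counter itself.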

Optimization: Code-wise, an approach with less or no synchronization would be preferable. You can also reduce the number of computations by taking advantage of some properties of prime numbers, as you can see at http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes .
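As a rough illustration of the second point, here is a host-side C++ sketch of the Sieve of Eratosthenes that returns the first n primes. It is not a GPU implementation. firstPrimes and upperBoundForNthPrime are hypothetical names, and the sieve size relies on the known bound p_n < n (ln n + ln ln n) for n >= 6. It assumes n is small enough that this bound fits in an int.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Upper bound on the n-th prime. For n >= 6, p_n < n (ln n + ln ln n);
// for smaller n a fixed bound of 12 covers the first five primes (2..11).
int upperBoundForNthPrime(int n) {
    if (n < 6) return 12;
    double x = static_cast<double>(n);
    return static_cast<int>(std::ceil(x * (std::log(x) + std::log(std::log(x)))));
}

// Returns the first n primes in increasing order.
std::vector<int> firstPrimes(int n) {
    assert(n >= 0);
    std::vector<int> primes;
    if (n == 0) return primes;
    int limit = upperBoundForNthPrime(n);
    std::vector<bool> composite(limit + 1, false);
    for (int p = 2; p <= limit && static_cast<int>(primes.size()) < n; ++p) {
        if (composite[p]) continue;   // already crossed out by a smaller prime
        primes.push_back(p);
        // Cross out multiples of p, starting at p * p.
        for (long long m = static_cast<long long>(p) * p; m <= limit; m += p) {
            composite[m] = true;
        }
    }
    return primes;
}
```

Unlike trial division, the sieve never performs a modulo test: each composite is crossed out by its prime factors, and the result comes out already sorted.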
