You have some problems in your code, for example:
int num = i + 2;
This gives thread 0 iteration 2, thread 1 iteration 3, and so on. The problem is that each thread picks its next candidate with num++, so thread 0 moves on to 3, which thread 1 has already tested. The result is redundant computation. I also think this problem is easier to handle with a single dimension (x) instead of two (x, y). Under that assumption, you need to change num++ to:
num += blockDim.x * gridDim.x;
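To see the redundancy in numbers, here is a small host-side simulation in plain C++. It is only a sketch and not part of the kernel: the function name visits and its parameters are made up for illustration, and stride stands in for blockDim.x * gridDim.x. It counts how many simulated threads end up testing the same candidate.

```cpp
// Host-side simulation (plain C++, not a kernel): thread t starts at t and
// advances by `stride`. Returns how many of the `total_threads` simulated
// threads end up testing `target`. Assumes stride > 0.
constexpr int visits(int target, int total_threads, int stride) {
    int count = 0;
    for (int t = 0; t < total_threads; ++t) {
        for (int num = t; num <= target; num += stride) {
            if (num == target) ++count;
        }
    }
    return count;
}

// With 4 threads in total (for example 2 blocks of 2 threads):
static_assert(visits(10, 4, 1) == 4, "num++: every thread tests 10");
static_assert(visits(10, 4, 4) == 1, "stride 4: exactly one thread tests 10");
```

With num++ all four threads test the number 10; with a stride equal to the total number of threads, exactly one does.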
Another problem is that the counter must be shared between threads. Otherwise, each thread will try to find "n" primes on its own, and every thread will fill the entire array. So you need to replace int counter = 0; with a shared or global variable. We will use a global one so that it is visible to all threads in all blocks, and we can store it in position zero of the device_primes array.
You must also initialize this value, and only one thread should do that. Let's give this task to the thread with id 0:
if (thread_id == 0) device_primes[0] = 1;
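Since this variable is global and will be written by all threads, every thread must see the counter set to 1 before writing to it (position 0 of device_primes holds the counter, so the primes start at position 1). Note that __syncthreads() only synchronizes the threads of a single block, so if you launch more than one block it is safer to initialize the counter from the host (for example with cudaMemcpy) before launching the kernel. Within one block, you should add a barrier after the initialization:
if (thread_id == 0) device_primes[0] = 1;
__syncthreads();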
So this is a possible solution (inefficient):
__global__ void getPrimes(int *device_primes, int n)
{
    int c = 0;
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int num = thread_id;

    // Position 0 holds the shared counter; primes are stored from position 1.
    if (thread_id == 0) device_primes[0] = 1;
    __syncthreads();

    while (device_primes[0] < n)
    {
        for (c = 2; c <= num - 1; c++)
        {
            if (num % c == 0) // not prime
            {
                break;
            }
        }

        if (c == num) // prime
        {
            int pos = atomicAdd(&device_primes[0], 1);
            if (pos < n) // another thread may have taken the last slot
                device_primes[pos] = num;
        }

        num += blockDim.x * gridDim.x; // Next number for this thread
    }
}
The call atomicAdd(&device_primes[0], 1); basically does device_primes[0]++; and returns the previous value. Since the counter is global, you must guarantee mutual exclusion, which is why I used this atomic operation. Note that you may need to compile with the -arch sm_20 flag.
Optimization: An approach with less or no synchronization would be preferable. You can also reduce the amount of computation by exploiting some properties of primes, as you can see at http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes .
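As one small example of reducing the computation, the inner loop can stop at the square root of num and skip even divisors. This is not the sieve from the link, just bounded trial division, and it is only a sketch: the names is_prime and HOST_DEVICE are my own. The macro is there so the same function compiles both under nvcc and with an ordinary C++ compiler.

```cpp
// HOST_DEVICE is a hypothetical helper macro: under nvcc it marks the
// function as callable from host and device code; elsewhere it is empty.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Trial division up to sqrt(num), testing odd divisors only.
HOST_DEVICE constexpr bool is_prime(int num) {
    if (num < 2) return false;           // 0, 1 and negatives are not prime
    if (num % 2 == 0) return num == 2;   // 2 is the only even prime
    // c <= num / c is the same bound as c * c <= num, without overflow.
    for (int c = 3; c <= num / c; c += 2) {
        if (num % c == 0) return false;
    }
    return true;
}
```

In the kernel, the for loop and the c == num test could then be replaced by if (is_prime(num)), which also handles 0 and 1 explicitly.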