I wrote a naive prime-generating kernel. This code takes about 5.25 seconds to generate 10,000 primes (device_primes[0] holds the number of primes found so far; the remaining positions hold the primes themselves).
    __global__ void getPrimes(int *device_primes, int n)
    {
        int c = 0;
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
        int num = thread_id + 2;

        if (thread_id == 0) device_primes[0] = 1;
        __syncthreads();

        while (device_primes[0] < n)
        {
            for (c = 2; c <= num - 1; c++)
            {
                if (num % c == 0) // not prime
                {
                    break;
                }
            }

            if (c == num) // prime
            {
                int pos = atomicAdd(&device_primes[0], 1);
                device_primes[pos] = num;
            }

            num += blockDim.x * gridDim.x; // Next number for this thread
        }
    }
I was just starting to optimize the code and made the following modification. Instead of:
    for (c = 2; c <= num - 1; c++)
    {
        if (num % c == 0) // not prime
            break;
    }
    if (c == num) {...}
Now I have:
    int prime = 1;
    ...
    for (c = 2; c <= num - 1 && prime; c++)
    {
        if (num % c == 0) prime = 0; // not prime
    }
    if (prime) {...} // if prime
Now I can generate 10,000 primes in 0.707 s. I am wondering why such a simple modification gives such a large speedup. What was wrong with the original version?
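To rule out the possibility that the two loops simply compute different things, here is a minimal host-side sketch that runs both primality tests on the CPU and checks that they agree. It is plain C++ that could sit in the same .cu file. The names isPrimeBreak, isPrimeFlag, versionsAgree and firstPrimes are hypothetical helpers introduced for this check and are not part of the kernel above. Candidates start at 2, as in the kernel; the two forms are not meant to be compared for smaller inputs.

```cpp
#include <cassert>
#include <vector>

// Version 1 from the question: leave the loop with `break`,
// then decide primality by checking whether c reached num.
bool isPrimeBreak(int num) {
    int c;
    for (c = 2; c <= num - 1; c++) {
        if (num % c == 0) break;  // not prime
    }
    return c == num;
}

// Version 2 from the question: no `break`; a flag in the loop
// condition ends the loop once a divisor has been found.
bool isPrimeFlag(int num) {
    int prime = 1;
    for (int c = 2; c <= num - 1 && prime; c++) {
        if (num % c == 0) prime = 0;  // not prime
    }
    return prime != 0;
}

// Returns true if both versions give the same answer for every
// candidate in [2, limit]. Candidates start at 2, like the kernel.
bool versionsAgree(int limit) {
    for (int num = 2; num <= limit; num++) {
        if (isPrimeBreak(num) != isPrimeFlag(num)) return false;
    }
    return true;
}

// Collects the first n primes sequentially using the given test.
std::vector<int> firstPrimes(int n, bool (*test)(int)) {
    std::vector<int> primes;
    for (int num = 2; static_cast<int>(primes.size()) < n; num++) {
        if (test(num)) primes.push_back(num);
    }
    return primes;
}
```

Calling versionsAgree(5000) and comparing firstPrimes(100, isPrimeBreak) with firstPrimes(100, isPrimeFlag) should show identical results, so the two variants do the same arithmetic per candidate and differ only in how the loop is exited.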
performance cuda
dreamcrash