Overhead of break versus a control flag

Question

I wrote a naive prime generator. This code takes about 5.25 seconds to generate 10,000 primes (device_primes[0] holds the number of primes found so far; the remaining positions hold the primes themselves).

    __global__ void getPrimes(int *device_primes, int n)
    {
        int c = 0;
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
        int num = thread_id + 2;

        if (thread_id == 0) device_primes[0] = 1;
        __syncthreads();

        while (device_primes[0] < n)
        {
            for (c = 2; c <= num - 1; c++)
            {
                if (num % c == 0) // not prime
                {
                    break;
                }
            }

            if (c == num) // prime
            {
                int pos = atomicAdd(&device_primes[0], 1);
                device_primes[pos] = num;
            }

            num += blockDim.x * gridDim.x; // next number for this thread
        }
    }

I was just starting to optimize the code and made the following change. Instead of:

    for (c = 2; c <= num - 1; c++)
    {
        if (num % c == 0) // not prime
            break;
    }
    if (c == num) {...}

Now I have:

    int prime = 1;
    ...
    for (c = 2; c <= num - 1 && prime; c++)
    {
        if (num % c == 0) prime = 0; // not prime
    }
    if (prime) {...} // if prime

Now I can generate 10,000 primes in 0.707 seconds. Why does such a simple modification give this much of a speedup? Is break really that bad?

+8

performance cuda

dreamcrash Nov 05 '12 at 16:38

1 answer

Eric Olson · Accepted Answer · 2012-11-25T23:07:15+0000

As Tony suggested, divergent code execution can cause serious slowdowns in GPU code, because it forces some code to run serially rather than in parallel. In the slow version of the code above, threads that hit the break diverge from the threads that keep looping.

The CUDA C Programming Guide is a good resource for GPU programming techniques. Here is what it says about control flow:

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
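To make the quoted passage concrete, here is a minimal sketch. It is not taken from the guide or from the question; the kernel names, the helper runBranchDemo, and the sizes are made up for illustration. Both kernels write the same kind of values, but in the first the branch condition splits every warp into odd and even lanes, while in the second it depends only on the warp index, so all threads of a warp take the same path. For branches this small the compiler may well use predication instead of real branches, so treat it as an illustration of the two patterns rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Odd and even lanes of the same warp take different paths, so the warp
// may have to run both paths one after the other.
__global__ void branchPerThread(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;
    else
        out[i] = i + 1;
}

// The condition depends only on the warp index (1D blocks assumed),
// so every thread of a given warp takes the same path.
__global__ void branchPerWarp(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = 2 * i;
    else
        out[i] = i + 1;
}

// Hypothetical helper: runs both kernels and returns the number of
// elements that differ from the host-side expectation.
// Error checking is omitted for brevity.
int runBranchDemo()
{
    const int n = 256, threads = 64;   // illustrative sizes
    int h_a[n], h_b[n];
    int *d_out = nullptr;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int ws = prop.warpSize;

    cudaMalloc((void **)&d_out, n * sizeof(int));
    branchPerThread<<<n / threads, threads>>>(d_out, n);
    cudaMemcpy(h_a, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    branchPerWarp<<<n / threads, threads>>>(d_out, n);
    cudaMemcpy(h_b, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        int t = i % threads;           // threadIdx.x of element i
        int expect_a = (t % 2 == 0) ? 2 * i : i + 1;
        int expect_b = ((t / ws) % 2 == 0) ? 2 * i : i + 1;
        if (h_a[i] != expect_a) mismatches++;
        if (h_b[i] != expect_b) mismatches++;
    }
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runBranchDemo());
    return 0;
}
```

Timing the two kernels with heavier work in each branch (for example with a profiler) is the way to see whether divergence actually matters on a given device.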

Newer NVIDIA hardware and CUDA versions may handle some branching a little better than older ones, but it is still best to avoid branching where possible.
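For anyone who wants to try the flag-based loop from the question, here is a self-contained sketch. It is an adaptation, not the asker's exact code, and it makes a few assumptions of its own: the counter in device_primes[0] is initialized by the host (since __syncthreads() only synchronizes threads within one block), the counter is read through a volatile pointer, writes are guarded with pos < n, and the names getPrimesFlag, isPrimeHost and countValidPrimes as well as the launch configuration are made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Flag-based variant of the kernel from the question.
// device_primes[0] is the next free slot (set to 1 by the host);
// primes found are stored in device_primes[1..n-1], in no particular order.
__global__ void getPrimesFlag(int *device_primes, int n)
{
    int stride = blockDim.x * gridDim.x;
    int num = blockIdx.x * blockDim.x + threadIdx.x + 2;

    // Assumption: volatile so the counter is re-read on every iteration.
    volatile int *counter = device_primes;

    while (*counter < n) {
        int prime = 1;
        // The flag ends the loop through its condition instead of a break.
        for (int c = 2; c <= num - 1 && prime; c++) {
            if (num % c == 0)
                prime = 0;             // not prime
        }
        if (prime) {
            int pos = atomicAdd(&device_primes[0], 1);
            if (pos < n)               // guard: late finders must not write past the end
                device_primes[pos] = num;
        }
        num += stride;                 // next number for this thread
    }
}

// Host-side reference check.
static bool isPrimeHost(int v)
{
    if (v < 2) return false;
    for (int c = 2; c * c <= v; c++)
        if (v % c == 0) return false;
    return true;
}

// Hypothetical helper: runs the kernel and returns how many of the
// slots 1..n-1 hold a prime. Error checking is omitted for brevity.
int countValidPrimes(int n, int blocks, int threads)
{
    std::vector<int> host(n, 0);
    host[0] = 1;                       // counter starts at 1, as in the question
    int *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(int));
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    getPrimesFlag<<<blocks, threads>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    int valid = 0;
    for (int i = 1; i < n; i++)
        if (isPrimeHost(host[i])) valid++;
    return valid;
}

int main()
{
    const int n = 1000;                // illustrative size
    int valid = countValidPrimes(n, 4, 128);
    printf("%d of %d slots hold a prime\n", valid, n - 1);
    return 0;
}
```

Because threads advance at different rates, the stored values are primes but not necessarily the smallest ones, and after the run device_primes[0] can exceed n. To compare against the break version, swap the inner loop back and time both builds on the same device.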
