The part of the code I provided for reference should do the job.
eg. you have elements N , and the size of your workgroup is WS = 64 . I believe that N is a multiple of 2 * WS (this is important, one working group calculates the sum of 2 * WS elements). Then you need to run the kernel definition:
globalSizeX = 2*WS*(N/(2*WS));
As a result, the sum array will have partial sums of 2 * WS elements. (for example, sum [1] - will contain the sum of elements whose indices are from 2 * WS to 4 * WS-1 ).
If your globalSizeX is 2 * WS or less (which means that you have only one workgroup), you are done. Just use sum [0] . If not, you need to repeat the procedure, this time using the sum array as an input array and output to another array (create 2 arrays and a ping pong between them). And so on, until you have only one working group.
Search also for Hilli Steele / Blelloch parallel algorithms. This article may also be useful.
Here is an example:
__kernel void par_sum(__global unsigned int* input, __global unsigned int* sum) { int li = get_local_id(0); int groupId = get_group_id(0); __local int our_h[2 * get_group_size(0)]; our_h[2*li + 0] = hist[2*get_group_size(0)*blockId + 2*li + 0]; our_h[2*li + 1] = hist[2*get_group_size(0)*blockId + 2*li + 1]; // sweep up int width = 2; int num_el = 2*get_group_size(0)/width; int wby2 = width>>1; for(int i = 2*BLK_SIZ>>1; i>0; i>>=1) { barrier(CLK_LOCL_MEM_FENCE); if(li < num_el) { int idx = width*(li+1) - 1; our_h[idx] = our_h[idx] + our_h[(idx - wby2)]; } width<<=1; wby2 = width>>1; num_el>>=1; } barrier(CLK_LOCL_MEM_FENCE); // down-sweep if(0 == li) sum[groupId] = our_h[2*get_group_size(0)-1]; // save sum }
alariq
source share