Is parallel sum calculation possible in OpenCL?

I am new to OpenCL, but I understand the basics of C/C++ and OOP. My question is this: is it possible to compute a sum as a parallel task? Is it theoretically possible? Below I describe what I tried to do.

Task, for example:

    double* values = new double[1000]; // let's pretend it has some random values inside
    double sum = 0.0;
    for(int i = 0; i < 1000; i++) {
        sum += values[i];
    }

What I tried to do in the OpenCL kernel (and I think this is wrong, because it may access the same variable sum from different threads/work-items at the same time):

    __kernel void calculate2dim(__global float* vectors1dim,
                                __global float* output,
                                const unsigned int count)
    {
        int i = get_global_id(0);
        *output += vectors1dim[i];
    }

This code is invalid. I would be very grateful if someone could tell me whether it is theoretically possible to run such tasks in parallel, and if so, how!

c++ c parallel-processing opencl
2 answers

If you want to sum the values of your array in parallel, you must make sure that you minimize contention and do not establish data dependencies between threads.

Data dependencies make threads wait for each other, creating contention, which you want to avoid in order to get true parallelism.

One way to do this is to split the array into N sub-arrays, each containing a subsection of your source array, and then invoke your OpenCL kernel on each of them.

In the end, when all the kernels have done the hard work, you can simply sum the partial result of each sub-array into one value. This final step can easily be performed by the CPU.
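A minimal CPU-side sketch of this chunking idea (plain C++; the chunk loop stands in for the independent kernel launches, and `chunked_sum` and the chunk count are names invented here for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum `values` by splitting it into `chunks` independent sub-ranges,
// summing each one separately (the part a kernel would do in parallel),
// then combining the partial sums on the host.
double chunked_sum(const std::vector<double>& values, std::size_t chunks)
{
    std::vector<double> partial(chunks, 0.0);
    std::size_t len = (values.size() + chunks - 1) / chunks;
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t begin = c * len;
        std::size_t end = std::min(values.size(), begin + len);
        for (std::size_t i = begin; i < end; ++i)
            partial[c] += values[i];   // no shared state between chunks
    }
    // final combine on the CPU
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Each `partial[c]` is written by exactly one chunk, so there is no race; that is the property the kernel version has to preserve.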

The key is to have no dependency between the calculations performed by each kernel, so you have to partition both your data and your processing.

I cannot tell from your question whether your data has any actual dependencies, but you will need to find out.


The code below should do the job.

E.g. you have N elements, and your work-group size is WS = 64. Assume that N is a multiple of 2*WS (this is important: one work-group computes the sum of 2*WS elements). Then you need to launch the kernel with:

 globalSizeX = 2*WS*(N/(2*WS)); 

As a result, the sum array will contain the partial sums of 2*WS elements each (for example, sum[1] will contain the sum of the elements whose indices run from 2*WS to 4*WS-1).

If your globalSizeX is 2*WS or less (which means you have only one work-group), you are done: just use sum[0]. If not, you need to repeat the procedure, this time using the sum array as the input array and writing the output to another array (create 2 arrays and ping-pong between them), and so on, until you are left with only one work-group.
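To make the repeated passes concrete, here is a CPU simulation of that host-side loop (plain C++; `reduce_pass` stands in for one kernel launch, and the function names are invented here for illustration):

```cpp
#include <cstddef>
#include <vector>

// One "pass": each group of 2*WS elements collapses to one partial sum,
// mimicking what one kernel launch writes into the `sum` array.
std::vector<unsigned int> reduce_pass(const std::vector<unsigned int>& in,
                                      std::size_t WS)
{
    std::size_t groups = in.size() / (2 * WS);
    std::vector<unsigned int> out(groups, 0);
    for (std::size_t g = 0; g < groups; ++g)
        for (std::size_t i = 0; i < 2 * WS; ++i)
            out[g] += in[2 * WS * g + i];
    return out;
}

// Repeat passes, ping-ponging buffers, until one work-group's worth
// of partial sums is left; finish that remainder on the host.
unsigned int full_sum(std::vector<unsigned int> data, std::size_t WS)
{
    while (data.size() >= 2 * WS)
        data = reduce_pass(data, WS);   // assumes size is a multiple of 2*WS
    unsigned int total = 0;
    for (unsigned int v : data) total += v;
    return total;
}
```

In a real OpenCL host program each `reduce_pass` call would be a clEnqueueNDRangeKernel with the input and output buffers swapped between passes.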

Also look up the Hillis-Steele and Blelloch parallel algorithms. This article may also be useful.
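For reference, here is a CPU sketch of the Hillis-Steele inclusive scan (plain C++; the function name is invented here, and on a GPU each iteration of the inner loop would run in parallel across work-items):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hillis-Steele inclusive scan: at step d, every element adds the
// element d positions to its left; d doubles each step.
std::vector<unsigned int> hillis_steele_scan(std::vector<unsigned int> a)
{
    for (std::size_t d = 1; d < a.size(); d <<= 1) {
        std::vector<unsigned int> next = a;   // double-buffer, as on a GPU
        for (std::size_t i = d; i < a.size(); ++i)
            next[i] = a[i] + a[i - d];
        a = std::move(next);
    }
    return a;   // the last element holds the total sum
}
```

It does more work than a plain reduction (it produces all prefix sums, not just the total) but needs only log2(N) steps, which is why it comes up in GPU sum discussions.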

Here is an example:

    #define BLK_SIZ 64  // work-group size; must match the host-side local work size

    __kernel void par_sum(__global unsigned int* input,
                          __global unsigned int* sum)
    {
        int li = get_local_id(0);
        int groupId = get_group_id(0);
        __local unsigned int our_h[2 * BLK_SIZ];

        // each work-item loads two elements into local memory
        our_h[2*li + 0] = input[2*BLK_SIZ*groupId + 2*li + 0];
        our_h[2*li + 1] = input[2*BLK_SIZ*groupId + 2*li + 1];

        // sweep up
        int width = 2;
        int num_el = 2*BLK_SIZ/width;
        int wby2 = width>>1;
        for(int i = 2*BLK_SIZ>>1; i > 0; i >>= 1)
        {
            barrier(CLK_LOCAL_MEM_FENCE);
            if(li < num_el)
            {
                int idx = width*(li+1) - 1;
                our_h[idx] = our_h[idx] + our_h[idx - wby2];
            }
            width <<= 1;
            wby2 = width>>1;
            num_el >>= 1;
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if(0 == li)
            sum[groupId] = our_h[2*BLK_SIZ - 1];   // save the work-group's sum
    }
