The best approach to implementing a FIFO in an OpenCL kernel

Purpose: implement the diagram below in OpenCL. The main thing the OpenCL kernel needs to do is multiply the coefficient array with the temp array, then accumulate all of those products into a single value at the end. (This is probably the most intensive operation, so parallelism will be really useful here.)

I use a helper function for the kernel that performs the multiplication and addition (I hope this function can also run in parallel).

Image Description:

Values are passed one at a time into an array (the temp array), whose size equals the size of the coefficient array. Each time a single value is transferred into this array, the temp array is multiplied element-wise by the coefficient array in parallel, and the products at each index are then accumulated into one output element. This continues until the final element of the input array has been processed.
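For reference, here is a minimal scalar sketch (plain C, hypothetical names) of the process described above, with no parallelism yet; it is just the FIFO shift, the element-wise multiply, and the accumulation in sequence:

    // Illustrative scalar version of the diagram (names are hypothetical).
    // For every input sample: shift the FIFO, insert the sample at index 0,
    // then multiply element-wise with the coefficients and accumulate.
    void fir_scalar(const float *input, int inputSize,
                    const float *coefficients, int numCoeffs,
                    float *output)
    {
        float fifo[58] = {0};                  // assumes numCoeffs <= 58
        for (int n = 0; n < inputSize; n++) {
            for (int k = numCoeffs - 1; k > 0; k--)
                fifo[k] = fifo[k - 1];         // shift everything right by one
            fifo[0] = input[n];                // new sample enters at index 0
            float acc = 0.0f;
            for (int k = 0; k < numCoeffs; k++)
                acc += fifo[k] * coefficients[k];
            output[n] = acc;                   // one accumulated value per input
        }
    }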


What is happening with my code?

For just 60 input elements it takes more than 8000 ms, and I have 1.2 million samples in total that still need to be processed. I know there must be a way to improve what I am doing. My code is below.

Here are some things I know are wrong with the code. When I try to multiply the coefficient values with the temp array, it crashes. This is because of global_id. All I want that line to do is multiply the two arrays in parallel.

I tried to figure out why the FIFO function takes so long to execute, so I started commenting out lines. First I commented out everything except the first for loop in the FIFO function: that took about 50 ms. Then, when I uncommented the next loop, it jumped to 8000 ms. So the delay seems to be related to the data movement.

Is there a register-style shift I could use in OpenCL? Maybe some way to apply a logical shift to a whole array? (I know there is a ">>" operator.)
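For context, a small hypothetical kernel illustrating the distinction: ">>" is a bitwise shift on integers (and integer vectors), while moving the elements of an array needs an explicit remap. For OpenCL vector types, the built-in shuffle() (available since OpenCL 1.1) can express such a remap in one call:

    // Minimal illustration (not part of the question's code).
    __kernel void shift_demo(__global float *out)
    {
        int bits = 0x10 >> 1;              // = 0x08: shifts bits, not elements

        // Moving *elements* one slot needs an explicit remap; for vector
        // types, shuffle() builds the result from the given source lanes:
        float8 window  = (float8)(1, 2, 3, 4, 5, 6, 7, 8);
        float8 shifted = shuffle(window, (uint8)(0, 0, 1, 2, 3, 4, 5, 6));
        shifted.s0 = 9.0f;                 // new sample enters at index 0

        out[0] = shifted.s1 + (float)bits; // touch results so nothing is elided
    }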

    float constant temp[58];
    float constant tempArrayForShift[58];
    float constant multipliedResult[58];

    float fifo(float inputValue, float *coefficients, int sizeOfCoeff)
    {
        //take array of 58 elements (or same size as number of coefficients)
        //shift all elements to the right one
        //bring next element into index 0 from input
        //multiply the coefficient array with the array thats the same size of coefficients and accumulate
        //store into one output value of the output array
        //repeat till input array has reached the end

        int globalId = get_global_id(0);
        float output = 0.0f;

        //Shift everything down from 1 to 57
        //takes about 50ms here
        for (int i = 1; i < 58; i++) {
            tempArrayForShift[i] = temp[i - 1];
        }

        //Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
        tempArrayForShift[0] = inputValue;

        //Takes about 8000ms with this loop included
        //Write values back into temp array
        for (int i = 0; i < 58; i++) {
            temp[i] = tempArrayForShift[i];
        }

        //all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
        //I am 100% sure this line is crashing the program.
        //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

        //Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
        for (int i = 0; i < 58; i++) {
            // output = multipliedResult[i] + output;
        }

        //Returned summed value of temp array
        return output;
    }

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
    {
        //Initialize the temporary array values to 0
        for (int i = 0; i < 58; i++) {
            temp[i] = 0;
            tempArrayForShift[i] = 0;
            multipliedResult[i] = 0;
        }

        //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
        for (int i = 0; i < 60; i++) {
            Output[i] = fifo(Array[i], coefficients, 58);
        }
    }

I have been struggling with this in OpenCL for a long time. I am not sure how to combine parallel and sequential instructions.

Another alternative I was thinking about

I thought about implementing the FIFO buffer in the main cpp file and having the kernel do only the multiplication and accumulation. But that would mean calling the kernel 1000+ times in a loop. Would that be the best solution, or would it just be inefficient?
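A sketch of what that host-side loop would look like (buffer, kernel, and helper names are all hypothetical); the fixed per-launch overhead of each clEnqueueNDRangeKernel call is the concern here:

    // Hypothetical host-side sketch of the "FIFO on the CPU" alternative.
    // One kernel launch per input sample: correct, but the per-launch
    // overhead dominates long before 1.2M iterations.
    for (int n = 0; n < inputSize; n++) {
        shift_fifo_on_host(fifo, input[n], 58);             // assumed helper
        clEnqueueWriteBuffer(queue, fifoBuf, CL_TRUE, 0,
                             58 * sizeof(float), fifo, 0, NULL, NULL);
        size_t gws = 58;
        clEnqueueNDRangeKernel(queue, macKernel, 1, NULL, &gws,
                               NULL, 0, NULL, NULL);        // multiply+reduce
        clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0,
                            sizeof(float), &output[n], 0, NULL, NULL);
    }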

2 answers

To get good performance from the GPU, you need to parallelize your work across many threads. Your code uses only one thread, and a GPU is very slow per thread but can be very fast when many threads run at the same time. In this case you can use one thread per output value. You do not actually need to shift values through an array: each output value looks at a window of 58 input values, so you can simply fetch those values from memory, multiply them by the coefficients, and write out the result.

A simple implementation would be (launched with as many threads as there are output values):

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
    {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        for (int i = 0; i < 58; i++) {
            float tmp = 0;
            if (globalId + i > 56) {
                tmp = Array[i + globalId - 57] * coefficients[57 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }

This is not ideal, because the memory access pattern it creates is not optimal for GPUs. The cache will probably help a little, but there is a lot of room for optimization, since each input value is reused many times. The operation you are trying to perform is called a (1D) convolution. NVIDIA has a 2D example called oclConvolutionSeparable in its GPU Computing SDK that shows an optimized version. You could adapt its convolutionRows kernel to one-dimensional convolution.
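As a rough 1D sketch of that idea (my own adaptation, not NVIDIA's code; TILE and the kernel name are illustrative, and it assumes the input/output length is a multiple of TILE): each work-group stages a tile of the input plus a 57-element left halo in local memory, so each value is read from global memory once per group instead of up to 58 times.

    #define NCOEFF 58
    #define TILE   64   // work-group size; illustrative choice

    // Sketch of a tiled 1D convolution: one work-group per TILE outputs.
    // Each group loads TILE + NCOEFF - 1 input values into local memory
    // (a left "halo" of 57 earlier samples, zeros before the start), then
    // every work item computes one output entirely from the staged data.
    __kernel void lowpass_tiled(__global const float *Array,
                                __constant float *coefficients,
                                __global float *Output)
    {
        __local float tile[TILE + NCOEFF - 1];
        int lid   = get_local_id(0);
        int start = get_group_id(0) * TILE;    // first output of this group

        // stage halo + tile; indices before the array start read as 0
        for (int i = lid; i < TILE + NCOEFF - 1; i += TILE) {
            int src = start - (NCOEFF - 1) + i;
            tile[i] = (src >= 0) ? Array[src] : 0.0f;
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // each work item: 58 multiply-adds, all from local memory
        float sum = 0.0f;
        for (int k = 0; k < NCOEFF; k++)
            sum += coefficients[k] * tile[lid + NCOEFF - 1 - k];
        Output[start + lid] = sum;
    }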


Here is another kernel you can try. It has a lot of synchronization points (barriers), but it should perform quite well. A work-group of 65 work items is not very optimal, though.

Steps:

  1. initialize the local values to 0 (all work items)
  2. copy the coefficients into local memory

Then loop over the output elements, computing each one in turn:

  3. shift the existing elements (work items > 0 only)
  4. copy in the new element (work item 0 only)
  5. compute the dot product:
     5a. multiply: one product per work item
     5b. reduction loop to compute the sum
  6. copy the dot product to the output (work item 0 only)
  7. final barrier

the code:

    __kernel void lowpass(__global float *Array,
                          __constant float *coefficients,
                          __global float *Output,
                          __local float *localArray,
                          __local float *localSums,
                          int outputSize /* number of output samples */)
    {
        int globalId = get_global_id(0);
        int localId = get_local_id(0);
        int localSize = get_local_size(0);

        //1 init local values to 0
        localArray[localId] = 0.0f;

        //2 copy coefficients to local
        //don't bother with this if __constant is working for you
        //requires another local to be passed in: localCoeff
        //localCoeff[localId] = coefficients[localId];

        //barrier for both steps 1 and 2
        barrier(CLK_LOCAL_MEM_FENCE);

        float tmp = 0.0f;
        for (int i = 0; i < outputSize; i++) {
            //3 shift elements (+barrier)
            if (localId > 0) {
                tmp = localArray[localId - 1];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            localArray[localId] = tmp;

            //4 copy new element (work item 0 only, + barrier)
            if (localId == 0) {
                localArray[0] = Array[i];
            }
            barrier(CLK_LOCAL_MEM_FENCE);

            //5 compute dot product
            //5a multiply + barrier
            localSums[localId] = localArray[localId] * coefficients[localId];
            barrier(CLK_LOCAL_MEM_FENCE);

            //5b reduction loop + barrier
            for (int j = 1; j < localSize; j <<= 1) {
                int mask = (j << 1) - 1;
                if ((localId & mask) == 0 && localId + j < localSize) {
                    localSums[localId] += localSums[localId + j];
                }
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            //6 copy dot product (WI 0 only)
            if (localId == 0) {
                Output[i] = localSums[0];
            }

            //7 barrier
            //only needed if there is more code after the loop;
            //the barrier in step 3 covers this when the loop continues
            //barrier(CLK_LOCAL_MEM_FENCE);
        }
    }
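One host-side detail this kernel relies on: __local kernel arguments are sized from the host by passing a byte count and a NULL pointer to clSetKernelArg. A sketch, with hypothetical buffer names and the simple single-work-group launch:

    // Host-side sketch: __local arguments get a size and a NULL pointer;
    // the runtime allocates that much local memory per work-group.
    size_t localBytes = 58 * sizeof(float);          // one slot per coefficient
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &arrayBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &outputBuf);
    clSetKernelArg(kernel, 3, localBytes, NULL);     // localArray
    clSetKernelArg(kernel, 4, localBytes, NULL);     // localSums
    cl_int outputSize = 1200000;                     // hypothetical sample count
    clSetKernelArg(kernel, 5, sizeof(cl_int), &outputSize);

    size_t gws = 58, lws = 58;  // single work-group in the simple case
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, NULL);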

How about more work-groups?
This is simplified a bit so that a single 1x65 work-group computes all 1.2M outputs. To allow multiple work-groups, use get_num_groups(0) to calculate the amount of work each group should do (call it workAmount), and adjust my for loop:

    for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)
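Concretely, workAmount could be computed inside the kernel like this (a sketch; it assumes outputSize divides evenly among the groups):

    // Sketch: divide the outputs evenly across the work-groups.
    // Assumes outputSize is a multiple of get_num_groups(0).
    int workAmount = outputSize / get_num_groups(0);
    int begin = workAmount * get_group_id(0);
    int end = begin + workAmount;
    for (int i = begin; i < end; i++) {
        // steps 3-7 as in the kernel above
    }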

Step #1 also has to change: localArray needs to be initialized to the correct starting state for the group, not all zeros.

    //1 init local values
    //groups other than the first seed the FIFO as if all samples before
    //this group's range had already been shifted through
    int groupId = get_group_id(0);
    if (groupId == 0) {
        localArray[localId] = 0.0f;
    } else {
        localArray[localId] = Array[workAmount * groupId - 1 - localId];
    }

These two changes should let you use a more optimal number of work-groups; I suggest a small multiple of the number of compute units on the device. Try to keep the amount of work per group in the thousands. Experiment with this: sometimes what seems optimal at a high level turns out to be detrimental to the kernel when it runs.
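To pick that multiple, the compute-unit count can be queried on the host with the standard clGetDeviceInfo call (device, queue, and kernel here are hypothetical handles):

    // Host-side: query how many compute units the device has, then size
    // the grid as a small multiple of that.
    cl_uint computeUnits = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);
    size_t numGroups = 4 * computeUnits;   // "a small multiple"; tune this
    size_t lws = 58;
    size_t gws = numGroups * lws;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, NULL);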

Benefits
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the work items are busy is during the reduction loop in step 5b; it is worth reading up on parallel reduction to see why this is still good.

Disadvantages
Barriers slow a kernel down by their very nature: they pause each work item until the others reach the same point. You might be able to implement this with fewer barriers, but I still think this structure is close to optimal for the problem you are trying to solve.
There is no room for extra work items in the group, and 65 is not a very optimal size; ideally you would aim for a power of 2, or 64. This shouldn't be a huge problem though, because the kernel has so many barriers that the work items already wait quite regularly.
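One way to get a power-of-two size (my suggestion, not something the kernel above requires): zero-pad the coefficient array from 58 to 64 entries on the host, so the extra work items multiply by zero and the reduction loop divides evenly:

    #include <string.h>

    // Zero-pad the 58 coefficients to 64 (a power of two): the 6 padded
    // taps multiply by 0, so the dot product is unchanged, and the
    // reduction loop in step 5b divides evenly.
    float padded[64] = {0};
    memcpy(padded, coefficients, 58 * sizeof(float));
    /* create the coefficient buffer from 'padded' and use a local size of 64 */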

