Purpose: Implement the diagram below in OpenCL. The main thing required from the OpenCL kernel is to multiply the coefficient array by the temp array and then accumulate all the products into a single value at the end. (This is probably the most intensive operation, so parallelism will be really useful here.)
I use a helper function in the kernel that performs the multiplication and addition (I hope this function will also run in parallel).
Image Description:
Values from the input are passed one at a time into an array (the temp array) whose size equals the size of the coefficient array. Each time a single value is moved into this array, the temp array is multiplied element by element with the coefficient array in parallel, and the products at all indices are then summed into a single output element. This continues until the input array reaches its final element.
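For reference, the procedure described above can be sketched as plain sequential C on the host. This is only a minimal sketch of the intended result, not the kernel itself: the names `fir_fifo` and `NUM_TAPS` are made up for illustration, and the size of 58 is taken from the kernel code further down in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_TAPS 58  /* assumption: same size as the coefficient array in this post */

/* Hypothetical host-side reference for the FIFO described above.
 * For every input sample: shift the FIFO by one, write the sample to index 0,
 * multiply element-wise with the coefficients and sum the products. */
void fir_fifo(const float *input, size_t num_samples,
              const float *coefficients, size_t num_taps, float *output)
{
    float fifo[NUM_TAPS] = {0.0f};  /* the temp array, initially all zeros */

    assert(num_taps >= 1 && num_taps <= NUM_TAPS);

    for (size_t n = 0; n < num_samples; n++) {
        /* Shift right by one: element i takes the old value of element i-1. */
        memmove(&fifo[1], &fifo[0], (num_taps - 1) * sizeof(float));
        fifo[0] = input[n];

        /* Multiply-accumulate over all coefficients. */
        float acc = 0.0f;
        for (size_t k = 0; k < num_taps; k++) {
            acc += coefficients[k] * fifo[k];
        }
        output[n] = acc;
    }
}
```

A reference like this is mainly useful for checking the values the kernel writes to `Output` against a known-good result on a small input.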

What is happening with my code?
For 60 elements from the input, it takes more than 8000 ms, and I have a total of 1.2 million input values that still need to be passed in. I know there must be a better solution than the one I have now. Here is my code below.
Here are some things that I know are wrong with the code. When I try to multiply the coefficient values by the temp array, it crashes. This is because of global_id. All I want that line to do is multiply the two arrays in parallel.
I tried to understand why the fifo function takes so long to execute, so I started commenting out lines. First, I commented out everything except the first for loop of the fifo function. With that, it took 50 ms. Then, when I uncommented the next loop, it jumped to 8000 ms. So the delay seems to come from copying the data.
Is there a register shift that I could use in OpenCL? Maybe a logical shift that works on whole arrays? (I know that there is a ">>" operator.)
    float constant temp[58];
    float constant tempArrayForShift[58];
    float constant multipliedResult[58];

    float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
        //take array of 58 elements (or same size as number of coefficients)
        //shift all elements to the right one
        //bring next element into index 0 from input
        //multiply the coefficient array with the array that's the same size as the coefficients and accumulate
        //store into one output value of the output array
        //repeat till input array has reached the end

        int globalId = get_global_id(0);
        float output = 0.0f;

        //Shift everything down from 1 to 57
        //takes about 50ms here
        for(int i=1; i<58; i++){
            tempArrayForShift[i] = temp[i];
        }

        //Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
        tempArrayForShift[0] = inputValue;

        //Takes about 8000ms with this loop included
        //Write values back into temp array
        for(int i=0; i<58; i++){
            temp[i] = tempArrayForShift[i];
        }

        //all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
        //I am 100% sure this line is crashing the program.
        //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

        //Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
        for (int i = 0; i < 58; i ++) {
            // output = multipliedResult[i] + output;
        }

        //Returned summed value of temp array
        return output;
    }

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        //Initialize the temporary array values to 0
        for (int i = 0; i < 58; i ++) {
            temp[i] = 0;
            tempArrayForShift[i] = 0;
            multipliedResult[i] = 0;
        }

        //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
        for (int i = 0; i < 60; i ++) {
            Output[i] = fifo(Array[i], coefficients, 58);
        }
    }
I have been struggling with OpenCL for a long time. I am not sure how to combine parallel and sequential instructions.
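One way I could look at this, as a sketch only: if the temp array only ever holds the most recent inputs, then each output value depends on nothing but the last few input values, so every output could be computed independently without any shifting. The plain C below illustrates that idea under this assumption. The names `fir_output_at` and `fir_direct` are hypothetical, and mapping the index `n` to `get_global_id(0)` in a kernel is my assumption, not something I have tested.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-output formulation: output n is the dot product of the
 * coefficients with the most recent inputs input[n], input[n-1], ...
 * Inputs before the start of the array are treated as zero. */
float fir_output_at(const float *input, const float *coefficients,
                    size_t num_taps, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < num_taps && k <= n; k++) {
        acc += coefficients[k] * input[n - k];
    }
    return acc;
}

void fir_direct(const float *input, size_t num_samples,
                const float *coefficients, size_t num_taps, float *output)
{
    assert(num_taps >= 1);

    /* The iterations do not depend on each other. In a kernel, n could come
     * from get_global_id(0) instead of this loop (assumption). */
    for (size_t n = 0; n < num_samples; n++) {
        output[n] = fir_output_at(input, coefficients, num_taps, n);
    }
}
```

Written this way the sequential part (the FIFO) disappears, and only the multiply-accumulate per output remains.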
Another alternative I was thinking about
In the main cpp file, I thought about implementing the fifo buffer there and having the kernel do only the multiplication and addition. But that would mean calling the kernel 1000+ times in a loop. Would this be the best solution, or would it just be inefficient?