The key to exploiting the GPU's parallelism in this case is to let it manage the outer loop for us. Instead of invoking the kernel once for the entire data vector, we invoke it once per element of the data vector. The kernel function simplifies to this:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    // Each thread computes one output sample: the dot product of the
    // filter with the window of input samples starting at its own index.
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
To submit this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. One wrinkle is ensuring that the input and output buffers are large enough, so we slightly overallocate beyond the actual size of the data. This wastes a small amount of memory and computation, but spares us the complexity of a separate dispatch to compute the convolution for the elements at the end of the buffer.
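For illustration, here is a minimal Swift sketch of that dispatch. It assumes a Metal device, commandQueue, filterBuffer, and a pipelineState built from the convolve kernel already exist; those names, along with dataCount and filterCount, are placeholders rather than anything defined above:

    let w = pipelineState.threadExecutionWidth
    // One thread per output element; round the grid up to a whole
    // number of threadgroups of the recommended width.
    let groupCount = (dataCount + w - 1) / w
    // Overallocate so threads past the end of the data stay in bounds.
    let paddedCount = groupCount * w + filterCount
    let byteLength = paddedCount * MemoryLayout<Float>.stride

    let dataBuffer = device.makeBuffer(length: byteLength, options: [])!
    let outBuffer  = device.makeBuffer(length: byteLength, options: [])!
    // (Copy the input samples into dataBuffer's contents() here.)

    let commandBuffer = commandQueue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipelineState)
    encoder.setBuffer(dataBuffer, offset: 0, index: 0)
    var dataSize = Int32(dataCount)
    encoder.setBytes(&dataSize, length: MemoryLayout<Int32>.stride, index: 1)
    encoder.setBuffer(filterBuffer, offset: 0, index: 2)
    var filterSize = Int32(filterCount)
    encoder.setBytes(&filterSize, length: MemoryLayout<Int32>.stride, index: 3)
    encoder.setBuffer(outBuffer, offset: 0, index: 4)

    let threadsPerGroup = MTLSize(width: w, height: 1, depth: 1)
    let threadgroups = MTLSize(width: groupCount, height: 1, depth: 1)
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()
    commandBuffer.commit()

Note that padding by filterCount extra floats also covers the dataVector[id + i] reads at the tail of the grid, which is what lets the kernel skip any bounds checking.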
In my experiments, this parallel approach runs 400-1000x faster than the serial version in the question. I'd be curious to hear how it compares to your CPU implementation.