How to Speed Up Metal Code for iOS / Mac OS

I am trying to implement code in Metal that performs a 1D convolution between two vectors of arbitrary lengths. I implemented the following, which works correctly:

    kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                         const device int &dataSize [[ buffer(1) ]],
                         const device float *filterVector [[ buffer(2) ]],
                         const device int &filterSize [[ buffer(3) ]],
                         device float *outVector [[ buffer(4) ]],
                         uint id [[ thread_position_in_grid ]])
    {
        int outputSize = dataSize - filterSize + 1;
        for (int i = 0; i < outputSize; i++) {
            float sum = 0.0;
            for (int j = 0; j < filterSize; j++) {
                sum += dataVector[i + j] * filterVector[j];
            }
            outVector[i] = sum;
        }
    }

My problem is that processing the same data (computation plus transferring data to and from the GPU) takes about 10 times longer in Metal than in Swift on the CPU. My question is: how do I replace the inner loop with a single vector operation, or is there another way to speed up the above code?
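For reference, a CPU baseline for the same convolution in plain Swift might look like the sketch below. This is only an illustration of the kind of serial code the Metal version is being compared against; the function name and `[Float]` parameters are assumptions, not the asker's actual code.

    // A straightforward single-threaded CPU baseline for the same 1D convolution.
    // The name convolveCPU and the [Float] parameters are illustrative.
    func convolveCPU(dataVector: [Float], filterVector: [Float]) -> [Float] {
        let outputSize = dataVector.count - filterVector.count + 1
        var outVector = [Float](repeating: 0, count: outputSize)
        for i in 0..<outputSize {
            var sum: Float = 0
            for j in 0..<filterVector.count {
                sum += dataVector[i + j] * filterVector[j]
            }
            outVector[i] = sum
        }
        return outVector
    }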

1 answer

The key to exploiting the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we invoke it once for each element of the output. The kernel function simplifies to this:

    kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                         const constant int &dataSize [[ buffer(1) ]],
                         const constant float *filterVector [[ buffer(2) ]],
                         const constant int &filterSize [[ buffer(3) ]],
                         device float *outVector [[ buffer(4) ]],
                         uint id [[ thread_position_in_grid ]])
    {
        float sum = 0.0;
        for (int i = 0; i < filterSize; ++i) {
            sum += dataVector[id + i] * filterVector[i];
        }
        outVector[id] = sum;
    }

To dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. One tricky part is making sure there is enough padding in the input and output buffers that we can slightly overrun the actual size of the data. This wastes a small amount of memory and computation, but saves us the complexity of a separate dispatch just to compute the convolution for the elements at the end of the buffer.

    // We should ensure here that the data buffer and output buffer each have a size that is a
    // multiple of the compute pipeline threadExecutionWidth, by padding the amount we allocate
    // for each of them. After execution, we ignore the extraneous elements in the output buffer
    // beyond the first (dataCount - filterCount + 1).
    let iterationCount = dataCount - filterCount + 1
    let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth),
                                        height: 1,
                                        depth: 1)
    let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
    let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

    let commandEncoder = commandBuffer.computeCommandEncoder()
    commandEncoder.setComputePipelineState(computePipeline)
    commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
    commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
    commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
    commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
    commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
    commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    commandEncoder.endEncoding()
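As a sketch of the padding mentioned in the comment above, one simple option is to allocate an extra threadExecutionWidth worth of elements for the data and output buffers, so that threads beyond the last valid output can read and write safely. The variable names (`device`, `data`, `filter`) and the exact padding amount below are assumptions for illustration, not part of the original answer.

    // One way to satisfy the padding requirement: over-allocate the data and output
    // buffers by threadExecutionWidth elements. `device` is the MTLDevice and
    // `data`/`filter` are [Float]; names are illustrative.
    let w = computePipeline.threadExecutionWidth
    let floatSize = MemoryLayout<Float>.stride

    let paddedData = data + [Float](repeating: 0, count: w)
    let dataBuffer = device.makeBuffer(bytes: paddedData,
                                       length: paddedData.count * floatSize,
                                       options: [])!
    let filterBuffer = device.makeBuffer(bytes: filter,
                                         length: filter.count * floatSize,
                                         options: [])!
    let outBuffer = device.makeBuffer(length: (data.count + w) * floatSize,
                                      options: [])!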

In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I am curious to hear how it compares with your CPU implementation.
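To make that comparison end to end, one simple option is to measure wall-clock time around the full GPU round trip (encode, commit, wait). The sketch below assumes Metal and Foundation are imported and that `commandQueue` and an `encodeConvolution(into:)` helper wrapping the encoding code above exist; both names are illustrative.

    // Wall-clock timing of the full GPU round trip. commandQueue and
    // encodeConvolution(into:) are assumed helpers, not part of the original answer.
    // makeCommandBuffer() returns an optional on newer SDKs; force-unwrapped for brevity.
    let start = CFAbsoluteTimeGetCurrent()
    let commandBuffer = commandQueue.makeCommandBuffer()!
    encodeConvolution(into: commandBuffer)
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    let elapsed = CFAbsoluteTimeGetCurrent() - start
    print("GPU convolution (including encoding and wait) took \(elapsed * 1000) ms")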
