Right now, my GPU is slower than my CPU when it comes to kernel execution time. I thought that, since I was testing with a small sample, the CPU finished faster because of its lower startup overhead. However, when I tested the kernel with data more than 10 times the size of the sample, the CPU still finished faster and the GPU was more than 300 ms behind.
Runtime with 2.39 MB file: CPU: 43.511 ms, GPU: 65.219 ms
Runtime with 32.9 MB file: CPU: 289.541 ms, GPU: 605.400 ms
I tried to use local memory, although I am 100% sure that I used it incorrectly, and ran into two problems. Either the kernel finishes somewhere between 1000 and 3000 ms (depending on the size I set for localWorkSize), or I get status code -5, which is CL_OUT_OF_RESOURCES.
Here is the kernel that an SO member helped me with.
    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        for (int i = 0; i < 65; i++) {
            float tmp = 0;
            if (globalId + i > 63) {
                tmp = Array[i + globalId - 64] * coefficients[64 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }
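For reference, here is a minimal CPU sketch in plain C of what the kernel above computes: a 65-tap FIR sum per output, with samples before the start of the array treated as zero. The name `lowpass_reference` and the `NUM_TAPS` constant are my own labels and not part of the original code. It is only meant as a baseline for checking the kernel's output, not as the CPU version that was timed.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 65 /* matches the loop bound in the kernel */

/* Hypothetical CPU reference: output[g] is the sum over the current sample
   and the 64 samples before it, weighted by the reversed coefficients.
   Samples before index 0 contribute nothing. */
void lowpass_reference(const float *input, const float *coefficients,
                       float *output, size_t n)
{
    assert(input != NULL && coefficients != NULL && output != NULL);

    for (size_t g = 0; g < n; ++g) {
        float sum = 0.0f;
        for (size_t i = 0; i < NUM_TAPS; ++i) {
            /* Same guard as `globalId + i > 63` in the kernel. */
            if (g + i >= NUM_TAPS - 1) {
                sum += input[g + i - (NUM_TAPS - 1)]
                     * coefficients[NUM_TAPS - 1 - i];
            }
        }
        output[g] = sum;
    }
}
```

Feeding it a single 1.0f followed by zeros should return the first coefficients in order, which is a quick sanity check for any GPU variant.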
Here is my attempt to use local memory. The first part is a fragment of the host code, and the second part is the kernel.
    //Set the size of localMem
    status |= clSetKernelArg(
        kernel,
        2,
        1024, //I had num_items*sizeof(float) but it gave me a -5. num_items is the number of elements in my array (around 1.2 million elements)
        NULL);
    printf("Kernel Arg output status: %i \n", status);

    //set a localWorkSize
    localWorkSize[0] = 64;

    //execute the kernel with localWorkSize included
    status = clEnqueueNDRangeKernel(
        cmdQueue,
        kernel,
        1,
        NULL,
        globalWorkSize,
        localWorkSize,
        0,
        NULL,
        &someEvent);

Here is what I did to the kernel:

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
        int globalId = get_global_id(0);
        int localId = get_local_id(0);
        localMem[localId] = Array[globalId];
        float sum = 0.0f;
        for (int i = 0; i < 65; i++) {
            float tmp = 0;
            if (globalId + i > 63) {
                tmp = localMem[i + localId - 64] * coefficients[64 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }
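To make the indexing concrete, here is a hedged CPU emulation in plain C of how a local-memory version could be laid out. It is a sketch under assumptions, not a tested OpenCL kernel: `GROUP_SIZE` stands in for localWorkSize, `tile` stands in for the `__local` buffer, and the function name is hypothetical. The idea is that each work-group only needs its own 64 samples plus the 64 samples before them, so the local buffer is (64 + 64) * sizeof(float) = 512 bytes rather than num_items * sizeof(float). Offsetting the tile by 64 elements also avoids the negative index that `localMem[i+localId-64]` can produce.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS   65
#define HALO       (NUM_TAPS - 1) /* 64 earlier samples each output needs */
#define GROUP_SIZE 64             /* assumed localWorkSize */

/* Emulates the work-groups one after another on the CPU. */
void lowpass_tiled(const float *input, const float *coefficients,
                   float *output, size_t n)
{
    assert(input != NULL && coefficients != NULL && output != NULL);

    float tile[GROUP_SIZE + HALO]; /* stand-in for __local memory */

    for (size_t group_start = 0; group_start < n; group_start += GROUP_SIZE) {
        /* Load phase: tile[k] holds input[group_start - HALO + k],
           or zero where that index falls outside the array. */
        for (size_t k = 0; k < GROUP_SIZE + HALO; ++k) {
            size_t shifted = group_start + k; /* source index plus HALO */
            tile[k] = (shifted >= HALO && shifted - HALO < n)
                          ? input[shifted - HALO]
                          : 0.0f;
        }

        /* Compute phase: one iteration per emulated work-item. */
        for (size_t local_id = 0; local_id < GROUP_SIZE; ++local_id) {
            size_t g = group_start + local_id;
            if (g >= n) {
                break; /* last group may be partial */
            }
            float sum = 0.0f;
            for (size_t i = 0; i < NUM_TAPS; ++i) {
                /* tile[local_id + i] corresponds to input[g + i - 64]. */
                sum += tile[local_id + i] * coefficients[HALO - i];
            }
            output[g] = sum;
        }
    }
}
```

In a real kernel the load loop would be split across the work-items of the group and followed by barrier(CLK_LOCAL_MEM_FENCE) before the compute phase, so that no work-item reads the tile before it is fully written.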
The link I used when trying to set local variables: How to use local memory in OpenCL?
The link I used to find kernelWorkGroupSize (which is why I have 1024 set in the kernel arg): CL_OUT_OF_RESOURCES for 2 million floats with 1GB VRAM?
I have seen other people with similar problems where the GPU is slower than the CPU, but many of them are using clEnqueueTask instead of clEnqueueNDRangeKernel.
Here is my previous question if you need more information about this kernel: Best approach to implementing FIFO in the OpenCL kernel
Found some optimization tricks for the GPU. https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf
Edited code (the error still exists):
    __kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        float tmp = 0.0f;
        for (int i = 64 - globalId; i < 65; i++) {
            tmp = 0.0f;
            tmp = Array[i] * coefficients[i];
            sum += tmp;
        }
        Output[globalId] = sum;
    }