Optimizing OpenCL kernel code for the GPU

Right now, my GPU is slower than my CPU when it comes to kernel runtime. I thought that since I tested a small sample, the CPU finished faster due to lower startup overhead. However, when I tested the kernel with data nearly 10 times the size of the sample, the CPU still finished first and the GPU was still slower by over 300 ms.

Runtime with 2.39 MB file: CPU 43.511 ms, GPU 65.219 ms

Runtime with 32.9 MB file: CPU 289.541 ms, GPU 605.400 ms

I tried to use local memory, although I am 100% sure that I used it incorrectly, and ran into two problems: either the kernel takes somewhere between 1000 and 3000 ms (depending on what size I set for localWorkSize), or it fails with status code -5, which is CL_OUT_OF_RESOURCES.

Here is the kernel that an SO member helped me with.

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        for (int i = 0; i < 65; i++) {
            float tmp = 0;
            if (globalId + i > 63) {
                tmp = Array[i + globalId - 64] * coefficients[64 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }

Here is my attempt to use local memory. The first part is a fragment of the host code, followed by the kernel.

    //Set the size of localMem
    status |= clSetKernelArg(
        kernel,
        2,
        1024, //I had num_items * sizeof(float), but it gave me a -5. num_items is the number of elements in my array (around 1.2 million)
        NULL);
    printf("Kernel Arg output status: %i \n", status);

    //set a localWorkSize
    localWorkSize[0] = 64;

    //execute the kernel with localWorkSize included
    status = clEnqueueNDRangeKernel(
        cmdQueue,
        kernel,
        1,
        NULL,
        globalWorkSize,
        localWorkSize,
        0,
        NULL,
        &someEvent);

    //Here is what I did to the kernel
    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output, __local float *localMem) {
        int globalId = get_global_id(0);
        int localId = get_local_id(0);
        localMem[localId] = Array[globalId];
        float sum = 0.0f;
        for (int i = 0; i < 65; i++) {
            float tmp = 0;
            if (globalId + i > 63) {
                tmp = localMem[i + localId - 64] * coefficients[64 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }

The link I used when trying to set local variables: How to use local memory in OpenCL?

The link I used to look up kernelWorkGroupSize (which is why I pass 1024 in clSetKernelArg): CL_OUT_OF_RESOURCES for 2 million floats with 1GB VRAM?

I saw that other people have had similar problems where the GPU is slower than the CPU, but many of them were using clEnqueueKernel instead of clEnqueueNDRangeKernel.

Here is my previous question if you need more information about this kernel: Best approach to implementing a FIFO in an OpenCL kernel

I found some optimization tricks for the GPU here: https://developer.amd.com/wordpress/media/2012/10/Optimizations-ImageConvolution1.pdf

Edit: updated the code below; the error still exists.

    __kernel void lowpass2(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        float tmp = 0.0f;
        for (int i = 64 - globalId; i < 65; i++) {
            tmp = 0.0f;
            tmp = Array[i] * coefficients[i];
            sum += tmp;
        }
        Output[globalId] = sum;
    }
2 answers

Running the following kernel over a 24-million-element array:

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        for (int i = 0; i < 65; i++) {
            float tmp = 0;
            if (globalId + i > 63) {
                tmp = Array[i + globalId - 64] * coefficients[64 - i];
            }
            sum += tmp;
        }
        Output[globalId] = sum;
    }

it completes within 200 ms on a GPU with 25 compute units, but takes more than 500 ms on an 8-core CPU.

Either you have a high-end CPU and a low-end GPU, or the GPU driver is gimped, or the GPU's PCI-e interface is stuck at PCI-e 1.1 @ 4x, so host-device copies are bandwidth-limited.

On the other hand, this optimized version:

    __kernel void lowpass(__global __read_only float *Array, __constant float *coefficients, __global __write_only float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        int min_i = max(64, globalId) - 64;
        int max_i = min(min_i + 65, globalId + 1); // clamped so the coefficients index never goes negative for globalId < 64
        for (int i = min_i; i < max_i; i++) {
            sum += Array[i] * coefficients[globalId - i];
        }
        Output[globalId] = sum;
    }

takes less than 150 ms on the CPU (8 compute units) and less than 80 ms of GPU compute time (25 compute units). Each work item does only 65 operations; such a small amount of work is very easy to accelerate using the __constant, __read_only, and __write_only parameter specifiers and by reducing the integer work.

Using float4 instead of float for Array and Output should give roughly an 80% speedup on both your CPU and GPU, since both have SIMD and vector compute units.

Bottlenecks of this kernel:

  • Only 65 multiplications and 65 additions per work item.
  • Yet the data still moves over the PCI-express interface, which is slow.
  • Also, one conditional check (i < max_i) per float operation is expensive; loop unrolling is needed.
  • Everything is scalar, although your CPU and GPU are vector-based.

Generally:

  • The first kernel launch triggers the just-in-time OpenCL optimizing compiler, which is slow. Run the kernel at least 5-10 times for accurate timings.
  • __constant space is only 10 - 100 kB, but it is faster than __global and well suited to the AMD HD5000 series.
  • Kernel launch overhead is around 100 microseconds, while 65 cached operations take less than that, so the work is hidden by the launch overhead (and, even worse, by the PCI-e latency).
  • Too few work items keeps occupancy low.

Also:

  • A 4-core Xeon @ 3 GHz is much faster than 16 (1/4 of VLIW5) * 2 (compute units) = 32 GPU cores @ 600 MHz, thanks to branch prediction, total cache bandwidth, instruction latency, and the absence of PCI-e latency.
  • The AMD HD5000 series cards are obsolete, effectively gimped.
  • The HD5450 has 166 GB/s of constant memory bandwidth.
  • It also has 83 GB/s of LDS (local memory) bandwidth,
  • and 83 GB/s of L1 and L2 cache bandwidth, so just let the driver optimize __global accesses instead of using LDS, if you don't plan to upgrade your computer (for Array, of course). Perhaps serving odd elements from LDS and even elements from __global could reach a combined 83 + 83 = 166 GB/s. You can try it; maybe grouping in pairs works better than strict alternation because of bank conflicts.
  • Using coefficients as __constant (166 GB/s) and Array as __global (83 GB/s) should give you 166 + 83 = 249 GB/s of combined bandwidth.
  • Each coefficients element is used only once per work item, so I do not suggest caching them in private registers (499 GB/s).


Before moving to local memory, first hoist the if out of the loop:

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        float sum = 0.0f;
        int start = 0;
        if (globalId < 64)
            start = 64 - globalId;
        for (int i = start; i < 65; i++)
            sum += Array[i + globalId - 64] * coefficients[64 - i];
        Output[globalId] = sum;
    }

Then local memory can be introduced like this:

    __kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
        int globalId = get_global_id(0);
        int local_id = get_local_id(0);
        __local float local_coefficients[65];
        __local float local_array[2 * 65];
        local_coefficients[local_id] = coefficients[local_id];
        if (local_id == 0)
            local_coefficients[64] = coefficients[64];
        for (int i = 0; i < 2 * 65; i += get_local_size(0)) {
            if (i + local_id < 2 * 65)
                local_array[i + local_id] = Array[i + globalId];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        float sum = 0.0f;
        int start = 0;
        if (globalId < 64)
            start = 64 - globalId;
        for (int i = start; i < 65; i++)
            sum += local_array[i + local_id] * local_coefficients[64 - i];
        Output[globalId] = sum;
    }

P.S. There may be some errors, such as miscalculations of global and local indices, etc. (I'm off to sleep now :)). Nevertheless, this implementation should point you in the right direction for getting started with local memory.

