A faster way to structure offset neighborhood operations in OpenCL

How can I organize an operation on many overlapping but offset blocks of a 2D array for more efficient execution in OpenCL?

For example, I have the following OpenCL kernel:

    __kernel void test_kernel(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height
    )
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
        uint4 diff = (uint4)(0, 0, 0, 0);

        for (int i=0; i<16; i++) {
            for (int j=0; j<16; j++) {
                diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                        read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
            }
        }
        write_imageui(dest, pos, diff);
    }

It produces correct results, but it is slow: only ~25 GFLOPS on an NVS4200M with 1024x1024 input (the hardware spec is 155 GFLOPS). I assume this is due to the memory access pattern. Each work item reads one 16x16 block of data that is identical for all of its neighbors within the same 16x16 region, plus another, offset block that largely overlaps those of its nearest neighbors. All reads go through the sampler. The host program is PyOpenCL (I don't think that really changes anything), and the work-group size is 16x16.

EDIT: New version of the kernel which, as suggested in the answer below, copies the working area into local memory:

    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void test_kernel(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height
    )
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int dx = pos.x % 16;
        int dy = pos.y % 16;

        __local uint4 local_src[16*16];
        __local uint4 local_src2[32*32];

        local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
        local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
        local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
        local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
        local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
        barrier(CLK_LOCAL_MEM_FENCE);

        uint4 diff = (uint4)(0, 0, 0, 0);
        for (int i=0; i<16; i++) {
            for (int j=0; j<16; j++) {
                diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
            }
        }
        write_imageui(dest, pos, diff);
    }

Result: the output is correct, but the running time is 56% slower. If only local_src is used (not local_src2), the result is ~10% faster.

EDIT: For comparison, on much more powerful hardware an AMD Radeon HD 7850 gets 420 GFLOPS against a specification of 1751 GFLOPS. To be fair, the spec figure is for multiply-add, and there is no multiplication here, so the expected figure is ~875 GFLOPS, but this is still a long way off the theoretical performance.

EDIT: To make this easier to test for anyone who would like to try it, the host-side PyOpenCL program is below:

    import pyopencl as cl
    import numpy
    import numpy.random
    from time import time

    CL_SOURCE = '''
    // kernel goes here
    '''

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    prg = cl.Program(ctx, CL_SOURCE).build()

    h, w = 1024, 1024
    src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
    src[:,:,:] = numpy.random.rand(h, w, 4) * 255

    mf = cl.mem_flags
    src_buf = cl.image_from_array(ctx, src, 4)
    fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
    dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

    # warmup
    for n in range(10):
        event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
        event.wait()

    # benchmark
    t1 = time()
    for n in range(100):
        event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
        event.wait()
    t2 = time()

    print "Duration (host): ", (t2-t1)/100
    print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9

EDIT: Reflecting on memory access patterns, the original naive version may actually be fairly good: when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)), all work items in the workgroup read the same location (so is that effectively a single read?), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they read sequential locations (so those reads should coalesce well?).

+4
2 answers

This is definitely a memory access issue. The blocks read by adjacent work items overlap by as much as 15x16, and worse, each work item's reads overlap those of at least 225 others.

I would use local memory and have each workgroup process many of those 16x16 blocks together. I like to use a large square block per workgroup; rectangular blocks are a bit trickier but can improve memory utilization.

If you read n-by-n blocks of pixels from the source image, their borders will overlap by n x 15 (or 15 x n). You need to work out the largest possible value of n based on the local memory (LDS) available to you. With OpenCL 1.1 or higher, LDS is at least 32 KB; OpenCL 1.0 only promises 16 KB per workgroup.

    n <= sqrt(32 kB / sizeof(uint4))
    n <= sqrt(32768 / 16)
    n ~ 45

Using n = 45 consumes 32,400 of the 32,768 LDS bytes and lets you use (45-15)^2 = 900 work items per group. Note: this is where a rectangular block would help; for example, 64x32 uses all of the LDS, but with group size = (64-15) * (32-15) = 833.

Steps to use LDS for your kernel:

  • Allocate a 1D or 2D local array for your cached block of the image. I use a #define constant for its size, and it rarely needs to change.
  • Read the uint values from your image and store them in local memory.
  • Adjust 'pos' for each work item so that it addresses the local copy.
  • Run the same i, j loops you have now, but read the values from local memory. Remember that the i and j loops stop 15 short of n.

Each of these steps can be looked up online if you don't know how to implement it, or you can ask me if you need a hand.

Most likely, LDS on your device will beat the speed of texture reads. This is counterintuitive, but remember that you are reading tiny amounts of data at a time, so the GPU may not be able to cache the pixels effectively. Using LDS guarantees the pixels are available, and given the number of times each pixel is read, I expect this to make a big difference.

Please let me know what results you are observing.

UPDATE: Here is my attempt to explain my solution better. I used simple graphics for the drawings because I'm not very good with image-editing software.

[Figure: how the values are read from 'src' in the first code snippet]

The figure above shows how the values are read from src in your first code snippet. The big problem is that the pos0 rectangle - 16x16 uint4 values - is read in its entirety by every work item in the group (256 of them). My solution involves reading a large area once and sharing that data among all 256 work items.


If you keep a 31x31 area of the image in local memory, all of the data needed by the 256 work items is available.

Actions (a kernel sketch follows this list):

  • Use a work-group size of (16, 16).
  • Read the src values into a large local buffer, i.e. uint4 buff[31][31]. The buffer should be translated so that 'pos0' lands at buff[0][0].
  • barrier(CLK_LOCAL_MEM_FENCE) to wait for the copy into local memory to finish.
  • Run the same i, j loops you originally had, except that you no longer factor in the pos and pos0 values; use only i and j (relative to the local buffer) for the location. Accumulate 'diff' the same way you did originally.
  • Write the result to 'dest'.
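
For reference, here is one way those steps could look in code. This is my own untested sketch of the approach described above, not the answer author's implementation: the kernel name is a placeholder, buff is stored flat rather than as buff[31][31], and it assumes the 16x16 work-group launch from the host program above (so pos0 is the same for the whole group).

    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void test_kernel_lds31(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height
    )
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int dx = get_local_id(0);                      // == pos.x % 16 for this launch
        int dy = get_local_id(1);                      // == pos.y % 16 for this launch
        int2 pos0 = (int2)(pos.x - dx, pos.y - dy);    // same for the whole work-group

        __local uint4 buff[31 * 31];                   // 31*31*16 = 15376 bytes of LDS

        // cooperative load: the 256 work items fill the 961 texels of the 31x31 region
        for (int idx = dy * 16 + dx; idx < 31 * 31; idx += 16 * 16) {
            int2 p = (int2)(pos0.x + idx % 31, pos0.y + idx / 31);
            buff[idx] = read_imageui(src, sampler, p);
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        uint4 diff = (uint4)(0, 0, 0, 0);
        for (int i=0; i<16; i++) {
            for (int j=0; j<16; j++) {
                // buff[j*31 + i]           is src at (pos0.x + i, pos0.y + j)
                // buff[(j+dy)*31 + (i+dx)] is src at (pos.x  + i, pos.y  + j)
                diff += buff[j * 31 + i] - buff[(j + dy) * 31 + (i + dx)];
            }
        }
        write_imageui(dest, pos, diff);
    }

The flat indexing just keeps the cooperative-load loop simple; a 2D buff[31][31] would work equally well.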

This is the same as the first part of my answer, except that it uses n = 16. That value does not use local memory fully, but it will probably work well on most platforms; 256 tends to be the maximum work-group size across the board.

Hope this all makes it easier for you.

+6

Some suggestions:

  • Compute more than one output pixel in each work item. This increases data reuse (see the sketch after this list).
  • Experiment with different work-group sizes to make the most of the texture cache.
  • Perhaps there is a way to split the kernel into two passes (a horizontal pass and a vertical pass).
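
To illustrate the first point, here is an untested sketch of my own (the kernel name and the two-pixels-per-item choice are mine, not from the answer): each work item produces two horizontally adjacent output pixels, so the pos0-block texels are read once instead of twice. The host would launch it with a global size of (w/2, h).

    __kernel void test_kernel_x2(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height
    )
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

        // this work item owns the output pixels pos and pos + (1, 0)
        int2 pos = (int2)(get_global_id(0) * 2, get_global_id(1));
        // pos.x is even, so both output pixels share the same pos0 block
        int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);

        uint4 diff0 = (uint4)(0, 0, 0, 0);
        uint4 diff1 = (uint4)(0, 0, 0, 0);

        for (int i=0; i<16; i++) {
            for (int j=0; j<16; j++) {
                // the pos0-block texel is read once and reused for both outputs
                uint4 base = read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j));
                diff0 += base - read_imageui(src, sampler, (int2)(pos.x + i,     pos.y + j));
                diff1 += base - read_imageui(src, sampler, (int2)(pos.x + 1 + i, pos.y + j));
            }
        }
        write_imageui(dest, pos, diff0);
        write_imageui(dest, (int2)(pos.x + 1, pos.y), diff1);
    }

This issues three sampler reads per inner iteration instead of the four that two separate work items would issue; wider strips (four or more pixels per item) would reuse even more, at the cost of registers.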

Update: more suggestions

Instead of loading everything into local memory, try loading only the local_src values and keep using read_imageui for the other block.
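
Concretely, that is essentially your second kernel with local_src2 dropped (roughly the local_src-only variant you already measured as ~10% faster). A minimal sketch, with a placeholder kernel name:

    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void test_kernel_hybrid(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height
    )
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int dx = pos.x % 16;
        int dy = pos.y % 16;

        // the pos0 block is identical for the whole 16x16 work-group, so cache only that block
        __local uint4 local_src[16 * 16];
        local_src[dy * 16 + dx] = read_imageui(src, sampler, pos);
        barrier(CLK_LOCAL_MEM_FENCE);

        uint4 diff = (uint4)(0, 0, 0, 0);
        for (int i=0; i<16; i++) {
            for (int j=0; j<16; j++) {
                diff += local_src[j * 16 + i]
                      - read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
            }
        }
        write_imageui(dest, pos, diff);
    }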

Since you are doing almost no computation, you should measure the read rate in GB/s and compare it with the device's peak memory bandwidth.
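
As a rough illustration (my own, assuming the worst case of no cache reuse and 4 bytes per texel, since the host creates an RGBA UNSIGNED_INT8 image), the profiled kernel time can be turned into an effective read rate like this:

    /* hypothetical helper, not part of the original post */
    double effective_read_gb_per_s(double kernel_seconds, int w, int h)
    {
        /* worst case: 2 blocks * 16*16 texels * 4 bytes (RGBA8) = 2048 bytes per work item */
        double bytes = 2048.0 * (double)w * (double)h;
        return bytes / kernel_seconds / 1e9;
    }

A figure close to the device's memory bandwidth would mean the texture cache is not helping; a figure far above it would mean most reads are already being served from cache.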

+1
