How can I organize an operation on many overlapping but offset blocks of a 2D array for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
    __kernel void test_kernel(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height)
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                  CLK_ADDRESS_CLAMP_TO_EDGE |
                                  CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
        uint4 diff = (uint4)(0, 0, 0, 0);
        for (int i = 0; i < 16; i++) {
            for (int j = 0; j < 16; j++) {
                diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j))
                      - read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
            }
        }
        write_imageui(dest, pos, diff);
    }
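Since uint4 arithmetic wraps around, the loop is equivalent to taking the difference of two 16x16 block sums. With a hypothetical block_sum helper (just to show the structure, not code I actually use) the body boils down to:

    /* hypothetical helper: sum of the 16x16 block whose top-left texel
       is at 'corner' */
    uint4 block_sum(read_only image2d_t img, sampler_t s, int2 corner)
    {
        uint4 acc = (uint4)(0);
        for (int j = 0; j < 16; j++)
            for (int i = 0; i < 16; i++)
                acc += read_imageui(img, s, corner + (int2)(i, j));
        return acc;
    }

    /* inside test_kernel: */
    uint4 diff = block_sum(src, sampler, pos0) - block_sum(src, sampler, pos);

So each output is the difference between the sum over the work item's aligned 16x16 block and the sum over the block starting at the work item itself.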
It produces correct results, but it is slow: only ~25 GFLOPS on an NVS4200M with 1k x 1k input (the hardware spec sheet says 155 GFLOPS). I assume this is due to the memory access patterns. Each work item reads one 16x16 block of data that is identical for all work items in its 16x16 area, plus another, offset 16x16 block that mostly overlaps with those of its nearest neighbors. All reads go through the samplers. The host program is PyOpenCL (I don't think that actually changes anything), and the work-group size is 16x16.
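For context, here is one way to count the work behind that figure (my own accounting, treating each uint4 add or subtract as 4 integer ops):

    /* naive kernel, one launch: 1024*1024 work items, 16*16 loop
       iterations each, one uint4 subtract plus one uint4 add per
       iteration = 8 integer ops */
    long long ops   = 1024LL * 1024 * 256 * 8;  /* ~2.15e9 ops per launch   */
    long long reads = 1024LL * 1024 * 256 * 2;  /* ~5.4e8 image reads       */
    /* ~25 GOPS therefore implies roughly 86 ms per launch, and every
       input texel gets fetched about 512 times on average */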
EDIT: New version of the kernel, as suggested below, copying the work area into local variables:
    __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
    void test_kernel(
        read_only image2d_t src,
        write_only image2d_t dest,
        const int width,
        const int height)
    {
        const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                  CLK_ADDRESS_CLAMP_TO_EDGE |
                                  CLK_FILTER_NEAREST;
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        int dx = pos.x % 16;
        int dy = pos.y % 16;

        __local uint4 local_src[16*16];
        __local uint4 local_src2[32*32];
        local_src[(pos.y % 16) * 16 + (pos.x % 16)] =
            read_imageui(src, sampler, pos);
        local_src2[(pos.y % 16) * 32 + (pos.x % 16)] =
            read_imageui(src, sampler, pos);
        local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] =
            read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
        local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] =
            read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
        local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] =
            read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
        barrier(CLK_LOCAL_MEM_FENCE);

        uint4 diff = (uint4)(0, 0, 0, 0);
        for (int i = 0; i < 16; i++) {
            for (int j = 0; j < 16; j++) {
                diff += local_src[j*16 + i] - local_src2[(j+dy)*32 + i+dx];
            }
        }
        write_imageui(dest, pos, diff);
    }
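One thing I notice about this version: it uses quite a bit of local memory per work group (my arithmetic, with sizeof(uint4) = 16 bytes):

    size_t local_bytes = 16 * 16 * sizeof(uint4)   /*  4096 bytes, local_src  */
                       + 32 * 32 * sizeof(uint4);  /* 16384 bytes, local_src2 */
    /* = 20480 bytes, i.e. 20 KB of local memory per work group */

If the device reports on the order of 48 KB of local memory per compute unit (which I believe is the case for Fermi-class NVIDIA parts such as the NVS4200M), 20 KB per group would allow at most two groups per compute unit to be resident, which could hurt latency hiding.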
Result: the output is correct, but the running time is 56% slower. If only local_src is used (not local_src2), the result is ~10% faster.
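By "only local_src is used" I mean keeping the staged 16x16 reference block but reading the offset block through the sampler again, i.e. an inner loop along these lines:

    /* sketch of the "local_src only" variant: staged reference block,
       offset block still read through the sampler */
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            diff += local_src[j*16 + i]
                  - read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }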
EDIT: For comparison, on much more powerful hardware an AMD Radeon HD 7850 gets 420 GFLOPS against a spec-sheet figure of 1751 GFLOPS. To be fair, the spec figure is for multiply-add, and there is no multiply here, so the expected peak is ~875 GFLOPS; but this is still off by quite a lot from the theoretical performance.
EDIT: To make testing easier for anyone who would like to try this, the host-side program in PyOpenCL is below:
    import pyopencl as cl
    import numpy
    import numpy.random
    from time import time

    CL_SOURCE = '''
    // kernel goes here
    '''

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    prg = cl.Program(ctx, CL_SOURCE).build()

    h, w = 1024, 1024
    src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
    src[:,:,:] = numpy.random.rand(h, w, 4) * 255

    mf = cl.mem_flags
    src_buf = cl.image_from_array(ctx, src, 4)
    fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
    dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

    # launch over the full image with 16x16 work groups; scalar kernel
    # arguments must be passed as numpy.int32, and timing uses the
    # profiling-enabled queue
    ev = prg.test_kernel(queue, (w, h), (16, 16),
                         src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    ev.wait()
    print('kernel time: %.1f ms' % ((ev.profile.end - ev.profile.start) * 1e-6))
EDIT: Thinking more about the memory access patterns, the original naive version may actually be fairly good: when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)), all work items in a work group read the same location (so is that just one read?), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they read consecutive locations (so those reads can be coalesced nicely?).
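Spelling that out for the two reads in the inner loop (the annotations are my interpretation, not something I have verified with a profiler):

    /* as executed by one 16x16 work group, per loop iteration: */
    uint4 a = read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j));
    /* pos0 is the same for all 256 work items in the group, so this
       requests a single shared texel (a broadcast read) */
    uint4 b = read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
    /* pos differs by one texel between adjacent work items, so these
       requests are consecutive and should be served well by the
       texture cache */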