Abysmal performance of OpenCL image sampling vs. OpenGL texture sampling

I recently ported my volume raycaster from OpenGL to OpenCL, which decreased the raycaster's performance by about 90 percent. I tracked the performance degradation down to OpenCL's image sampling functions, which are much slower than the corresponding OpenGL texture sampling functions. With the image sampling and texture fetching removed, both versions of the raycaster ran at about the same speed. To easily benchmark the sampling functions on different hardware, and to rule out some silly mistake in the rest of my RT code, I wrote a small test that compares the OpenCL sampling rate with the OpenGL sampling rate and ran it on different machines, but OpenCL still reached only about 10% of OpenGL's performance.

The OpenCL host code of the test (at least the most important part of it):

void OGLWidget::OCLImageSampleTest()
{
    try
    {
        const int size = 8;
        float Values[4 * size * size * size];

        cl::Kernel kernel = cl::Kernel(program, "ImageSampleTest", NULL);

        // Small 8x8x8 RGBA float source image, copied from host memory
        cl::ImageFormat FormatA(CL_RGBA, CL_FLOAT);
        cl::Image3D CLImage(CLcontext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            FormatA, size, size, size, 0, 0, Values, NULL);

        // 1024x1024 RGBA uint8 target image the kernel writes to
        cl::ImageFormat FormatB(CL_RGBA, CL_UNSIGNED_INT8);
        cl::Image2D TempImage(CLcontext, CL_MEM_WRITE_ONLY, FormatB,
                              1024, 1024, 0, NULL, NULL);

        kernel.setArg(0, CLImage);
        kernel.setArg(1, TempImage);

        // Sampler: normalized coordinates, repeat addressing, linear filtering
        cl::Sampler Samp;
        Samp() = clCreateSampler(CLcontext(), CL_TRUE, CL_ADDRESS_REPEAT,
                                 CL_FILTER_LINEAR, NULL);
        kernel.setArg(2, Samp);

        QTime BenchmarkTimer = QTime();
        BenchmarkTimer.start();

        // 1024x1024 work items in 32x32 work groups
        cl::KernelFunctor func = kernel.bind(queue, cl::NDRange(1024, 1024),
                                             cl::NDRange(32, 32));
        func().wait();

        int Duration = BenchmarkTimer.elapsed();
        printf("OCLImageSampleTest: %d ms \n", Duration);
    }
    catch (cl::Error& err)
    {
        std::cerr << "An OpenCL error occurred, " << err.what()
                  << "\nError num of " << err.err() << "\n";
        return;
    }
}
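
Note that QTime measures host-side wall-clock time, which also includes the enqueue overhead. As a cross-check, the same kernel can be timed with OpenCL profiling events, which report pure device execution time. The following is only a minimal sketch: it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and a `device` variable holding the GPU device, neither of which appears in the code above.

// Hypothetical variant of the timing section above, using event profiling
cl::CommandQueue profQueue(CLcontext, device, CL_QUEUE_PROFILING_ENABLE);

cl::Event evt;
profQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(1024, 1024), cl::NDRange(32, 32),
                               NULL, &evt);
evt.wait();

// Event timestamps are in nanoseconds of device time
cl_ulong start = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
printf("Kernel time: %.3f ms\n", (end - start) * 1e-6);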

OpenCL Kernel:

void kernel ImageSampleTest(read_only image3d_t CoordTexture,
                            write_only image2d_t FrameBuffer,
                            sampler_t smp)
{
    int Screenx = get_global_id(0);
    int Screeny = get_global_id(1);
    int2 PositionOnScreen = (int2)(Screenx, Screeny);

    float4 Testvec = (float4)(1, 1, 1, 1);
    for (int i = 0; i < 2000; i++)
    {
        // i makes sure that the compiler doesn't unroll the loop
        Testvec += read_imagef(CoordTexture, smp, (float4)(0 + 0.00000001 * i, 0, 0, 0));
    }

    uint4 ToInt = (uint4)((uint)(Testvec.x), (uint)(Testvec.y), (uint)(Testvec.z), 1);
    write_imageui(FrameBuffer, PositionOnScreen, ToInt);
}

OpenGL fragment shader for a full-screen quad, which has the same number of fragments as the OpenCL kernel has work items:

#version 150

uniform sampler3D Tex;
out vec4 FragColor;

void main()
{
    FragColor = vec4(0, 0, 0, 0);
    for (int i = 0; i < 2000; i++)
    {
        FragColor += texture(Tex, vec3(0 + 0.00000001 * i, 0, 0), 0);
    }
}
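
For the GL side, a GL_TIME_ELAPSED query (core since OpenGL 3.3, ARB_timer_query before that) gives the pure GPU execution time of the draw, analogous to the CL event timing sketched above. A minimal sketch, where drawFullScreenQuad() is only a placeholder for whatever issues the actual full-screen draw:

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
drawFullScreenQuad();   // placeholder: renders the quad with the shader above
glEndQuery(GL_TIME_ELAPSED);

// GL_QUERY_RESULT blocks until the GPU has finished the timed commands
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
printf("Fragment shader time: %.3f ms\n", elapsedNs * 1e-6);
glDeleteQueries(1, &query);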

In addition, I have already tried the following to improve performance:

- Changing the work-group size: no performance gain

- Different hardware: GTX 280, GTX 580, and some Fermi Tesla cards; all showed the same abysmal OpenCL vs. OpenGL performance

- Different texture formats (bytes instead of floats), different access patterns, and different texture sizes: no gain

- Using a buffer instead of an image for the data, together with a self-written trilinear interpolation function for sampling in the CL kernel: this roughly doubled the OpenCL performance (a sketch of the idea follows after this list)

- Using a 2D image/texture instead of a 3D image/texture: this also roughly doubled the OpenCL performance, while the OpenGL performance did not change at all

- Using "nearest" instead of "linear" interpolation: no change in performance
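
The buffer-based interpolation code itself is not shown above; the following is only a minimal sketch of the idea. The float4-per-voxel layout, the function name, and the edge clamping are assumptions, not the exact code that was benchmarked.

// Hypothetical hand-written trilinear lookup from a plain buffer.
// p is in normalized [0,1] coordinates, like the sampler reads above.
float4 SampleTrilinear(global const float4* Volume, int size, float3 p)
{
    float3 coord = p * (float)(size - 1);
    int3   i0 = convert_int3(floor(coord));
    int3   i1 = min(i0 + 1, size - 1);      // clamp to the volume edge
    float3 f  = coord - floor(coord);        // fractional part for blending

    #define VOXEL(x, y, z) Volume[(z) * size * size + (y) * size + (x)]
    // Blend the eight neighboring voxels along x, then y, then z
    float4 c00 = mix(VOXEL(i0.x, i0.y, i0.z), VOXEL(i1.x, i0.y, i0.z), f.x);
    float4 c10 = mix(VOXEL(i0.x, i1.y, i0.z), VOXEL(i1.x, i1.y, i0.z), f.x);
    float4 c01 = mix(VOXEL(i0.x, i0.y, i1.z), VOXEL(i1.x, i0.y, i1.z), f.x);
    float4 c11 = mix(VOXEL(i0.x, i1.y, i1.z), VOXEL(i1.x, i1.y, i1.z), f.x);
    #undef VOXEL

    float4 c0 = mix(c00, c10, f.y);
    float4 c1 = mix(c01, c11, f.y);
    return mix(c0, c1, f.z);
}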

This left me wondering: Did I make some very dumb mistake that hurts OpenCL performance? Why is OpenCL's sampling performance so low, even though it should use the same texturing hardware as OpenGL? Why is my complex, hand-written trilinear interpolation faster than the hardware implementation? How can I increase the sampling performance in OpenCL so that I get the same speed as in OpenGL?

+54
image opencl sampling opengl textures
Jun 11 '12 at 16:18
1 answer

I suspect there are some problems with OpenCL in the recent NVIDIA drivers on some graphics cards. Here and here are some reports about them. Try the benchmark on a GPU from a different family.
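
To verify which device and driver version a machine is actually running (useful when comparing results across driver releases), something like this minimal sketch with the same C++ bindings as in the question can help:

// Print every GPU device and its driver version on this machine
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
for (size_t p = 0; p < platforms.size(); ++p) {
    std::vector<cl::Device> devices;
    platforms[p].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    for (size_t d = 0; d < devices.size(); ++d) {
        std::cout << devices[d].getInfo<CL_DEVICE_NAME>()
                  << " / driver " << devices[d].getInfo<CL_DRIVER_VERSION>() << "\n";
    }
}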

+2
Nov 04 '13 at 7:55


