Generalized sliding-window computations on a GPU

Here is Python code that computes a sliding window over two three-dimensional arrays, X and Y.

    import numpy

    def sliding_dot( X,Y ) :

        assert X.ndim == Y.ndim == 3
        iw,ih,id = X.shape
        fw,fh,fd = Y.shape

        assert id == fd
        assert fw < iw and fh < ih

        ow,oh = iw-fw+1,ih-fh+1
        out = numpy.zeros( [ow,oh] )

        for x in xrange(ow) :
            for y in xrange(oh) :
                window = X[x:x+fw,y:y+fh,:]
                out[x,y] = numpy.dot( window.flatten(),Y.flatten() )

        return out

    #################

    A_dims = (640,480,32)
    B_dims = (6,6,32)
    A = numpy.random.rand(*A_dims)
    B = numpy.random.rand(*B_dims)

    sliding_dot(A,B)

In the general case, Y is always much smaller than X in the first and second dimensions, but they are equal in the third dimension.

Note that numpy.dot() could be replaced by any function of Y and the window. This is slightly different from convolution in that Y slides only along the first and second dimensions of X. I am looking for an effective strategy for implementing this kind of sliding-window computation efficiently in CUDA. Anyone care to offer me some direction? Cheers!

Update: you can follow my optimization process, with help from other users, in my answer below.

+7
4 answers

Trying to design a "generic" implementation that can accommodate just about any operation you might want is going to be an enormous trade-off in an architecture like CUDA. For your concrete dot-product example, which is a typical reduction operation, this is a pretty useful implementation:

    __constant__ int ldaX[3];
    __constant__ int ldaY[3];
    __constant__ int dimX[3];
    __constant__ int dimY[3];

    template<typename real,int blocksize>
    __global__ void sliding_k(const real *X, const real *Y, real *out)
    {
        __shared__ volatile real buffer[blocksize];
        int tid = threadIdx.x;
        int gid = blockIdx.x * gridDim.y + blockIdx.y;

        real value = (real)0;
        int xpos = (blockIdx.y * ldaX[2]) + (blockIdx.x * ldaX[1]);
        int ypos = 0;
        for(int i=0; i<dimY[0]; i++) {
            for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
                value += X[xpos+jk] * Y[ypos+jk];
            }
            xpos += ldaX[1];
            ypos += ldaY[1];
        }

        buffer[tid] = value;
        __syncthreads();

    #pragma unroll
        for(int i=(tid+32); ((tid<32)&&(i<blocksize)); i+=32)
            buffer[tid] += buffer[i];

        if (tid < 16) buffer[tid] += buffer[tid + 16];
        if (tid < 8)  buffer[tid] += buffer[tid + 8];
        if (tid < 4)  buffer[tid] += buffer[tid + 4];
        if (tid < 2)  buffer[tid] += buffer[tid + 2];
        if (tid == 0) out[gid] = buffer[0] + buffer[1];
    }

You can substitute any reduction operator you like for the floating-point multiply/add operation that a dot product uses, and the code should work fine. Each window calculation is performed by a single block; there is enough parallel work at this window size to justify a block per window. This allows coalesced global memory access and, on Fermi cards, a good amount of L1 cache hits.
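For instance (a hypothetical variant, not part of the implementation above), swapping the multiply/add for an absolute difference turns the same kernel into a sliding sum of absolute differences; the zero identity value and the shared-memory sum tree are unchanged, so only the accumulation line differs:

    // Hypothetical variant of the accumulation loop above: sliding sum of
    // absolute differences (SAD) instead of a dot product.
    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += fabs(X[xpos+jk] - Y[ypos+jk]);   // was X[xpos+jk] * Y[ypos+jk]
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }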

The only assumption I build into the code is that the third dimension of the source array and the window array are equal. This allows the inner two loops to be "fused" into a single operation, because they share a common memory layout. Running a test harness in Python using an improved version of your reference code, with the host code written in PyCUDA, I get this:

    In [15]: %timeit -n3 -r3 out2=sliding_cuda(A,B)
    3 loops, best of 3: 49.8 ms per loop

    In [16]: %timeit -n3 -r3 out=sliding_dot(A,B)
    3 loops, best of 3: 2.18 s per loop

    In [17]: (numpy.abs(out2-out)/numpy.abs(out)).max()
    Out[17]: 4.2921323635558404e-15

when run on a 3GHz Phenom II with a GTX470, using 64 threads per block on a 635x475 2D grid, i.e. about 50 times faster including module loading, setup, and memory transfers using pageable host memory allocations. The kernel itself is about 100 times faster than the Python code, once memory transfers and setup overhead are excluded. Note that this is a double-precision version; Python uses double-precision floating-point arithmetic by default.
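For completeness, here is a sketch of what the PyCUDA host side might look like. My actual harness isn't reproduced here, so treat this as an assumption-laden illustration: kernel_source is assumed to hold the CUDA code above, and sliding_k_f64 is assumed to be an extern "C" wrapper instantiating sliding_k<double,64>.

    import numpy
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule(kernel_source, no_extern_c=True)  # templates need C++ linkage
    sliding_k = mod.get_function("sliding_k_f64")        # assumed extern "C" wrapper

    def sliding_cuda(X, Y, blocksize=64):
        # blocksize must match the template instantiation above
        ow, oh = X.shape[0] - Y.shape[0] + 1, X.shape[1] - Y.shape[1] + 1
        out = numpy.zeros((ow, oh), dtype=numpy.float64)
        # Copy strides (in elements) of the flattened row-major arrays and the
        # window dimensions into the kernel's __constant__ symbols.
        for name, vals in (("ldaX", (X.size, X.shape[1]*X.shape[2], X.shape[2])),
                           ("ldaY", (Y.size, Y.shape[1]*Y.shape[2], Y.shape[2])),
                           ("dimY", Y.shape)):
            ptr, _ = mod.get_global(name)
            drv.memcpy_htod(ptr, numpy.asarray(vals, dtype=numpy.int32))
        # One block per window position, i.e. per output element.
        sliding_k(drv.In(X), drv.In(Y), drv.Out(out),
                  block=(blocksize, 1, 1), grid=(ow, oh))
        return out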

+4

Well, here are some thoughts:

You are doing ~640*480 iterations of numpy.dot, which itself processes 6*6*32 elements. Parallelizing the dot product itself is barely worth it: 192 parallel threads is not enough for the GPU, and reduction on CUDA brings additional trouble. So, IMO, the best way to parallelize your task is to assign one thread to each element of the output array.

Now about memory: the output array will sit in global memory; there isn't much choice there. For the input data, A looks quite good for texture memory, since adjacent threads access adjacent elements. Alternatively, you could manually "cache" it in shared memory, but that doesn't look very advantageous over simply using texture. For B, shared memory is not a good fit, since it would cause bank conflicts: when computing the dot product, all threads in a half-warp access the same element of B (you could start the summation from different elements in different threads, but that (again) doesn't look promising). So the choice is either texture or constant memory. I vote for constant, since (a) constant memory is suited for data that is accessed by all the threads on the device, and (b) you won't pollute the texture cache.
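A minimal sketch of the constant-memory option, assuming the fixed 6x6x32 window from the question (the symbol name Bc and the host call are illustrative):

    // B is 6*6*32 floats = 4.5 KB, well within the 64 KB constant memory bank.
    // Constant memory broadcasts efficiently when all threads read the same
    // element at the same time, which is exactly the access pattern here.
    __constant__ float Bc[6*6*32];

    // Host side, once before the kernel launch:
    //   cudaMemcpyToSymbol(Bc, B_host, 6*6*32*sizeof(float));

    // In the kernel, the inner loop then reads Bc[Y_indx] instead of Y[Y_indx].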

The above is all just my guesswork; to actually achieve good performance, you'd better try out the different variants...

Update regarding your naive implementation

 for (int Yi = 0; Yi < Ydims[0]; Yi++ ) 

Here you access global memory on each loop iteration. That is a huge performance killer. Since you have 3 dimensions, it is better to replace int *Ydims with int3 Ydims (and likewise for Xdims and outdims).
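A minimal sketch of that signature change (essentially what the asker's v0.2 below adopts):

    // Dimensions arrive by value in registers; no global-memory read per loop test.
    __global__ void sliding_dot(float *out, int3 outdims,
                                float *X, int3 Xdims,
                                float *Y, int3 Ydims)
    // host side: pass make_int3(Xdims[0], Xdims[1], Xdims[2]), etc.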

 out[out_indx] += X[X_indx]*Y[Y_indx]; 

Again, a very bad idea. Create a register variable and do all the operations on it, then write to the global array once at the end of the kernel.
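A minimal sketch of the register-accumulator pattern applied to the loop nest above (the asker's v0.2 below does exactly this):

    float total = 0.0f;                          // accumulator lives in a register
    for (int Yi = 0; Yi < Ydims.x; Yi++)
        for (int Yj = 0; Yj < Ydims.y; Yj++)
            for (int k = 0; k < Ydims.z; k++)
                total += X[X_indx] * Y[Y_indx];  // same index math as before
    out[out_indx] = total;                       // single store to global memory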

These optimizations are the first things you should do. The second is to make X and Y 3D textures, so that access to them is cached. I think that after this, CUDA will outperform the CPU.

For further optimization, read the CUDA C Best Practices Guide. It's a must-read, and it will give you a much better idea of how to write efficient GPU code (right now your implementation is too naive).

+1

v0.1 - Naive implementation

Here is my first, naive attempt at making this work:

    __global__ void sliding_dot(float *out, int *outdims, float *X, int *Xdims, float *Y, int *Ydims )
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        int Y_indx = 0;
        int X_indx = 0;
        if ( i < outdims[0] & j < outdims[1] )
        {
            int out_indx = j + i*outdims[1];
            for (int Yi = 0; Yi < Ydims[0]; Yi++ )
            {
                for (int Yj = 0; Yj < Ydims[1]; Yj++ )
                {
                    for (int k = 0; k < Ydims[2]; k++ )
                    {
                        Y_indx = k + Yj*Ydims[2] + Yi*Ydims[2]*Ydims[1];
                        X_indx = k + (j+Yj)*Xdims[2] + (i+Yi)*Xdims[2]*Xdims[1];
                        out[out_indx] += X[X_indx]*Y[Y_indx];
                    }
                }
            }
        }
    }

So far the results are less than desirable. With block size (32,32,1), and grid dimensions p, q chosen such that p*32 >= outdims[0] and q*32 >= outdims[1]:

    method=[ sliding_dot ] gputime=[ 7013.280 ] cputime=[ 18.000 ] occupancy=[ 0.667 ]
    method=[ sliding_dot ] gputime=[ 6945.184 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
    method=[ sliding_dot ] gputime=[ 6990.816 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
    method=[ sliding_dot ] gputime=[ 6931.648 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]

v0.2 - texture<float,1>

I hope everyone is learning as much from this as I am! I followed @aland's suggestions and got a significant speedup:

    texture<float,1> X;
    texture<float,1> Y;

    __global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;

        if ( i < outdims.x & j < outdims.y )
        {
            int out_indx = j + i*outdims.y;
            float total = 0.0f;
            int X_indx = 0;
            int Y_indx = 0;
            for (int Yi=0; Yi<Ydims.x; Yi++ )
            {
                for (int Yj=0; Yj<Ydims.y; Yj++ )
                {
                    for (int k=0; k<Ydims.z; k++ )
                    {
                        Y_indx = k + Yj*Ydims.z + Yi*Ydims.z*Ydims.y;
                        X_indx = k + (j+Yj)*Xdims.z + (i+Yi)*Xdims.z*Xdims.y;
                        total += tex1Dfetch(X,X_indx)*tex1Dfetch(Y,Y_indx);
                    }
                }
            }
            out[out_indx] = total;
        }
    }
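Somewhere on the host side the 1D textures have to be bound to the device arrays; that part of the code isn't shown above, so here is an illustrative CUDA C sketch (d_X, h_X and nX are placeholder names):

    float *d_X;
    cudaMalloc(&d_X, nX*sizeof(float));
    cudaMemcpy(d_X, h_X, nX*sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTexture(0, X, d_X, nX*sizeof(float));  // bind linear memory to texture X
    // ... and likewise for Y ...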

But we are still not running as fast as the CPU:

    method=[ dotconv ] gputime=[ 2224.928 ] cputime=[ 24.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2222.592 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2225.216 ] cputime=[ 10.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2222.752 ] cputime=[ 10.000 ] occupancy=[ 0.667 ]

v0.3 - texture<float,3>

    texture<float,3,cudaReadModeElementType> X;
    texture<float,3,cudaReadModeElementType> Y;

    __global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        if ( i < outdims.x & j < outdims.y )
        {
            int out_indx = j + i*outdims.y;
            float total = 0.0f;
            for (int Yi=0; Yi<Ydims.x; Yi++ )
            {
                for (int Yj=0; Yj<Ydims.y; Yj++ )
                {
                    for (int k=0; k<Ydims.z; k++ )
                    {
                        total += tex3D(X,k,j+Yj,i+Yi) * tex3D(Y,k,Yj,Yi);
                    }
                }
            }
            out[out_indx] = total;
        }
    }
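A 3D texture cannot be bound to plain linear memory; it needs a cudaArray. The setup code isn't shown above, so this is an illustrative sketch (note that tex3D's first coordinate is the fastest-varying one, hence width = Xdims.z):

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(Xdims.z, Xdims.y, Xdims.x);
    cudaArray *arrX;
    cudaMalloc3DArray(&arrX, &desc, extent);

    cudaMemcpy3DParms copy = {0};
    copy.srcPtr = make_cudaPitchedPtr(h_X, Xdims.z*sizeof(float),
                                      Xdims.z, Xdims.y);
    copy.dstArray = arrX;
    copy.extent = extent;
    copy.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    cudaBindTextureToArray(X, arrX, desc);
    // ... and likewise for Y ...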

This is actually a bit slower than v0.2:

    method=[ dotconv ] gputime=[ 2403.360 ] cputime=[ 35.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2392.160 ] cputime=[ 15.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2396.448 ] cputime=[ 15.000 ] occupancy=[ 0.667 ]
    method=[ dotconv ] gputime=[ 2398.880 ] cputime=[ 16.000 ] occupancy=[ 0.667 ]

Thanks for your suggestions!

0

You might want to try separating your reads from your sums from your stores.

So each kernel would have 3 sections:

  • Read from texture/global memory and stage into shared memory for the entire block

     // NOTE: in real CUDA code the shared-array dimensions must be
     // compile-time constants, so treat this as pseudocode.
     __shared__ float blockX[ Ydims.z ][ Ydims.y ][ Ydims.x ];
     __shared__ float blockY[ Ydims.z ][ Ydims.y ][ Ydims.x ];

     // NOTE: make each thread load k elements * 2, rather than each
     // thread loading Ydims.x*y*z elements
     blockX[k][yj][yi] = ...
     blockY[k][yj][yi] = ...

     __syncthreads(); // <-- critical -- all threads in the block must finish
                      // writing to shared memory before any may read the values.
  • #pragma unroll your for loops (see the sketch after this list).
    This will significantly increase your ILP, and there will be much less branching since your loop sizes are constant.

  • Make sure your shared memory accesses are strided appropriately, otherwise bank conflicts will kill your performance.
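A sketch of the unrolling point, assuming the window dimensions are fixed at compile time (6x6x32 in the question); yi here stands for whatever per-thread index the surrounding kernel uses:

    // With constant trip counts the compiler can fully unroll both loops,
    // increasing ILP and removing the loop-branch overhead.
    float total = 0.0f;
    #pragma unroll
    for (int yj = 0; yj < 6; yj++) {
        #pragma unroll
        for (int k = 0; k < 32; k++) {
            total += blockX[k][yj][yi] * blockY[k][yj][yi];
        }
    }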

0
