Optimization of random-access bilinear sampling

I am working on an old-school image warp filter. Essentially, I have a 2D array of pixels (setting aside for now whether they are color, grayscale, float, RGBA, etc.) and another 2D array of vectors (with floating-point components); the image is no smaller than the vector array. In pseudocode, I want to do this:

 FOR EACH PIXEL (x, y):
     vec = vectors[x, y]                    // Get displacement vector
     val = get(img, x + vec.x, y + vec.y)   // Sample input at (x,y) + vec
     output[x, y] = val                     // Write to output

The catch is that get() needs to sample the input image bilinearly, because a vector can point at sub-pixel coordinates. But unlike bilinear sampling in, say, texture mapping, where the interpolation math can be arranged incrementally across a loop so it all reduces to adds, here the reads come from effectively random locations. The definition of get() therefore looks something like this:

 FUNCTION get(in, x, y):
     ix = floor(x); iy = floor(y)    // Integer upper-left coordinates
     xf = x - ix;   yf = y - iy      // Fractional parts
     a = in[ix,   iy  ]              // Four bordering pixel values
     b = in[ix+1, iy  ]
     c = in[ix,   iy+1]
     d = in[ix+1, iy+1]
     ab = lerp(a, b, xf)             // Interpolate horizontally,
     cd = lerp(c, d, xf)
     RETURN lerp(ab, cd, yf)         // then vertically

and lerp() is just

 FUNCTION lerp(a, b, x):
     RETURN (1 - x)*a + x*b
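To make this concrete, here is a minimal C++ sketch of the above, assuming grayscale float pixels in row-major order and clamped borders; Image, Vec2, and warp are names I made up for illustration:

    #include <algorithm>   // std::clamp (C++17)
    #include <cmath>
    #include <vector>

    struct Vec2 { float x, y; };

    struct Image {
        int w, h;
        std::vector<float> px;              // row-major, w*h floats
        float at(int x, int y) const {      // clamped border handling
            x = std::clamp(x, 0, w - 1);
            y = std::clamp(y, 0, h - 1);
            return px[y * w + x];
        }
    };

    static float lerp(float a, float b, float t) { return (1 - t) * a + t * b; }

    static float get(const Image& in, float x, float y) {
        int ix = (int)std::floor(x), iy = (int)std::floor(y);
        float xf = x - ix, yf = y - iy;     // fractional parts
        float a = in.at(ix,     iy);        // four bordering pixels
        float b = in.at(ix + 1, iy);
        float c = in.at(ix,     iy + 1);
        float d = in.at(ix + 1, iy + 1);
        return lerp(lerp(a, b, xf), lerp(c, d, xf), yf);
    }

    static void warp(const Image& in, const std::vector<Vec2>& vectors, Image& out) {
        for (int y = 0; y < out.h; ++y)
            for (int x = 0; x < out.w; ++x) {
                Vec2 v = vectors[y * out.w + x];
                out.px[y * out.w + x] = get(in, x + v.x, y + v.y);
            }
    }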

Assuming that neither the input image nor the vector array is known in advance, what high-level optimizations are possible? (Note: "use the GPU" doesn't count.) One thought is to rearrange the interpolation math in get() so that pixel reads and intermediate results can be cached per (ix, iy); then, if consecutive accesses land in the same sub-pixel square, some of the work can be skipped. If the vector array is known in advance, we could reorder it so that the coordinates passed to get() are more local. That might also help cache locality, except that the writes to output would then be scattered everywhere. And it would rule out fancy tricks such as scaling the vectors on the fly, or even translating the warp effect away from its original precomputed position.

The only other possibility I can think of is to use fixed-point vector components, perhaps with very limited fractional parts. For example, if the vectors have only 2-bit fractional components, there are just 4 × 4 = 16 sub-pixel positions that can be addressed. We could precompute the interpolation weights for each of them and avoid much of the interpolation math, at some cost in quality.
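A sketch of that idea (the names and the 8-bit weight scale are my assumptions, not a reference implementation): with 2 fractional bits per axis there are only 16 weight sets, so they can be built once up front.

    #include <cstdint>

    struct Weights { uint16_t af, bf, cf, df; };
    static Weights wtab[4][4];              // one entry per sub-pixel position

    void init_weight_table() {
        for (int yi = 0; yi < 4; ++yi)
            for (int xi = 0; xi < 4; ++xi) {
                int xf = xi * 64, yf = yi * 64;   // fractions scaled to 0..256
                wtab[yi][xi] = {
                    uint16_t(((256 - xf) * (256 - yf)) >> 8),
                    uint16_t(((      xf) * (256 - yf)) >> 8),
                    uint16_t(((256 - xf) * (      yf)) >> 8),
                    uint16_t(((      xf) * (      yf)) >> 8),
                };
            }
    }

    // ix, iy: integer pixel; xfrac, yfrac: the 2-bit fractions (0..3).
    // Caller must keep ix+1 and iy+1 inside the image.
    uint8_t get_fixed(const uint8_t* in, int stride, int ix, int iy,
                      int xfrac, int yfrac) {
        const Weights& w = wtab[yfrac][xfrac];
        const uint8_t* p = in + iy * stride + ix;
        unsigned sum = p[0] * w.af + p[1] * w.bf
                     + p[stride] * w.cf + p[stride + 1] * w.df;
        return uint8_t(sum >> 8);           // the four weights sum to 256
    }

Each sample is then just four multiply-adds with table-driven weights and no per-sample weight computation.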

Any other ideas? I want to collect several different approaches before implementing them to see which works best. If someone could point me at the source code of a fast implementation, that would be great too.

1 answer

An interesting problem.

Your problem definition essentially guarantees unpredictable access to in[x, y], since any vector can appear. Assuming the vector image tends to refer to nearby pixels, the very first optimization is to make sure memory is walked in an order that maximizes cache locality. That may mean scanning the "for each pixel" loop in 32×32 blocks, so that nearby reads of in[x, y] happen as close together in time as possible.
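A sketch of that blocked traversal, reusing the Image/Vec2/get() helpers from the question's sketch; the tile size of 32 is a tuning parameter, not a magic constant:

    const int TILE = 32;

    void warp_tiled(const Image& in, const std::vector<Vec2>& vectors, Image& out) {
        for (int ty = 0; ty < out.h; ty += TILE)
            for (int tx = 0; tx < out.w; tx += TILE) {
                int yend = std::min(ty + TILE, out.h);
                int xend = std::min(tx + TILE, out.w);
                for (int y = ty; y < yend; ++y)        // walk one tile at a time
                    for (int x = tx; x < xend; ++x) {  // so reads of in[] cluster
                        Vec2 v = vectors[y * out.w + x];
                        out.px[y * out.w + x] = get(in, x + v.x, y + v.y);
                    }
            }
    }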

Most likely, the performance of your algorithm will be bound by two things:

  • How fast you can load vectors[x,y] and in[x,y] from main memory
  • How long the multiplies and adds take

There are SSE instructions that can multiply several elements at once and then add the results together (multiply and accumulate). What you want to do is calculate

 af = (1 - xf) * (1 - yf)
 bf = (    xf) * (1 - yf)
 cf = (1 - xf) * (    yf)
 df = (    xf) * (    yf)

and then calculate

 a *= af
 b *= bf
 c *= cf
 d *= df
 RETURN (a + b + c + d)

There is a good chance that both of these steps can be performed with a surprisingly small number of SSE instructions (depending on your representation of pixels).
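For example, a sketch assuming float pixels and SSE4.1 (bilerp_sse is my own helper name; the _mm_set_ps loads are for clarity, and a tuned version would use real vector loads and shuffles):

    #include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

    float bilerp_sse(const float* in, int stride, int ix, int iy,
                     float xf, float yf) {
        // Lanes 0..3 of w become [af, bf, cf, df]; note that _mm_set_ps
        // takes its arguments from the high lane down to the low lane.
        __m128 xw = _mm_set_ps(xf, 1.0f - xf, xf, 1.0f - xf);
        __m128 yw = _mm_set_ps(yf, yf, 1.0f - yf, 1.0f - yf);
        __m128 w  = _mm_mul_ps(xw, yw);

        // Lanes 0..3 of px are the four neighbours [a, b, c, d].
        const float* p = in + iy * stride + ix;
        __m128 px = _mm_set_ps(p[stride + 1], p[stride], p[1], p[0]);

        // Multiply all four pairs and horizontally add: one dot product.
        return _mm_cvtss_f32(_mm_dp_ps(px, w, 0xF1));
    }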

I think caching intermediate values is very unlikely to be useful: it seems improbable that more than 1% of vector lookups would point to the same place, and the caching would cost you more in memory bandwidth than it saves.

If you use your processor's prefetch instructions to prefetch in[vectors[x+1, y]] while processing vectors[x, y], you can improve memory performance; the CPU cannot otherwise predict what is essentially a random walk around memory.
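A sketch of that, again reusing the helpers from the question's sketch; _mm_prefetch is only a hint, and the lookahead distance of one pixel is a guess that would need tuning:

    #include <xmmintrin.h>  // _mm_prefetch

    void warp_prefetch(const Image& in, const std::vector<Vec2>& vectors, Image& out) {
        for (int y = 0; y < out.h; ++y)
            for (int x = 0; x < out.w; ++x) {
                if (x + 1 < out.w) {    // hint the cache about the next sample
                    Vec2 n = vectors[y * out.w + x + 1];
                    int nx = std::clamp((int)(x + 1 + n.x), 0, in.w - 1);
                    int ny = std::clamp((int)(y + n.y),     0, in.h - 1);
                    _mm_prefetch((const char*)&in.px[ny * in.w + nx], _MM_HINT_T0);
                }
                Vec2 v = vectors[y * out.w + x];
                out.px[y * out.w + x] = get(in, x + v.x, y + v.y);
            }
    }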

The last way to improve performance is to process chunks of pixels at a time, i.e. x[0..3], x[4..7], which lets you unroll the inner math loops. However, you will probably be memory-bound, in which case this won't help much.
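A sketch of the chunked loop (same caveat: if the loads dominate, the unrolling buys little):

    void warp_unrolled(const Image& in, const std::vector<Vec2>& vectors, Image& out) {
        for (int y = 0; y < out.h; ++y) {
            int x = 0;
            for (; x + 4 <= out.w; x += 4) {  // four independent samples per step
                const Vec2* v = &vectors[y * out.w + x];
                float* o = &out.px[y * out.w + x];
                o[0] = get(in, x + 0 + v[0].x, y + v[0].y);
                o[1] = get(in, x + 1 + v[1].x, y + v[1].y);
                o[2] = get(in, x + 2 + v[2].x, y + v[2].y);
                o[3] = get(in, x + 3 + v[3].x, y + v[3].y);
            }
            for (; x < out.w; ++x) {          // remainder pixels
                Vec2 v = vectors[y * out.w + x];
                out.px[y * out.w + x] = get(in, x + v.x, y + v.y);
            }
        }
    }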
