I am working on an old-school image warp filter. Essentially, I have a 2D array of pixels (currently ignoring the question of whether they are color, grayscale, float, RGBA, etc.) and another 2D array of vectors (with floating point components), moreover the image is in less than a vector array. In pseudocode, I want to do this:
FOR EACH PIXEL (x,y) vec = vectors[x,y]
The catch is that get() needs to bilinearly try an input image because vectors can refer to sub-pixel coordinates. But unlike bilinear sampling, say, in texturing, where we can work with interpolation mathematics in a loop, so all this just adds, here the reading comes from random locations. Therefore, the definition of get() looks something like this:
FUNCTION get(in,x,y) ix = floor(x); iy = floor(y) // Integer upper-left coordinates xf = x - ix; yf = y - iy // Fractional parts a = in[ix,iy]; b = in[iy+1,iy] // Four bordering pixel values b = in[ix,iy+1]; d = in[ix+1,iy+1] ab = lerp(a,b,xf) // Interpolate cd = lerp(c,d,xf) RETURN lerp(ab,cd,yf)
and lerp() just
FUNCTION lerp(a,b,x) RETURN (1-x)*a + x*b
Assuming that neither the input image nor the vector array are known in advance, what high-level optimizations are possible? (Note: βUse the GPUβ changes.) I might think of rearranging the interpolation maths in get() so that we can cache pixel readings and intermediate calculations for this (ix, iy). Thus, if sequential access to the same subpixel square, we can avoid some work. If the vector array is known in advance, then we can rebuild it so that the coordinates going to get() are more local. It may also help in localizing the cache, but because the entries in output will be everywhere. But then we cannot do fancy things, such as scaling vectors on the fly or even moving the warp effect from its original pre-calculated location.
The only other possibility would be to use fixed-point vector components, possibly with very limited fractional parts. For example, if vectors have only 2-bit fractional components, then there are only 16 sub-pixel areas that can be accessed. We could precompote weights for them and avoid much of the interpolation math, but with a blow to quality.
Any other ideas? I want to accumulate several different methods before I implement them and see which one is better. If someone could point me to the source code of a quick implementation, that would be great.