I have a few thoughts and a possible solution that you can consider.
First, consider tracking individual delta pixels and transmitting/saving only those. In a typical interactive session, only very small parts of the user interface change at any given moment; moving or resizing windows is (anecdotally) relatively rare over a long session of computer use. Per-pixel deltas therefore capture simple things like typed text, cursor movement, and small UI updates effectively and with little extra work.
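As a minimal sketch of the per-pixel delta idea, assuming frames arrive as NumPy arrays of shape (H, W, 3) (how you capture the frames is a separate problem):

```python
import numpy as np

def pixel_delta(prev_frame: np.ndarray, curr_frame: np.ndarray):
    """Return the coordinates and new values of every pixel that changed."""
    # Boolean mask of pixels where any colour channel differs.
    changed = np.any(prev_frame != curr_frame, axis=-1)
    ys, xs = np.nonzero(changed)
    # (y, x, new_value) triples -- this is what you would transmit or
    # append to the recording for this frame.
    return list(zip(ys.tolist(), xs.tolist(), curr_frame[ys, xs].tolist()))

def apply_delta(frame: np.ndarray, delta):
    """Replay a stored delta onto a frame during playback."""
    for y, x, value in delta:
        frame[y, x] = value
    return frame
```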
You could also try hooking into the OS at a lower level to obtain, for example, a list of dirty pixels or even (ideally) a list of "damage" rectangles. The Mac OS X Quartz compositor, for instance, may be able to provide this information. That lets you narrow down quickly what needs to be updated, and in the best case gives you an efficient delta view of the screen essentially for free.
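If the OS can hand you damage rectangles, storing deltas reduces to cropping those regions out of the new frame. A small sketch, where the `(x, y, w, h)` rectangle format is an assumption for illustration (real APIs each have their own types):

```python
import numpy as np

def deltas_from_damage(curr_frame: np.ndarray, damage_rects):
    """Keep only the sub-images covered by OS-reported damage rectangles.

    damage_rects: iterable of (x, y, w, h) tuples in pixel coordinates.
    Returns a list of (x, y, sub_image) entries to store for this frame.
    """
    deltas = []
    for x, y, w, h in damage_rects:
        # Copy so the stored delta does not alias the live frame buffer.
        deltas.append((x, y, curr_frame[y:y + h, x:x + w].copy()))
    return deltas

def apply_rect_deltas(frame: np.ndarray, deltas):
    """Blit stored damage regions back onto a frame during playback."""
    for x, y, sub in deltas:
        h, w = sub.shape[:2]
        frame[y:y + h, x:x + w] = sub
    return frame
```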
If you can query the window manager for information about the windows on screen, you could store a separate stream of pixel deltas per visible window, plus a simple display list describing how to "render" (composite) them during playback. Detecting a moved window then becomes trivial, because you only need to diff the display lists.
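A sketch of the per-window bookkeeping, assuming the window manager can report each window's id, position, and stacking order (the field names below are made up for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class WindowEntry:
    """One display-list item: where a window sits and its stacking order."""
    window_id: int
    x: int
    y: int
    z: int  # stacking order, higher is closer to the viewer

@dataclass
class Frame:
    display_list: list                                  # list[WindowEntry]
    window_deltas: dict = field(default_factory=dict)   # window_id -> pixel deltas

def moved_windows(prev: Frame, curr: Frame):
    """Find moved windows by diffing the display lists -- no image processing."""
    prev_pos = {w.window_id: (w.x, w.y) for w in prev.display_list}
    return [w for w in curr.display_list
            if w.window_id in prev_pos and prev_pos[w.window_id] != (w.x, w.y)]
```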
If you can query the OS for the cursor position, you can use cursor movement as a cheap estimate of the motion delta, since cursor movement usually correlates well with object movement on screen (e.g. moving windows, moving icons, dragging objects, etc.). This avoids doing image processing just to determine the motion delta.
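A sketch that takes the cursor delta as the motion estimate, with an optional cheap sanity check comparing one small patch under the cursor (frames as NumPy arrays, cursor positions as (x, y) tuples; the patch size and tolerance are arbitrary choices):

```python
import numpy as np

def cursor_motion_hypothesis(prev_frame, curr_frame, cursor_prev, cursor_curr,
                             patch=64, tolerance=0.02):
    """Return (dx, dy) if the content under the cursor moved with the cursor."""
    (px, py), (cx, cy) = cursor_prev, cursor_curr
    dx, dy = cx - px, cy - py
    if dx == 0 and dy == 0:
        return None

    h, w = prev_frame.shape[:2]
    half = patch // 2
    # Give up near the screen edges to keep this sketch simple.
    if not (half <= px < w - half and half <= py < h - half and
            half <= cx < w - half and half <= cy < h - half):
        return None

    # Compare the patch around the old cursor position in the previous frame
    # with the patch around the new cursor position in the current frame.
    before = prev_frame[py - half:py + half, px - half:px + half]
    after = curr_frame[cy - half:cy + half, cx - half:cx + half]
    mismatch = np.mean(np.any(before != after, axis=-1))
    return (dx, dy) if mismatch < tolerance else None
```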
As for a possible solution (or a fallback if you still cannot determine the motion delta with the above): the (very common) case of a single moving rectangle can be handled fairly easily. Build a mask of all pixels that changed between frames. Find the largest connected component in the mask. If it is approximately rectangular, you can assume it represents a moved region. Either the window moved purely orthogonally (i.e. only in x or only in y), in which case the combined delta looks like a slightly larger rectangle, or it moved diagonally, in which case the combined delta has an eight-sided shape. In either case you can estimate the motion vector and verify it by comparing the shifted regions. Note that this procedure deliberately ignores details you would have to handle in practice, e.g. pixels that change independently of the window, or regions that do not change as the window moves (e.g. large blocks of solid colour inside it). A practical implementation would have to cope with all of that.
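A rough sketch of that procedure, using SciPy for the connected-component step. Instead of deriving the vector from the polygon shape, this version simply tries a small range of candidate shifts and keeps the one that explains most of the changed region, which keeps the code short; all of the caveats above still apply:

```python
import numpy as np
from scipy import ndimage

def estimate_window_motion(prev_frame, curr_frame, max_shift=32):
    """Estimate a single (dx, dy) motion vector for one moved rectangle."""
    # 1. Mask of all pixels that changed between the two frames.
    changed = np.any(prev_frame != curr_frame, axis=-1)

    # 2. Largest connected component of the change mask.
    labels, count = ndimage.label(changed)
    if count == 0:
        return None
    sizes = ndimage.sum(changed, labels, index=range(1, count + 1))
    component = labels == (int(np.argmax(sizes)) + 1)

    # 3. Bounding box of the component (covers old and new window positions).
    ys, xs = np.nonzero(component)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    region_prev = prev_frame[y0:y1, x0:x1]
    region_curr = curr_frame[y0:y1, x0:x1]

    # 4. Try candidate shifts and verify each by comparing the shifted regions:
    #    how much of the current region is explained by moving the old pixels?
    best, best_score = None, 0.0
    hh, ww = region_prev.shape[:2]
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            if dx == 0 and dy == 0:
                continue
            ya, yb = max(0, dy), min(hh, hh + dy)
            xa, xb = max(0, dx), min(ww, ww + dx)
            if yb - ya <= 0 or xb - xa <= 0:
                continue
            shifted = region_prev[ya - dy:yb - dy, xa - dx:xb - dx]
            target = region_curr[ya:yb, xa:xb]
            score = np.mean(np.all(shifted == target, axis=-1))
            if score > best_score:
                best, best_score = (dx, dy), score
    return best
```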
Finally, I would look at the existing literature on real-time motion estimation. A great deal of work has gone into optimising motion estimation and compensation, e.g. for video encoding, and you can draw on that work if the methods above turn out to be inadequate.