I'm not sure whether it is appropriate to answer my own question. But since no one has answered it and I have done some investigation (not a success story), I think it's better to share what I found. It may be useful to others who have the same problem.
I have a sequence of images of an urban traffic scene, captured every 0.5 seconds with a smartphone on a moving car.
For testing, I used only a few image pairs rather than the whole sequence. I obtained corresponding points using KLT and applied a two-step outlier removal. The resulting correspondences are good: very few mismatches remain.
To reject points on moving objects, I followed the work presented in:
Jung, B. and Sukhatme, G.S., 2004. "Detecting moving objects using a single camera on a mobile robot in an outdoor environment" (the revised journal version is titled "Real-time motion tracking from a mobile robot").
To summarize, they reject outliers (points on moving objects) by estimating a transformation model between the two images of a pair; the paper uses a bilinear model. They compute the parameters of the transformation model T and reject a correspondence if |x2 - T(x1)| > threshold, where x1 and x2 are a pair of corresponding points in the images at times t1 and t2.
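A minimal sketch of that procedure, as I understand it and not the authors' code: fit a bilinear T by ordinary least squares and flag every correspondence whose residual |x2 - T(x1)| exceeds the threshold. The function names, the synthetic grid of points, the displacements, and the 3-pixel threshold are all my own assumptions for illustration.

```python
import numpy as np

def bilinear_design(pts):
    """Design matrix with columns [1, x, y, x*y] for an (N, 2) array of points."""
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y])

def flag_outliers(x1, x2, threshold):
    """Fit a bilinear T: x1 -> x2 by least squares and return a boolean mask
    that is True where |x2 - T(x1)| > threshold."""
    A = bilinear_design(x1)
    coeffs, *_ = np.linalg.lstsq(A, x2, rcond=None)   # shape (4, 2): one column per output coordinate
    residual = np.linalg.norm(x2 - A @ coeffs, axis=1)
    return residual > threshold

# Synthetic correspondences (assumed data): a 15 x 15 grid of points in pixels.
g = np.linspace(0.0, 280.0, 15)
xx, yy = np.meshgrid(g, g)
x1 = np.column_stack([xx.ravel(), yy.ravel()])
x2 = x1 + np.array([3.0, 0.5])                 # motion induced by the camera

# Five points near the image centre lie on one independently moving object.
on_object = np.array([97, 111, 112, 113, 127])
x2[on_object] += np.array([15.0, -10.0])

rejected = flag_outliers(x1, x2, threshold=3.0)
print(np.flatnonzero(rejected))
```

With only five displaced points out of 225, the fit is barely disturbed and only those five points are flagged. This is the favourable case the method is designed for.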
I tried an affine model, a bilinear model, and a pseudo-perspective model for T. My experiments show that unless the number of moving objects is small, this procedure fails, because the model is estimated from the very correspondences it is supposed to filter. In my case the images were captured on an urban highway with many moving objects, so I could not reject the outliers with this technique, and I believe RANSAC would not help either. This is probably why many papers assume a small number of moving objects. Among the three models, the affine model gave the worst results; I cannot say which of the other two is better.
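To make the failure mode concrete, here is a small sketch on synthetic data, not my real images. It fits each of the three models by least squares and counts how many static points end up above the threshold when half of the correspondences lie on moving objects. The parameterizations are the textbook forms as I know them (the pseudo-perspective model shares its two quadratic parameters between the x and y equations); the normalized coordinates, the displacements, the threshold, and all names are assumptions.

```python
import numpy as np

def design_rows(pts, model):
    """Return (A_x, A_y), the design matrices of the x' and y' equations.
    Both use the same parameter vector, so they have the same number of columns."""
    x, y = pts[:, 0], pts[:, 1]
    one, zero = np.ones_like(x), np.zeros_like(x)
    if model == "affine":                 # 6 parameters
        ax = [one, x, y, zero, zero, zero]
        ay = [zero, zero, zero, one, x, y]
    elif model == "bilinear":             # 8 parameters
        ax = [one, x, y, x * y, zero, zero, zero, zero]
        ay = [zero, zero, zero, zero, one, x, y, x * y]
    elif model == "pseudo_perspective":   # 8 parameters, the last two are shared
        ax = [one, x, y, zero, zero, zero, x * x, x * y]
        ay = [zero, zero, zero, one, x, y, x * y, y * y]
    else:
        raise ValueError(f"unknown model: {model}")
    return np.column_stack(ax), np.column_stack(ay)

def fit_and_residuals(x1, x2, model):
    """Least-squares fit of the chosen model; returns |x2 - T(x1)| per point."""
    ax, ay = design_rows(x1, model)
    A = np.vstack([ax, ay])                        # stack the x' and y' equations
    b = np.concatenate([x2[:, 0], x2[:, 1]])
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    pred = np.column_stack([ax @ params, ay @ params])
    return np.linalg.norm(x2 - pred, axis=1)

# Synthetic correspondences in normalized image coordinates (assumed data).
g = np.linspace(-1.0, 1.0, 15)
xx, yy = np.meshgrid(g, g)
x1 = np.column_stack([xx.ravel(), yy.ravel()])
moving = np.arange(len(x1)) % 2 == 0       # every other point is on a moving object
x2 = x1 + np.array([0.01, 0.0])            # motion induced by the camera
x2[moving] += np.array([0.2, 0.0])         # independent object motion

threshold = 0.02
false_rejects = {
    model: int((fit_and_residuals(x1, x2, model)[~moving] > threshold).sum())
    for model in ("affine", "bilinear", "pseudo_perspective")
}
print(false_rejects)
```

In this deliberately contrived setup the fitted model lands roughly halfway between the static and the moving points, so the static points get residuals of about half the object displacement and are rejected along with the moving ones, whichever model is used. It only illustrates the mechanism; it says nothing about which model is better on real data.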
Hope this helps.
Sonia