You need some kind of global/local tracking method with weighted cost terms such as spatial consistency (how far an object has moved relative to its neighbours), shape consistency (how much its shape has changed), and penalties for merging or splitting tracks. A rough sketch of this idea is below.
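Here is a minimal sketch of that kind of weighted-cost, frame-to-frame association (not a specific published method): each object is summarized by a centroid and a bounding-box size, a cost matrix mixes a spatial term and a shape term, and assignments costlier than a fixed penalty are treated as a track ending or starting. The weights and penalty value are placeholders you would tune.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

W_SPATIAL = 1.0          # weight on displacement (spatial consistency), placeholder
W_SHAPE = 0.5            # weight on size change (shape consistency), placeholder
SPLIT_MERGE_COST = 50.0  # cost above which a pairing is rejected (track ends / new track starts)

def pairwise_cost(tracks, detections):
    """tracks, detections: arrays of shape (N, 4) / (M, 4) as (cx, cy, w, h)."""
    t = tracks[:, None, :]       # (N, 1, 4)
    d = detections[None, :, :]   # (1, M, 4)
    spatial = np.linalg.norm(t[..., :2] - d[..., :2], axis=-1)  # centroid distance
    shape = np.abs(t[..., 2:] - d[..., 2:]).sum(axis=-1)        # change in box size
    return W_SPATIAL * spatial + W_SHAPE * shape                # (N, M) cost matrix

def associate(tracks, detections):
    """Return (track_idx, det_idx) matches via the Hungarian algorithm;
    pairs costlier than SPLIT_MERGE_COST are dropped."""
    cost = pairwise_cost(tracks, detections)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < SPLIT_MERGE_COST]

# Example: two tracks, two detections (one moved slightly, one jumped far away).
tracks = np.array([[10.0, 10.0, 5.0, 5.0],
                   [50.0, 50.0, 8.0, 8.0]])
detections = np.array([[12.0, 11.0, 5.0, 5.0],
                       [120.0, 90.0, 8.0, 8.0]])
print(associate(tracks, detections))  # -> [(0, 0)]; the second track is left unmatched
```

A real global tracker would optimize these costs over many frames at once (e.g. as a min-cost flow) rather than greedily frame to frame, but the cost structure is the same.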
A similar problem comes up in cell tracking for biomedical imaging; some links from this conference may be helpful.
Edit:
bjoernz makes a wonderful point in the comments. If you can add some form of fiducials to the scene, the task will be much simpler.
The marker does not even have to be visible at visible wavelengths. For example, you could mark a sheet with infrared-reflective paint and detect it with an IR camera; the IR camera can then be registered to the usual visible-light camera.
For a pure computer-vision solution without markers, see my answer above.