In computer vision, what does MVS do that SFM cannot?

I am a developer with ten years of corporate software development under my belt, and my hobby interests have pulled me into the vast and intimidating field of computer vision (CV).

One thing that is not immediately clear to me is the division of labor between Structure from Motion (SfM) tools and Multi-View Stereo (MVS) tools.

In particular, CMVS seems to be one of the best tools on the MVS side, and Bundler seems to be one of the best open-source SfM tools.

Taken from the CMVS homepage:

You should ALWAYS use CMVS after the Bundler and before PMVS2

I wonder: why?! My understanding of SfM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools at all? What value/processing/functionality do they add that SfM tools such as Bundler cannot provide? Why the proposed pipeline:

Bundler -> CMVS -> PMVS2 

?


Briefly put, Structure from Motion (SfM) and Multi-View Stereo (MVS) methods are complementary, since they do not deal with the same assumptions. They also differ slightly in their inputs: MVS requires camera parameters to run, and these are estimated (output) by SfM. SfM gives only a coarse 3D output, PMVS2 gives a denser output, and finally CMVS works around some of the limitations of PMVS2.

The rest of this answer gives a high-level overview of how each method works and explains why this is the case.

Structure from Motion

The first step of the pipeline you quoted is the SfM algorithm, which can be run with Bundler, VisualSFM, OpenMVG, or the like. This algorithm takes input images and outputs the camera parameters of each image (more on this later), as well as a coarse 3D shape of the scene, often called the sparse reconstruction.

Why does SfM output only a coarse 3D shape? Basically, SfM methods begin by detecting 2D features in each input image and matching those features between pairs of images. The goal is, for example, to be able to say "this corner of the table sits at these pixel positions in these images." Those features are described by so-called descriptors (such as SIFT or ORB). Descriptors are designed to represent a small region (i.e., a bunch of neighboring pixels) of an image. They can capture strongly textured areas or sharp geometry (e.g., edges), but a feature is only useful if it is distinctive (in the sense of being unique) across the scene. For example (perhaps oversimplified), a wall with a repeating pattern would not be very useful for reconstruction, because despite being highly textured, every region of the wall would match pretty much every other region of the wall.

Since SfM performs its 3D reconstruction from these features, the vertices of the reconstructed 3D scene will lie on those distinctive textures or edges, which gives a coarse, sparse output. SfM will usually not create a vertex in the middle of a surface that lacks a precise and distinctive texture. However, once many matches are found between two images, a 3D transformation between them can be computed, effectively giving the relative 3D pose of the two cameras.
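To make that concrete, here is a minimal sketch of the first stages of SfM in Python with OpenCV: detect features, match them between two images, and recover the relative pose. The image filenames and the camera matrix K are placeholder assumptions, and a full SfM tool like Bundler does far more (tracking across many views, triangulation, bundle adjustment):

```python
import cv2
import numpy as np

# Placeholder inputs: two overlapping images and an assumed intrinsic matrix K.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # assumed intrinsics

# Detect ORB features and compute their descriptors in each image.
orb = cv2.ORB_create(5000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors between the two images (Hamming distance suits ORB).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Gather the matched pixel coordinates in both images.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Estimate the essential matrix and recover the relative camera pose (R, t).
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("Relative rotation:\n", R)
print("Relative translation (up to scale):", t.ravel())
```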

Multi-View Stereo

An MVS algorithm is then used to refine the output of the SfM method, leading to what is called a dense reconstruction. This algorithm requires the camera parameters of each image to work, which is exactly what the SfM step outputs. Since MVS works on a more constrained problem (it already has the camera parameters of each image, such as position, rotation, focal length, etc.), it can compute 3D vertices in regions that were not (or could not be) correctly detected and matched via descriptors. This is what PMVS2 does.
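As a small illustration of what "having the camera parameters" buys you: once position, rotation, and focal length are known, the 2D projection of any 3D point is fully determined. The numbers below are made up for the example:

```python
import cv2
import numpy as np

# Assumed camera parameters for one image, as SfM would estimate them.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # intrinsics
rvec = np.zeros(3)                  # rotation (Rodrigues vector), identity here
tvec = np.array([0.1, 0.0, 0.0])    # small translation of the camera

# A hypothetical 3D point in the scene, in front of the camera.
point_3d = np.array([[0.5, -0.2, 3.0]])

# With known camera parameters, its 2D projection is fully determined.
pixel, _ = cv2.projectPoints(point_3d, rvec, tvec, K, None)
print("Projects to pixel:", pixel.ravel())
```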

How can PMVS work in regions where a 2D feature descriptor would fail to match? Because once you know the camera parameters, you know that a given pixel in one image corresponds to a line in another image. This constraint is called epipolar geometry. While SfM had to search the entire 2D image to find a potential match for each descriptor, MVS only needs to search along a single 1D line, which greatly simplifies the problem. In addition, MVS typically models lighting and object materials in its optimization, which SfM does not.
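A quick sketch of that epipolar constraint with OpenCV: given matched points between two images (for instance from the ORB matching sketch above), we can estimate the fundamental matrix and then compute, for any pixel in the first image, the line in the second image on which its match must lie. The coordinates here are placeholders:

```python
import cv2
import numpy as np

# Placeholder matched pixel coordinates between two images
# (in practice these come from feature matching, as above).
pts1 = np.float32([[100, 120], [250, 80], [400, 300], [320, 220],
                   [150, 310], [500, 150], [420, 90], [60, 260]])
pts2 = np.float32([[110, 125], [260, 85], [395, 310], [330, 215],
                   [155, 305], [505, 160], [430, 95], [70, 255]])

# Estimate the fundamental matrix encoding the epipolar geometry.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

# For a pixel in image 1, its match in image 2 lies on this epipolar line.
pixel = np.float32([[[100, 120]]])
line = cv2.computeCorrespondEpilines(pixel, 1, F)  # line (a, b, c): ax+by+c=0
print("Epipolar line in image 2:", line.ravel())
```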

However, there is one problem: PMVS2 performs a rather complex optimization that can be terribly slow or consume astronomical amounts of memory on large image sequences. This is where CMVS comes into play: it clusters the coarse 3D output of SfM into regions, so that PMVS2 can be run (potentially in parallel) on each cluster, simplifying its execution. The outputs of each PMVS2 run are then merged into a single detailed model.
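The sketch below illustrates the clustering idea only, not CMVS's actual algorithm: it naively splits the image set into fixed-size clusters and runs a dense-reconstruction step on each cluster in parallel. The run_dense_reconstruction function is a hypothetical stand-in for invoking PMVS2 on one cluster:

```python
from multiprocessing import Pool

def run_dense_reconstruction(cluster):
    # Hypothetical stand-in for running PMVS2 on one cluster of images;
    # it would return the dense points reconstructed from that cluster.
    images, cameras = cluster
    return f"dense patch from {len(images)} images"

def naive_clusters(images, cameras, max_size=50):
    # CMVS clusters views far more cleverly (using the sparse SfM points
    # and their visibility); this just chunks the list for illustration.
    for i in range(0, len(images), max_size):
        yield (images[i:i + max_size], cameras[i:i + max_size])

if __name__ == "__main__":
    images = [f"img_{i:04d}.jpg" for i in range(200)]  # placeholder image list
    cameras = [None] * len(images)                     # placeholder camera params

    clusters = list(naive_clusters(images, cameras))
    with Pool() as pool:  # each cluster is independent, so run them in parallel
        partial_models = pool.map(run_dense_reconstruction, clusters)

    # Merge the per-cluster outputs into one detailed model.
    print(f"merged {len(partial_models)} partial reconstructions")
```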

Conclusion

Most of the information in this answer, and much more, can be found in this tutorial by Yasutaka Furukawa, author of CMVS and PMVS2: http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf

In essence, the two methods stem from two different approaches: SfM aims to perform 3D reconstruction from a structured (but a priori unknown) sequence of images, while MVS is a generalization of the two-view case of stereo vision, itself based on human stereopsis.
