THESIS
2019
xiv, 93 pages : illustrations ; 30 cm
Abstract
Multi-view stereo (MVS), which reconstructs 3D representations of a scene from imagery,
is a core problem of computer vision that has been studied extensively for decades. Traditionally, MVS algorithms
apply hand-crafted similarity metrics and engineered regularizations to compute dense
correspondences. While these methods have shown great results under ideal Lambertian scenarios,
classical MVS algorithms still suffer from numerous artifacts. In this thesis, we propose
to advance MVS reconstruction using recent deep learning techniques.
First, we present an end-to-end deep learning architecture, MVSNet, for depth map inference
from multi-view images. The key contribution of this part is the careful integration
between multi-view geometries and convolutional neural networks (CNNs). In the network, we
extract deep image features and build the 3D cost volume upon the camera frustum via differentiable
homography warping. Then, 3D convolutions are applied to regularize the cost volume and regress
the output depth map. We demonstrate on the DTU dataset that MVSNet significantly outperforms
previous state-of-the-art methods in both reconstruction completeness and overall quality.
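The homography warping mentioned above can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the NumPy implementation and the camera convention (X_src = R X_ref + t, with fronto-parallel sweep planes z = depth in the reference frame) are assumptions made for illustration. For a point on such a plane, the mapping from a reference pixel to a source pixel is the 3x3 homography K_src (R + t n^T / depth) K_ref^{-1} with n = (0, 0, 1).

```python
import numpy as np

def plane_sweep_homography(K_ref, K_src, R, t, depth):
    """Homography taking reference pixels to source pixels for the
    fronto-parallel plane z = depth in the reference camera frame.

    Assumed convention (not from the thesis): X_src = R @ X_ref + t.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal in the reference frame
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def warp_pixel(H, u, v):
    """Apply a homography to one pixel and dehomogenize the result."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]

# Toy setup: shared intrinsics, pure sideways translation between views.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])

# One homography per depth hypothesis; a learned pipeline would warp
# whole feature maps rather than single pixels.
for depth in (2.0, 4.0):
    H = plane_sweep_homography(K, K, R, t, depth)
    print(depth, warp_pixel(H, 50.0, 50.0))
```

In this toy setup the principal point (50, 50) maps to (45, 50) at depth 2 and to (47.5, 50) at depth 4, so the shift shrinks as the hypothesized depth grows. Because the mapping is built from matrix products only, it is differentiable with respect to the inputs, which is what allows such a warp to sit inside a trainable network.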
Next, we propose to extend the MVSNet architecture for large-scale MVS reconstruction.
One major limitation of current learning-based approaches is scalability: the memory-consuming
cost volume regularization makes learned MVS hard to apply to high-resolution
scenes. To this end, we sequentially regularize 2D cost maps via the gated recurrent
unit (GRU) rather than regularize the entire 3D cost volume in one go. The GRU regularization
dramatically reduces memory consumption and makes high-resolution reconstructions feasible.
The proposed R-MVSNet is evaluated on the large-scale Tanks and Temples dataset and
achieves results comparable to those of classical large-scale MVS algorithms.
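The memory argument above can be made concrete with a simplified sketch. It is an illustration under stated assumptions, not the R-MVSNet implementation: the GRU here acts independently at every pixel (equivalent to 1x1 convolutions), the weights are random, the depth is picked by a running winner-take-all, and the names random_params, gru_step and winner_take_all_depth are hypothetical. The point is only that the loop keeps a hidden state of size H x W x C while visiting the D cost maps one at a time, instead of holding a regularized D x H x W volume.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def random_params(rng, c_in, c_h):
    """Hypothetical helper: random GRU weights plus an output projection."""
    shapes = [(c_in, c_h), (c_h, c_h)] * 3  # Wz, Uz, Wr, Ur, Wh, Uh
    params = tuple(rng.normal(scale=0.5, size=s) for s in shapes)
    w_out = rng.normal(scale=0.5, size=c_h)
    return params, w_out

def gru_step(x, h, params):
    """One GRU update applied independently at every pixel.

    x: (H, W, c_in) cost map of one depth plane; h: (H, W, c_h) state.
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_cand

def winner_take_all_depth(cost_maps, params, w_out):
    """Visit D cost maps in order, keeping only the state and running best."""
    D, H, W, _ = cost_maps.shape
    h = np.zeros((H, W, params[1].shape[0]))
    best_score = np.full((H, W), -np.inf)
    best_index = np.zeros((H, W), dtype=int)
    for d in range(D):
        h = gru_step(cost_maps[d], h, params)
        score = h @ w_out                     # (H, W) score for plane d
        better = score > best_score
        best_score[better] = score[better]
        best_index[better] = d
    return best_index

rng = np.random.default_rng(0)
params, w_out = random_params(rng, c_in=3, c_h=4)
cost_maps = rng.normal(size=(6, 5, 7, 3))     # D=6 planes, 5x7 pixels
print(winner_take_all_depth(cost_maps, params, w_out).shape)
```

With this structure the working memory grows with the image size but not with the number of depth planes, which is the property the paragraph above relies on.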
Finally, we establish a large-scale synthetic MVS dataset, BlendedMVS, based on blended
images and rendered depth maps. While several MVS datasets have been proposed, they fail
to provide accurate depth and occlusion information, as ground-truth mesh models are usually
incomplete. We therefore establish a new MVS dataset based on model rendering. Textured
meshes are first reconstructed from images of different scenes, which are then rendered into
color images, depth maps and occlusion maps. We further blend rendered images with input
images using high-pass and low-pass filters to generate our training input. Extensive experiments
demonstrate that models trained on BlendedMVS achieve significantly better generalization
ability than models trained on other MVS datasets.
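The blending step can be sketched as follows. The abstract does not say which filter goes with which image, so this sketch assumes the high-pass filter is applied to the rendered image (keeping fine detail consistent with the rendered depth) and the low-pass filter to the input image (keeping its overall lighting). The Gaussian filter shape, the sigma value, the restriction to single-channel images and the function names are likewise assumptions for illustration.

```python
import numpy as np

def gaussian_low_pass(shape, sigma):
    """Gaussian low-pass mask in the frequency domain (1 at zero frequency)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.exp(-(fx ** 2 + fy ** 2) / (2.0 * sigma ** 2))

def blend(rendered, captured, sigma=0.05):
    """High frequencies from `rendered`, low frequencies from `captured`.

    Both inputs are 2D grayscale arrays of the same shape.
    """
    low = gaussian_low_pass(rendered.shape, sigma)
    high = 1.0 - low  # complementary high-pass mask
    spectrum = np.fft.fft2(rendered) * high + np.fft.fft2(captured) * low
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
rendered = rng.random((16, 16))
captured = rng.random((16, 16))
blended = blend(rendered, captured)
print(blended.shape)
```

Because the two masks sum to one at every frequency, blending an image with itself returns the image unchanged, and the mean brightness of the result always comes from the low-pass branch.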
In sum, this thesis presents a complete learning-based solution to large-scale multi-view
stereopsis, including a current baseline network (MVSNet), its large-scale extension (R-MVSNet)
and a large-scale synthetic dataset (BlendedMVS). We bridge the gap between classical MVS
reconstructions and recent deep learning techniques and demonstrate the effectiveness of the
learning-based MVS through extensive experiments on different datasets.