MVSNet: Depth Inference for Unstructured Multi-view Stereo (ECCV 2018)
Yao Yao¹, Zixin Luo¹, Shiwei Li¹, Tian Fang², and Long Quan¹
¹ The Hong Kong University of Science and Technology
Homepage: https://www.cse.ust.hk/~yyaoag/
Code: https://github.com/YoYo000/MVSNet
Further reading online:
https://blog.csdn.net/john_xia/article/details/88100410
Abstract
We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranks first before April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.
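The abstract says the initial depth map is "regressed" from the regularized cost volume but does not say how. One common choice, assumed here purely for illustration, is a soft argmin: turn the costs along the depth axis into probabilities with a softmax, then take the expected depth. The function name `soft_argmin_depth` and the `(D, H, W)` array layout are hypothetical, and NumPy stands in for a deep learning framework.

```python
import numpy as np

def soft_argmin_depth(cost_volume, depth_values):
    """Regress a depth map as the expectation over depth hypotheses.

    cost_volume:  array of shape (D, H, W); a lower cost means a better match.
    depth_values: array of shape (D,) holding the depth of each hypothesis plane.
    """
    cost_volume = np.asarray(cost_volume, dtype=np.float64)
    depth_values = np.asarray(depth_values, dtype=np.float64)
    # Softmax over the depth axis of the negated cost (low cost -> high probability).
    logits = -cost_volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    # Expected depth per pixel: sum_d depth_values[d] * prob[d, y, x].
    return np.tensordot(depth_values, prob, axes=(0, 0))
```

Because the output is a weighted average rather than a hard argmin, it stays differentiable and can take values between the sampled depth planes.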
1 Introduction
Traditional methods: they use hand-crafted similarity metrics and engineered regularizations (e.g., normalized cross correlation and semi-global matching [12]) to compute dense correspondences and recover 3D points.
Drawback: low-textured, specular, and reflective regions of the scene make dense matching intractable and thus lead to incomplete reconstructions.
Previous CNN-based methods:
Two-view stereo. Drawback: it fails to fully utilize the multi-view information, which leads to less accurate results.
SurfaceNet and Learned Stereo Machine (LSM). Drawback: both methods rely on a volumetric representation over regular grids, and the huge memory consumption of 3D volumes means either a low volume resolution or a long runtime.
Proposed method:
(1) It computes one depth map at a time, rather than the whole 3D scene at once.
(2) A differentiable homography warping operation builds the 3D cost volume from 2D image features and enables end-to-end training.
(3) A variance-based metric maps multiple features into one cost feature in the volume, so the network adapts to an arbitrary number of source images in the input.
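As a minimal sketch of point (3), assume the N feature volumes have already been warped into the reference frustum and stacked into a single array. The per-element variance across the view axis then yields one cost volume whose shape does not depend on N. The function name `variance_cost` and the `(N, D, H, W, C)` layout are illustrative assumptions, not taken from the MVSNet code base.

```python
import numpy as np

def variance_cost(feature_volumes):
    """Aggregate N warped feature volumes into one cost volume.

    feature_volumes: array-like of shape (N, D, H, W, C), one volume per view.
    Returns an array of shape (D, H, W, C): the variance across the N views.
    """
    volumes = np.asarray(feature_volumes, dtype=np.float64)
    mean = volumes.mean(axis=0)                    # average feature over views
    return ((volumes - mean) ** 2).mean(axis=0)    # mean squared deviation
```

Where all views agree on a feature, the variance is zero, which signals a good match at that depth hypothesis. Adding or removing views changes only the leading axis, so the later 3D convolutions see the same input shape.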
Differences between the proposed method and previous CNN-based methods:
(1) Our 3D cost volume is built upon the camera frustum instead of the regular Euclidean space.
(2) Our method decouples the MVS reconstruction into smaller problems of per-view depth map estimation, which makes large-scale reconstruction possible.
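To make the frustum-based cost volume concrete, the sketch below computes the homography induced by one fronto-parallel plane at a given depth in the reference camera. Sweeping the depth over a set of hypotheses gives one warp per slice of the frustum. The conventions are assumptions of this sketch: a source-camera point is `X_src = R_rel @ X_ref + t_rel`, the plane normal is the reference camera's principal axis, and the function names are hypothetical.

```python
import numpy as np

def plane_homography(K_ref, K_src, R_rel, t_rel, depth):
    """Homography mapping reference pixels to source pixels for the plane z = depth.

    For points on the plane n^T X_ref = depth (n = [0, 0, 1]):
        X_src = (R_rel + t_rel n^T / depth) X_ref
    so the pixel mapping is K_src (R_rel + t_rel n^T / depth) K_ref^{-1}.
    """
    n = np.array([0.0, 0.0, 1.0])
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)

def warp_pixel(H, uv):
    """Apply a 3x3 homography to one pixel (u, v) and dehomogenize."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]
```

In a network, the same mapping would be applied to a whole pixel grid, followed by differentiable bilinear sampling of the source feature map. That part is omitted here.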
2 Related work
By output representation, MVS methods fall into three categories:
(1) Direct point cloud reconstructions.
Drawback: point cloud propagation is difficult to parallelize fully and takes a long time.
(2) Volumetric reconstructions.
Drawback: space discretization error and high memory consumption.
(3) Depth map reconstructions.