MVSNet and CVP-MVSNet: Paper Notes and Code Walkthrough

陈同学_alex

Article link: https://blog.csdn.net/qq_37394634/article/details/132804372

Background

Homography

The Concept of the Homography Matrix

The homography matrix $\boldsymbol{H}$ describes the mapping between two planes (here, two image planes observing a common scene plane). Consider a pair of matched feature points $p_{1}$ and $p_{2}$ in images $I_{1}$ and $I_{2}$ whose corresponding 3D point $\mathbf{P}$ lies on a plane satisfying:
$\boldsymbol{n}^{T} \boldsymbol{P}+d=0 .$
Rearranging gives:
$-\frac{\boldsymbol{n}^{T} \boldsymbol{P}}{d}=1$
Assume $\mathbf{P}$ is expressed in the camera coordinate frame of image $I_{1}$ and $\mathbf{R},\mathbf{t}$ is the relative pose from $I_{1}$ to $I_{2}$. Then, in homogeneous pixel coordinates (so every equality holds up to scale, with $\boldsymbol{p}_{1} \simeq \boldsymbol{K} \boldsymbol{P}$):
$\begin{aligned} \boldsymbol{p}_{2} & =\boldsymbol{K}(\boldsymbol{R P}+\boldsymbol{t}) \\ & =\boldsymbol{K}\left(\boldsymbol{R P}+\boldsymbol{t} \cdot\left(-\frac{\boldsymbol{n}^{T} \boldsymbol{P}}{d}\right)\right) \\ & =\boldsymbol{K}\left(\boldsymbol{R}-\frac{\boldsymbol{t} \boldsymbol{n}^{T}}{d}\right) \boldsymbol{P} \\ & =\boldsymbol{K}\left(\boldsymbol{R}-\frac{\boldsymbol{t} \boldsymbol{n}^{T}}{d}\right) \boldsymbol{K}^{-1} \boldsymbol{p}_{1} . \end{aligned} \tag{1}$
This is a transformation that directly relates the image coordinates $\boldsymbol{p}_{1}$ and $\boldsymbol{p}_{2}$. Denoting the middle part by $H$, we have
$\boldsymbol{p}_{2}=\boldsymbol{H} \boldsymbol{p}_{1} .$

If $\mathbf{P}$ is given in the world coordinate frame rather than in a camera frame, the relative $\mathbf{R},\mathbf{t}$ in Eq. (1) must be computed first. Let the two frames be frame 1 and frame $i$, with world-to-camera extrinsics $(R_{1}, \mathbf{t}_{1})$ and $(R_{i}, \mathbf{t}_{i})$. Then:
$\begin{aligned}\left(\begin{array}{cc}R & \mathbf{t} \\ 0 & 1\end{array}\right) &=\left(\begin{array}{cc}R_{i} & \mathbf{t}_{i} \\ 0 & 1\end{array}\right)\left(\begin{array}{cc}R_{1} & \mathbf{t}_{1} \\ 0 & 1\end{array}\right)^{-1} \\ \left(\begin{array}{cc}R_{1} & \mathbf{t}_{1} \\ 0 & 1\end{array}\right)^{-1} &=\left(\begin{array}{cc}R_{1}^{-1} & -R_{1}^{-1} \mathbf{t}_{1} \\ 0 & 1\end{array}\right) \\ \left(\begin{array}{cc}R & \mathbf{t} \\ 0 & 1\end{array}\right) &=\left(\begin{array}{cc}R_{i} & \mathbf{t}_{i} \\ 0 & 1\end{array}\right)\left(\begin{array}{cc}R_{1}^{-1} & -R_{1}^{-1} \mathbf{t}_{1} \\ 0 & 1\end{array}\right)=\left(\begin{array}{cc}R_{i} R_{1}^{-1} & \mathbf{t}_{i}-R_{i} R_{1}^{-1} \mathbf{t}_{1} \\ 0 & 1\end{array}\right)\end{aligned}$
Substituting these relative extrinsics into Eq. (1), and allowing different intrinsics $K_{1}$ and $K_{i}$ for the two cameras, gives (see also: Multi-View Stereo中的平面扫描(plane sweep) - 知乎 (zhihu.com)):
$H=K_{i}\left(R_{i} R_{1}^{-1}-\frac{\left(\mathbf{t}_{i}-R_{i} R_{1}^{-1} \mathbf{t}_{1}\right) \mathbf{n}_{1}^{T}}{d}\right) K_{1}^{-1} \tag{2}$
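
As a quick sanity check of Eq. (1), the following is a minimal NumPy sketch. The intrinsics, the relative pose, and the plane are hypothetical values chosen only for illustration; the function name `plane_homography` is likewise not from any library. It projects a point lying on the plane into both views and compares the second projection with the one predicted by the homography.

```python
import numpy as np

def plane_homography(K_src, K_ref, R, t, n, d):
    """H = K_src (R - t n^T / d) K_ref^{-1} for the plane n^T P + d = 0,
    with P expressed in the reference camera frame."""
    return K_src @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_ref)

# Hypothetical camera intrinsics (shared by both views, as in Eq. (1)).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical relative pose: a 5 degree rotation about the y axis plus a translation.
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.2, 0.0, 0.05])

# Fronto-parallel plane at depth 4 in the reference frame: n^T P + d = 0 with d = -4.
n = np.array([0.0, 0.0, 1.0])
d = -4.0

P = np.array([0.3, -0.2, 4.0])   # a 3D point on that plane
p1 = K @ P
p1 = p1 / p1[2]                  # pixel in view 1 (homogeneous, last entry 1)
p2 = K @ (R @ P + t)
p2 = p2 / p2[2]                  # pixel in view 2

H = plane_homography(K, K, R, t, n, d)
p2_from_h = H @ p1
p2_from_h = p2_from_h / p2_from_h[2]

homography_error = float(np.abs(p2_from_h - p2).max())
print("max abs difference:", homography_error)
```

Because the equality in Eq. (1) holds only up to scale, both pixel vectors are normalized by their last component before they are compared.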

Plane Sweeping

Plane Sweeping | 平面扫描 - 知乎 (zhihu.com)

MVSNet

MVSNet: Depth Inference for Unstructured Multi-view Stereo

ECCV2018

Abstract

We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset.

With simple post-processing, our method not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranks first before April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.

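Introduction

The paper proposes an end-to-end deep learning architecture for depth map inference that computes one depth map at a time rather than the whole 3D scene at once. The proposed network, MVSNet, takes one reference image and several source images as input and infers the depth map of the reference image. The key insight is the differentiable homography warping operation, which implicitly encodes camera geometry in the network to build the 3D cost volume from 2D image features and enables end-to-end training. To adapt to an arbitrary number of source images, the paper proposes a variance-based metric that maps multiple features into one cost feature in the volume. The cost volume then undergoes multi-scale 3D convolutions, and an initial depth map is regressed from it. Finally, the depth map is refined with the reference image to improve accuracy in boundary regions.

There are two main differences from previous learned approaches. First, for the purpose of depth map inference, the 3D cost volume is built upon the camera frustum instead of the regular Euclidean space. Second, the method decouples MVS reconstruction into smaller per-view depth map estimation problems, which makes large-scale reconstruction possible.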

MVSNet

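Image Features

The first step of MVSNet is to extract deep features $\left\{\mathbf{F}_{i}\right\}_{i=1}^{N}$ of the N input images $\left\{\mathbf{I}_{i}\right\}_{i=1}^{N}$ for dense matching. An eight-layer 2D CNN is used, where the strides of layers 3 and 6 are set to 2 to divide the feature tower into three scales. Within each scale, two convolutional layers extract higher-level image representations. Every convolutional layer except the last is followed by a batch normalization (BN) layer and a ReLU. As in common matching tasks, parameters are shared among all feature towers for efficient learning.

The output of the 2D network is N 32-channel feature maps, downsized by 4x in each dimension compared with the input images. Although the image frame is downsized after feature extraction, the original neighborhood information of each remaining pixel has already been encoded into the 32-channel pixel descriptor, which prevents dense matching from losing useful context. Compared with dense matching directly on the original images, the extracted feature maps significantly improve reconstruction quality.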

Cost Volume

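The next step is to build a 3D cost volume from the extracted feature maps and the input cameras. Whereas previous works divide space using regular grids, for the depth map inference task the paper builds the cost volume upon the reference camera frustum.

Denote $\mathbf{I}_{1}$ as the reference image, $\left\{\mathbf{I}_{i}\right\}_{i=2}^{N}$ the source images, and $\left\{\mathbf{K}_{i}, \mathbf{R}_{i}, \mathbf{t}_{i}\right\}_{i=1}^{N}$ the camera intrinsics, rotations and translations that correspond to the feature maps.

Differentiable Homography

All feature maps are warped into different fronto-parallel planes of the reference camera to form N feature volumes $\left\{\mathbf{V}_{i}\right\}_{i=1}^{N}$. In other words, the features extracted from each source image are warped by a homography into the coordinate frame of the reference image, yielding $\left\{\mathbf{V}_{i}\right\}_{i=1}^{N}$.

For depth $d$, the coordinate mapping from the warped feature map $\mathbf{V}_{i}(d)$ to the original feature map $\mathbf{F}_{i}$ is determined by the planar transformation $\mathbf{x}^{\prime} \sim \mathbf{H}_{i}(d) \cdot \mathbf{x}$, where '$\sim$' denotes projective equality and $\mathbf{H}_{i}(d)$ is the homography between the $i^{\text {th }}$ source feature map and the reference feature map at depth $d$. Let $\mathbf{n}_{1}$ be the principal axis of the reference camera (i.e., the plane normal). The homography is expressed by a $3 \times 3$ matrix: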
$\mathbf{H}_{i}(d)=\mathbf{K}_{i} \cdot \mathbf{R}_{i} \cdot\left(\mathbf{I}-\frac{\left(\mathbf{t}_{1}-\mathbf{t}_{i}\right) \cdot \mathbf{n}_{1}^{T}}{d}\right) \cdot \mathbf{R}_{1}^{T} \cdot \mathbf{K}_{1}^{T}$

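The formula above is reproduced from the paper, but it contains errors (the released code is correct); for example, the last factor must be $\mathbf{K}_{1}^{-1}$ rather than $\mathbf{K}_{1}^{T}$ for the matrix to map pixels to pixels. The correct form is Eq. (2) in the Background section. Note that Eq. (2) uses the plane convention $\boldsymbol{n}^{T} \boldsymbol{P}+d=0$, so a fronto-parallel plane at depth $z$ in the reference frame corresponds to $\mathbf{n}_{1}=(0,0,1)^{T}$ and $d=-z$ there.

The homography for the reference feature map $\mathbf{F}_{1}$ itself is the $3 \times 3$ identity matrix. The warping process is similar to classical plane sweeping stereo [5], except that differentiable bilinear interpolation is used to sample pixels from the feature maps $\left\{\mathbf{F}_{i}\right\}_{i=1}^{N}$ rather than from the images $\left\{\mathbf{I}_{i}\right\}_{i=1}^{N}$.

This warping operation is the core step that bridges the 2D feature extraction and the 3D regularization networks. It is implemented in a differentiable manner, which enables end-to-end training of depth map inference.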

Cost Metric

The cost metric aggregates the feature volumes $\left\{\mathbf{V}_{i}\right\}_{i=1}^{N}$ into one cost volume $\mathbf{C}$. To adapt to an arbitrary number of input views, the paper proposes a variance-based cost metric $\mathcal{M}$ for N-view similarity measurement. Let $W, H, D, F$ be the input image width, height, depth sample number and the channel number of the feature map, and $V=\frac{W}{4} \cdot \frac{H}{4} \cdot D \cdot F$ the feature volume size. The cost metric is then the mapping $\mathcal{M}: \underbrace{\mathbb{R}^{V} \times \cdots \times \mathbb{R}^{V}}_{N} \rightarrow \mathbb{R}^{V}$:
$\mathbf{C}=\mathcal{M}\left(\mathbf{V}_{1}, \cdots, \mathbf{V}_{N}\right)=\frac{\sum_{i=1}^{N}\left(\mathbf{V}_{i}-\overline{\mathbf{V}_{i}}\right)^{2}}{N}$
where $\overline{\mathbf{V}_{i}}$ is the average volume among all feature volumes, and all operations above are element-wise. The smaller the variance, the better the views agree at that depth, and hence the higher the confidence in that depth hypothesis.
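
To make the "arbitrary N" property concrete, here is a small NumPy sketch of the variance metric. The array layout `[N, F, D, H, W]` and the random inputs are assumptions made for illustration only; the actual network operates on warped feature volumes.

```python
import numpy as np

def variance_cost(feature_volumes):
    """Element-wise variance over the view axis.

    feature_volumes: array of shape [N, F, D, H, W] (assumed layout).
    Returns a cost volume of shape [F, D, H, W], independent of N.
    """
    mean_volume = feature_volumes.mean(axis=0)
    return ((feature_volumes - mean_volume) ** 2).mean(axis=0)

rng = np.random.default_rng(0)
volumes_3 = rng.normal(size=(3, 8, 4, 5, 6))   # 3 views
volumes_5 = rng.normal(size=(5, 8, 4, 5, 6))   # 5 views

cost_3 = variance_cost(volumes_3)
cost_5 = variance_cost(volumes_5)

# If every view carries identical features, the cost is zero everywhere.
identical_views = np.repeat(volumes_3[:1], 4, axis=0)
cost_identical = variance_cost(identical_views)

print(cost_3.shape, cost_5.shape, float(cost_identical.max()))
```

The output shape does not depend on the number of views, and perfectly consistent features give zero cost, which is what makes low variance a sign of a good depth hypothesis.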

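Most traditional MVS methods aggregate pairwise costs between the reference image and all source images in a heuristic way. Instead, the metric here follows the philosophy that all views should contribute equally to the matching cost, with no preference given to the reference image.

Recent work applies a mean operation with multiple CNN layers to infer multi-patch similarity. The variance operation is chosen here because the mean operation itself provides no information about feature differences, and that network requires pre- and post-CNN layers to help infer the similarity. In contrast, the variance-based cost metric explicitly measures the multi-view feature difference.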

Cost Volume Regularization

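The raw cost volume computed from image features may be contaminated by noise (e.g., due to non-Lambertian surfaces or occlusions) and should be combined with smoothness constraints to infer the depth map. The regularization step refines the cost volume $\mathbf{C}$ above to generate a probability volume $\mathbf{P}$ for depth inference.

Inspired by recent learning-based stereo [17] and MVS [14,15] methods, the paper applies a multi-scale 3D CNN for cost volume regularization. The four-scale network is similar to a 3D version of UNet [31], which uses an encoder-decoder structure to aggregate neighboring information from a large receptive field at relatively low memory and computation cost.

To further reduce computation, the 32-channel cost volume is reduced to 8 channels after the first 3D convolutional layer, and the convolutions within each scale are reduced from 3 layers to 2. The last convolutional layer outputs a 1-channel volume. Finally, a softmax operation is applied along the depth direction for probability normalization.

The resulting probability volume is highly desirable for depth map inference: it can be used not only for per-pixel depth estimation but also for measuring estimation confidence. The quality of a depth reconstruction can be readily determined by analyzing its probability distribution, which leads to a very concise yet effective outlier filtering strategy.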

Depth Map

Initial Estimation

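The simplest way to retrieve the depth map $\mathbf{D}$ from the probability volume $\mathbf{P}$ is pixel-wise winner-take-all (i.e., argmax). However, argmax cannot produce sub-pixel estimates, and it cannot be trained with back-propagation because it is not differentiable. Instead, the paper computes the expectation along the depth direction, i.e., the probability-weighted sum over all hypotheses:
$\mathbf{D}=\sum_{d=d_{\min }}^{d_{\max }} d \times \mathbf{P}(d)$
where $\mathbf{P}(d)$ is the probability estimate for all pixels at depth $d$. This operation is also referred to as soft argmin in [17]. It is fully differentiable and can approximate the argmax result. Although the depth hypotheses are uniformly sampled within $[d_{min}, d_{max}]$ during cost volume construction, the expectation can produce a continuous depth estimate. The output depth map (Fig. 2 (b)) has the same size as the 2D image feature maps, i.e., it is downsized by 4x in each dimension compared with the input images.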
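
The difference between winner-take-all and the expectation can be illustrated with a small NumPy sketch. The input `scores` stands in for the 1-channel output of the regularization network (higher means more likely); this sign convention, the `[D, H, W]` layout, and the toy depth values are assumptions for illustration.

```python
import numpy as np

def soft_argmin_depth(scores, depth_values):
    """Softmax along the depth axis followed by the expectation.

    scores: [D, H, W] regularized volume (assumed: higher = more likely).
    depth_values: [D] depth hypotheses.
    Returns (depth [H, W], prob [D, H, W]).
    """
    shifted = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(shifted)
    prob = prob / prob.sum(axis=0, keepdims=True)
    depth = (prob * depth_values[:, None, None]).sum(axis=0)
    return depth, prob

depth_values = np.array([1.0, 2.0, 3.0, 4.0])
# One pixel whose evidence is split equally between the hypotheses 2.0 and 3.0.
scores = np.array([0.0, 5.0, 5.0, 0.0]).reshape(4, 1, 1)

soft_depth, prob = soft_argmin_depth(scores, depth_values)
wta_depth = depth_values[scores.argmax(axis=0)]   # winner-take-all picks one hypothesis

print("soft argmin:", soft_depth[0, 0], "winner-take-all:", wta_depth[0, 0])
```

The expectation lands between the two discrete hypotheses, whereas argmax is forced onto one of the sampled depths.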

Probability Map

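The probability distribution along the depth direction also reflects the quality of the depth estimate. Although the multi-scale 3D CNN has a strong ability to regularize the probability into a single-modal distribution, for falsely matched pixels the distribution is scattered and cannot concentrate on one peak (see Fig. 2 (c)). Based on this observation, the quality of a depth estimate $\hat{d}$ is defined as the probability that the ground-truth depth lies within a small range around the estimate.

Since the depth hypotheses are discretely sampled along the camera frustum, the probability sum over the four nearest depth hypotheses is simply taken to measure estimation quality. Other statistical measures, such as standard deviation or entropy, could also be used, but in the experiments they brought no significant improvement for depth map filtering. Moreover, the probability-sum formulation gives better control over the threshold parameter for outlier filtering.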
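
One possible way to realize the "sum over the four nearest hypotheses" is sketched below in NumPy. The choice of window (one hypothesis below the expected index and two above it, after flooring) is an assumption of this sketch, not something the text above specifies, and `photometric_confidence` is a hypothetical name.

```python
import numpy as np

def photometric_confidence(prob):
    """Sum the probabilities of four hypotheses around the expected depth index.

    prob: [D, H, W], normalized along axis 0.
    Window used here (assumption): indices idx-1, idx, idx+1, idx+2,
    where idx is the floored expected index.
    """
    num_depth = prob.shape[0]
    indices = np.arange(num_depth, dtype=np.float64)[:, None, None]
    expected_index = np.floor((prob * indices).sum(axis=0)).astype(int)   # [H, W]
    confidence = np.zeros(prob.shape[1:])
    for offset in (-1, 0, 1, 2):
        idx = expected_index + offset
        valid = (idx >= 0) & (idx < num_depth)          # ignore out-of-range neighbors
        idx_clipped = np.clip(idx, 0, num_depth - 1)
        picked = np.take_along_axis(prob, idx_clipped[None], axis=0)[0]
        confidence += np.where(valid, picked, 0.0)
    return confidence

peaked = np.array([0.01, 0.01, 0.02, 0.90, 0.02, 0.02, 0.01, 0.01])
scattered = np.full(8, 1.0 / 8.0)
prob = np.stack([peaked, scattered], axis=1).reshape(8, 1, 2)   # two pixels

confidence = photometric_confidence(prob)
print(confidence)
```

A pixel with a single sharp peak receives a confidence close to 1, while a pixel with a scattered distribution receives a much lower value, so a simple threshold separates the two.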

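Depth Map Refinement

Although the depth map retrieved from the probability volume is a qualified output, the reconstruction boundaries may be oversmoothed because of the large receptive field involved in regularization, similar to the problems in semantic segmentation [4] and image matting [37].

Since the reference image of a natural scene contains boundary information, it is used as guidance to refine the depth map. Inspired by a recent image matting algorithm [37], a depth residual learning network is applied at the end of MVSNet. The initial depth map and the resized reference image are concatenated into a 4-channel input, which is passed through three 32-channel 2D convolutional layers followed by one 1-channel convolutional layer to learn the depth residual. The initial depth map is then added back to generate the refined depth map. The last layer contains no BN layer or ReLU unit so that negative residuals can be learned. In addition, to prevent bias at a particular depth scale, the initial depth magnitude is pre-scaled to the range [0, 1] and converted back after refinement.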

Loss

The losses for both the initial depth map and the refined depth map are considered. The mean absolute difference between the ground truth depth map and the estimated depth map is used as the training loss. Since the ground truth (GT) depth maps are not always complete over the whole image (see Sect. 4.1 of the paper), only pixels with valid GT labels are considered:
$\text { Loss }=\sum_{p \in \mathbf{p}_{\text {valid }}} \underbrace{\left\|d(p)-\hat{d}_{i}(p)\right\|_{1}}_{\text {Loss } 0}+\lambda \cdot \underbrace{\left\|d(p)-\hat{d}_{r}(p)\right\|_{1}}_{\text {Loss } 1}$
where $\mathbf{p}_{\text {valid }}$ denotes the set of valid ground truth pixels, $d(p)$ the ground truth depth value of pixel $p$, $\hat{d}_{i}(p)$ the initial depth estimation and $\hat{d}_{r}(p)$ the refined depth estimation. The parameter $\lambda$ is set to 1.0 in experiments.
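
A minimal NumPy sketch of this masked loss follows. Two things are assumptions of the sketch: invalid ground-truth pixels are encoded as 0, and the sum is divided by the number of valid pixels to obtain the "mean absolute difference" mentioned above.

```python
import numpy as np

def mvsnet_loss(depth_gt, depth_init, depth_refined, lam=1.0):
    """Masked L1 loss on the initial and refined depth maps.

    Assumptions: depth_gt == 0 marks pixels without a valid label, and each
    term is averaged over the valid pixels.
    """
    valid = depth_gt > 0
    num_valid = max(int(valid.sum()), 1)          # avoid division by zero
    loss0 = np.abs(depth_gt - depth_init)[valid].sum() / num_valid
    loss1 = np.abs(depth_gt - depth_refined)[valid].sum() / num_valid
    return loss0 + lam * loss1

depth_gt = np.array([[2.0, 0.0],
                     [4.0, 1.0]])                 # one pixel has no label
depth_init = depth_gt + 0.5
depth_refined = depth_gt - 0.1
depth_init[0, 1] = 100.0                          # garbage at the invalid pixel is ignored

loss = mvsnet_loss(depth_gt, depth_init, depth_refined)
print("loss:", loss)
```

The large error placed at the unlabeled pixel does not affect the result, which is the purpose of restricting the sum to $\mathbf{p}_{\text{valid}}$.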

Point-Based Multi-View Stereo Network

ICCV2019

Abstract

We introduce Point-MVSNet, a novel point-based deep framework for multi-view stereo (MVS). Distinct from existing cost volume approaches, our method directly processes the target scene as point clouds. More specifically, our method predicts the depth in a coarse-to-fine manner. We first generate a coarse depth map, convert it into a point cloud and refine the point cloud iteratively by estimating the residual between the depth of the current iteration and that of the ground truth. Our network leverages 3D geometry priors and 2D texture information jointly and effectively by fusing them into a feature-augmented point cloud, and processes the point cloud to estimate the 3D flow for each point. This point-based architecture allows higher accuracy, more computational efficiency and more flexibility than cost-volume-based counterparts. Experimental results show that our approach achieves a significant improvement in reconstruction quality compared with state-of-the-art methods on the DTU and the Tanks and Temples dataset. Our source code and trained models are available at https://github.com/callmeray/PointMVSNet

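Introduction

This work proposes a novel point cloud multi-view stereo network in which the target scene is processed directly as a point cloud, a more efficient representation, particularly when the 3D resolution is high. The framework consists of two steps:

First, to carve out the approximate object surface from the whole scene, an initial coarse depth map is generated from a relatively small 3D cost volume and then converted to a point cloud.
Next, the PointFlow module is applied to iteratively regress an accurate and dense point cloud from the initial point cloud.

Similar to ResNet, PointFlow is explicitly formulated to predict the residual between the depth of the current iteration and the ground truth depth. The 3D flow is estimated from geometry priors inferred from the predicted point cloud and from 2D image appearance cues dynamically fetched from the multi-view input images (Fig. 1).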

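Compared with previous MVS methods, which are built on a predefined 3D volume with a fixed resolution for aggregating information from the views, the Point-MVSNet framework has advantages in accuracy, efficiency, and flexibility.

The method adaptively samples potential surface points in 3D space. It naturally keeps the continuity of the surface structure, which is necessary for high-precision reconstruction.

Furthermore, because the network only processes valid information near the object surface instead of the whole 3D space as 3D CNNs do, computation is much more efficient.

Lastly, the adaptive refinement scheme allows first peeking at the scene at coarse resolution and then densifying the reconstructed point cloud only in regions of interest. For scenarios such as interaction-oriented robot vision, this flexibility saves computation.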

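Method

The method can be divided into two steps: coarse depth prediction and iterative depth refinement. Let $\mathbf{I}_0$ denote the reference image and $\{\mathbf{I}_i \}_{i=1}^N$ its neighboring source images.

First, a coarse depth map of $\mathbf{I}_0$ is generated. Because the resolution is low, existing volumetric MVS methods are efficient enough to be used here.

Second, 2D-3D feature lifting is introduced, which associates 2D image information with 3D geometry priors.

Third, the novel PointFlow module iteratively refines the input depth map to a higher resolution with improved accuracy.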

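Coarse Depth Prediction

Recently, learning-based MVS [12,29,11] has achieved state-of-the-art performance using multi-scale 3D CNNs for cost volume regularization. However, this step can be very expensive, because memory requirements grow cubically with the cost volume resolution. Taking memory and time into consideration, the recently proposed MVSNet [29] is used to predict a relatively low-resolution cost volume.

Given the images and the corresponding camera intrinsics and extrinsics, MVSNet [29] builds a 3D cost volume upon the reference camera frustum. The initial depth map of the reference view is then regressed through multi-scale 3D CNNs and the soft argmin [15] operation. In MVSNet, feature maps are downsampled to 1/4 of the original input image in each dimension, and the number of virtual depth planes is 256 for both training and evaluation. In the coarse depth estimation network here, by contrast, the cost volume is built from feature maps at 1/8 of the reference image size and contains 48 or 96 virtual depth planes for training and evaluation, respectively. As a result, the memory usage of this 3D feature volume is about 1/20 of that in MVSNet.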
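
The "about 1/20" figure can be checked with a few lines of arithmetic. The sketch below only counts cost volume cells and assumes the same number of feature channels and the same data type in both networks, which the text above does not state explicitly.

```python
def relative_volume_size(downsample_factor, num_depth_planes):
    """Cost volume cell count relative to one full-resolution depth plane."""
    return num_depth_planes / (downsample_factor ** 2)

mvsnet_size = relative_volume_size(4, 256)        # 1/4 resolution, 256 planes
coarse_train_size = relative_volume_size(8, 48)   # 1/8 resolution, 48 planes
coarse_eval_size = relative_volume_size(8, 96)    # 1/8 resolution, 96 planes

train_ratio = coarse_train_size / mvsnet_size
eval_ratio = coarse_eval_size / mvsnet_size

print(f"training:   {train_ratio:.4f} (about 1/{1 / train_ratio:.0f})")
print(f"evaluation: {eval_ratio:.4f} (about 1/{1 / eval_ratio:.0f})")
```

Under these assumptions the 48-plane setting gives a ratio close to 1/20, while the 96-plane evaluation setting is roughly twice as large.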

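2D-3D Feature Lifting

Image Feature Pyramid

Learning-based image features have been shown to be critical for improving the quality of dense pixel correspondences. To endow points with a larger receptive field of contextual information at multiple scales, a 3-scale feature pyramid is constructed. 2D convolutional networks with stride 2 are applied to downsample the feature maps, and the last layer before each downsampling is extracted to build the final feature pyramid $\mathbf{F}_{i}=\left[\mathbf{F}_{i}^{1}, \mathbf{F}_{i}^{2}, \mathbf{F}_{i}^{3}\right]$ for image $\mathbf{I}_{i}$. As in common MVS methods, the feature pyramid network is shared among all input images.

Dynamic Feature Fetching

The point feature used in the network consists of the variance of the image features fetched from the multi-view images, together with the normalized 3D coordinates $\mathbf{X}_p$ in world space.

Given the corresponding camera parameters, the image appearance feature of each 3D point is fetched from the multi-view feature maps using differentiable unprojection. Because the features $\mathbf{F}_{i}^{1}, \mathbf{F}_{i}^{2}, \mathbf{F}_{i}^{3}$ are at different image resolutions, the camera intrinsic matrix must be scaled at each level of the feature maps for correct feature warping.

As in MVSNet, a variance-based cost metric, i.e., the feature variance among different views, is kept to aggregate features warped from an arbitrary number of views. For pyramid feature level $j$, the variance metric for N views is defined as:
$\mathbf{C}^{j}=\frac{\sum_{i=1}^{N}\left(\mathbf{F}_{i}^{j}-\overline{\mathbf{F}^{j}}\right)^{2}}{N},(j=1,2,3) \tag{1}$
To form the features residing at each 3D point, the fetched image features and the normalized point coordinates are concatenated:
$\mathbf{C}_{p}=\operatorname{concat}\left[\mathbf{C}_{p}^{j}, \mathbf{X}_{p}\right],(j=1,2,3)$
This feature-augmented point $\mathbf{C}_{p}$ is the input to the PointFlow module.

As shown in the next section, because depth residuals are predicted iteratively, the point position $\mathbf{X}_{p}$ is updated after each iteration and the point feature $\mathbf{C}_{p}^k$ is fetched again from the image feature pyramid; this operation is called dynamic feature fetching. This step differs from cost-volume-based methods, in which the feature fetched at each voxel is determined by a fixed space partition of the scene. In contrast, this method can fetch features from different areas of the images dynamically, according to the updated point positions. It can therefore concentrate on regions of interest in the feature maps instead of treating them uniformly.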

PointFlow

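The depth map generated in Sect. 3.1 of the paper (coarse depth prediction) has limited accuracy because of the low spatial resolution of the 3D cost volume. PointFlow is proposed to iteratively refine the depth map. With known camera parameters, the depth map is first unprojected into a 3D point cloud. For each point, the goal is to estimate its displacement to the ground truth surface along the reference camera direction by observing its neighboring points from all views, so as to push the points to flow to the target surface.

Point Hypotheses Generation

It is non-trivial to regress the depth displacement of each point from the extracted image feature maps. Because of perspective transformation, the spatial context embedded in 2D feature maps cannot reflect proximity in 3D Euclidean space. To facilitate network modeling, a sequence of point hypotheses $\mathbf{\tilde{p}}$ with different displacements along the reference camera direction is generated, as shown in Fig. 3. Let $\mathbf{t}$ denote the normalized reference camera direction and $s$ the displacement step size. For an unprojected point $\mathbf{{p}}$, its hypothesized point set $\{ \mathbf{\tilde{p}}_k \}$ is:

$\tilde{\mathbf{p}}_{k}=\mathbf{p}+k s \mathbf{t}, \quad k=-m, \ldots, m$
These point hypotheses are critical for the network to infer the displacement, because the neighborhood image feature information and the spatial geometry relations needed at different depths are gathered at these points.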
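
The hypothesis set can be generated with a single broadcast, as in the NumPy sketch below. The point coordinates, the camera direction, and the step size are hypothetical values, and a single direction vector shared by all points is assumed, following the definition of $\mathbf{t}$ above.

```python
import numpy as np

def point_hypotheses(points, direction, step, m):
    """p_k = p + k * s * t for k = -m, ..., m.

    points: [P, 3] unprojected points, direction: [3] reference camera direction
    (normalized here), step: displacement step s.
    Returns an array of shape [P, 2m + 1, 3].
    """
    t_unit = direction / np.linalg.norm(direction)
    k = np.arange(-m, m + 1)
    return points[:, None, :] + (k * step)[None, :, None] * t_unit[None, None, :]

points = np.array([[0.1, 0.2, 5.0],
                   [-0.3, 0.0, 6.0]])     # hypothetical unprojected points
direction = np.array([0.0, 0.0, 2.0])     # hypothetical camera direction (not yet unit length)
step, m = 0.05, 2

hypotheses = point_hypotheses(points, direction, step, m)
print(hypotheses.shape)
print(hypotheses[0])
```

The middle hypothesis ($k = 0$) is the unprojected point itself, and consecutive hypotheses are exactly one step $s$ apart along $\mathbf{t}$.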

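Edge Convolution

Classical MVS methods have demonstrated that local neighborhoods are important for robust depth prediction. Likewise, the strategy of the recent work DGCNN [28] is adopted to enrich feature aggregation between neighboring points. As shown in Fig. 3, a directed graph is constructed on the point set using the k nearest neighbors (kNN), so that local geometric structure information can be used for point feature propagation.

Denote the feature-augmented point cloud by $\mathbf{C}_{\tilde{p}}=$ $\left\{\mathbf{C}_{\tilde{p}_{1}}, \ldots, \mathbf{C}_{\tilde{p}_{n}}\right\}$; edge convolution is then defined as:
$\mathbf{C}_{\tilde{p}}^{\prime}=\underset{q \in k N N(\tilde{p})}{\square} h_{\Theta}\left(\mathbf{C}_{\tilde{p}}, \mathbf{C}_{\tilde{p}}-\mathbf{C}_{q}\right)$
where $h_{\Theta}$ is a learnable non-linear function parameterized by $\Theta$, and $\square$ is a channel-wise symmetric aggregation operation. There are multiple options for the symmetric operation, including max pooling, average pooling, and weighted sum. Max pooling and average pooling were compared, and similar performance was observed after carefully tuning the hyperparameters.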
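
The following NumPy sketch shows one simple instance of this operator. Several choices are assumptions made for illustration: $h_{\Theta}$ is a single linear layer followed by ReLU with random weights, the kNN graph is built from Euclidean distances between point coordinates (excluding the point itself), and $\square$ is max pooling. The helper names are hypothetical.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest neighbors of each point (self excluded). coords: [P, 3]."""
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)
    return np.argsort(dist2, axis=1)[:, :k]

def edge_conv(features, coords, weight, bias, k):
    """features: [P, C], coords: [P, 3], weight: [2C, C_out], bias: [C_out]."""
    neighbors = knn_indices(coords, k)                          # [P, k]
    center = np.repeat(features[:, None, :], k, axis=1)        # C_p, repeated per edge
    difference = center - features[neighbors]                  # C_p - C_q
    edge_input = np.concatenate([center, difference], axis=-1) # [P, k, 2C]
    edge_features = np.maximum(edge_input @ weight + bias, 0.0)  # h_theta: linear + ReLU
    return edge_features.max(axis=1)                           # channel-wise max over neighbors

rng = np.random.default_rng(0)
num_points, channels, out_channels, k = 10, 4, 16, 3
coords = rng.normal(size=(num_points, 3))
features = rng.normal(size=(num_points, channels))
weight = rng.normal(size=(2 * channels, out_channels))
bias = np.zeros(out_channels)

output = edge_conv(features, coords, weight, bias, k)

# A symmetric aggregation makes the result independent of the point ordering.
perm = rng.permutation(num_points)
output_permuted = edge_conv(features[perm], coords[perm], weight, bias, k)
print(output.shape, np.allclose(output_permuted, output[perm]))
```

Because max pooling is symmetric, reordering the points only reorders the rows of the output, which is the property the aggregation $\square$ is meant to guarantee.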

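Flow Prediction

The network architecture for flow prediction is shown in Fig. 4. The input is the feature-augmented point cloud and the output is the depth residual map. Three EdgeConv layers aggregate point features at different neighborhood scales. Shortcut connections combine all EdgeConv outputs as local point features. Finally, a shared MLP transforms the point features and, through a softmax over the point hypotheses of each unprojected point, outputs one probability scalar per hypothesis. The displacement of an unprojected point is predicted as the probability-weighted sum of the displacements of all its point hypotheses:
$\Delta d_{p}=\mathbf{E}(k s)=\sum_{k=-m}^{m} k s \times \operatorname{Prob}\left(\tilde{\mathbf{p}}_{k}\right)$
Note that this operation is differentiable. The output depth residual map is obtained by projecting the displacement back, and it is added to the initial input depth map for depth refinement.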

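Iterative Refinement with Upsampling

Thanks to the flexibility of the point-based architecture, flow prediction can be performed iteratively, which is much harder for cost-volume-based methods because the space partition is fixed once the cost volume is built. A depth map $\mathbf{D}^{(i)}$ from the coarse prediction or from a previous residual prediction is first upsampled to a higher spatial resolution with nearest-neighbor interpolation, and flow prediction is then performed to obtain $\mathbf{D}^{(i+1)}$. Moreover, the depth interval $s$ between the unprojected points and the hypothesized points is decreased at each iteration, so that more accurate displacements can be predicted by capturing more detailed features from closer point hypotheses.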

Training loss

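As in most deep MVS networks, the problem is treated as a regression task and the network is trained with the L1 loss, which measures the absolute difference between the predicted depth map and the ground truth depth map. The losses for the initial depth map and for the iteratively refined ones are all considered:
$\text { Loss }=\sum_{i=0}^{l}\left(\frac{\lambda^{(i)}}{s^{(i)}} \sum_{p \in \mathbf{P}_{\text {valid }}}\left\|\mathbf{D}_{\mathrm{GT}}(p)-\mathbf{D}^{(i)}(p)\right\|_{1}\right)$
where $\mathbf{P}_{\text {valid }}$ denotes the set of valid ground truth pixels and $l$ is the iteration number. The weight $\lambda^{(i)}$ is set to 1.0 in training.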

CVP-MVSNet

CVPR2020

Abstract

We propose a cost volume-based neural network for depth inference from multi-view images. We demonstrate that building a cost volume pyramid in a coarse-to-fine manner instead of constructing a cost volume at a fixed resolution leads to a compact, lightweight network and allows us inferring high resolution depth maps to achieve better reconstruction results. To this end, we first build a cost volume based on uniform sampling of fronto-parallel planes across the entire depth range at the coarsest resolution of an image. Then, given current depth estimate, we construct new cost volumes iteratively on the pixelwise depth residual to perform depth map refinement. While sharing similar insight with Point-MVSNet as predicting and refining depth iteratively, we show that working on cost volume pyramid can lead to a more compact, yet efficient network structure compared with the Point-MVSNet on 3D points. We further provide detailed analyses of the relation between (residual) depth sampling and image resolution, which serves as a principle for building compact cost volume pyramid. Experimental results on benchmark datasets show that our model can perform 6x faster and has similar performance as state-of-the-art methods. Code is available at https://github.com/JiayuYANG/CVP-MVSNet.

Introduction

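To obtain a computationally efficient network, Point-Based Multi-View Stereo Network works on 3D point clouds and iteratively predicts the depth residual along visual rays, using edge convolutions on the k nearest neighbors of each 3D point. While this approach is efficient, its run-time increases almost linearly with the number of iteration levels.

This work proposes a cost volume pyramid based multi-view stereo network (CVP-MVSNet) for depth inference.

First, an image pyramid is built for each input image.
Then, for the coarsest resolution of the reference image, a compact cost volume is built by sampling depth across the entire depth range of the scene.
After that, at the next pyramid level, a residual depth search is performed in the neighborhood of the current depth estimate to construct a partial cost volume, which is regularized with a multi-scale 3D CNN.

Because these cost volumes are built iteratively with a short search range at each level, the resulting network is small and compact. As a result, the network runs 6x faster than current state-of-the-art networks on benchmark datasets.

Although this paper shares a similar insight with [5], namely predicting and refining depth maps in a coarse-to-fine manner, it differs in the following four main aspects:

First, Point-based MVSNet performs convolution on a 3D point cloud. Instead, this method builds cost volumes on a regular grid defined on the image coordinates, which is faster at run-time.
Second, the paper provides a principle for building a compact cost volume pyramid, based on the correlation between depth sampling and image resolution.
Third, multi-scale 3D-CNN regularization is used to cover a large receptive field and to encourage local smoothness of the residual depth estimates, which leads to better accuracy, as shown in Fig. 1.
Finally, in contrast to other related works, the method can output a small-resolution depth map from a small-resolution image.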

Method

The overall system is shown in Fig. 2. Denote the reference image as $\mathbf{I}_{0} \in \mathbb{R}^{H \times W}$, where $H$ and $W$ define its dimensions. Let $\left\{\mathbf{I}_{i}\right\}_{i=1}^{N}$ be its $N$ neighboring source images. Assume $\left\{\mathbf{K}_{i}, \mathbf{R}_{i}, \mathbf{t}_{i}\right\}_{i=0}^{N}$ are the corresponding camera intrinsics, rotation matrix, and translation vector for all views.

The goal is to infer the depth map $\mathbf{D}$ for $\mathbf{I}_{0}$ from $\left\{\mathbf{I}_{i}\right\}_{i=0}^{N}$. The key novelty of the method is a feed-forward deep network on a cost volume pyramid that is constructed in a coarse-to-fine manner.

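Feature Pyramid

Because raw images vary with illumination changes, learnable features are adopted, which have been shown to be a crucial step for extracting dense feature correspondences [42,38]. The common practice in existing work is to extract multi-scale image features from high-resolution images even when the output is a low-resolution depth map. By contrast, the paper shows that a low-resolution image contains enough information for estimating a low-resolution depth map.

The feature extraction pipeline consists of two steps, see Fig. 2.

Building the image pyramid: First, we build the $(L + 1)$ -level image pyramid $\left\{\mathbf{I}_{i}^{j}\right\}_{j=0}^{L}$ for each input image, $i \in\{0,1, \cdots, N\}$ , where the bottom level of the pyramid corresponds to the input image, $\mathbf{I}_{i}^{0}=\mathbf{I}_{i}$ .
Feature extraction: Second, we obtain feature representations at the $l$ -th level using a CNN, namely feature extraction network. Specifically, it consists of 9 convolutional layers, each of which is followed by a leaky rectified linear unit (Leaky-ReLU). We use the same CNN to extract features for all levels in all the images. We denote the feature maps for a given level $l$ by $\left\{\mathbf{f}_{i}^{l}\right\}_{i=0}^{N}, \mathbf{f}_{i}^{l} \in \mathbb{R}^{H / 2^{l} \times W / 2^{l} \times F}$ , where $F = 16$ is the number of feature channels used in our experiments.

Compared with existing works, this feature extraction pipeline significantly reduces memory requirements while improving performance.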

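The official implementation is as follows:

import torch.nn as nn

# Feature pyramid.
# NOTE: `conv` is a helper defined elsewhere in the official repository and is not shown here.
class FeaturePyramid(nn.Module):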
    def __init__(self):
        super(FeaturePyramid, self).__init__()
        self.conv0aa = conv(3, 64, kernel_size=3, stride=1)
        self.conv0ba = conv(64,64, kernel_size=3, stride=1)
        self.conv0bb = conv(64,64, kernel_size=3, stride=1)
        self.conv0bc = conv(64,32, kernel_size=3, stride=1)
        self.conv0bd = conv(32,32, kernel_size=3, stride=1)
        self.conv0be = conv(32,32, kernel_size=3, stride=1)
        self.conv0bf = conv(32,16, kernel_size=3, stride=1)
        self.conv0bg = conv(16,16, kernel_size=3, stride=1)
        self.conv0bh = conv(16,16, kernel_size=3, stride=1)

    def forward(self, img, scales=5):
        fp = []
        f = self.conv0aa(img)
        f = self.conv0bh(self.conv0bg(self.conv0bf(self.conv0be(self.conv0bd(self.conv0bc(self.conv0bb(self.conv0ba(f))))))))
        fp.append(f)
        for scale in range(scales-1):
            img = nn.functional.interpolate(img,scale_factor=0.5,mode='bilinear',align_corners=None).detach()
            f = self.conv0aa(img)
            f = self.conv0bh(self.conv0bg(self.conv0bf(self.conv0be(self.conv0bd(self.conv0bc(self.conv0bb(self.conv0ba(f))))))))
            fp.append(f)

        return fp

Cost Volume Pyramid

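Given the extracted features, the next step is to construct cost volumes for depth inference in the reference view. Common approaches usually build a single cost volume at a fixed resolution [16,42,43], which incurs large memory requirements and thus limits the use of high-resolution images. Instead, the paper builds a cost volume pyramid, a process that iteratively estimates and refines depth maps to achieve high-resolution depth inference.

First, a cost volume for coarse depth map estimation is built from the coarsest-resolution images in the image pyramids and from uniform sampling of fronto-parallel planes in the scene.
Then, partial cost volumes are constructed iteratively from the coarse estimate and depth residual hypotheses to obtain depth maps with higher resolution and accuracy.

Cost Volume for Coarse Depth Map Inference.

To start, a cost volume is built at the $L$-th level, which corresponds to the lowest image resolution $\left(H / 2^{L}, W / 2^{L}\right)$. Assume the depth measured at the reference view of the scene ranges from $d_{\text {min }}$ to $d_{\max }$. The cost volume for the reference view is constructed by uniformly sampling $M$ fronto-parallel planes across the entire depth range. A sampled depth $d=d_{\min }+m\left(d_{\max }-d_{\min }\right) / M, m \in\{0,1,2, \cdots, M-1\}$ represents a plane whose normal $\mathbf{n}_{0}$ is the principal axis of the reference camera.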
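
The sampling rule is short enough to write down directly. In the NumPy sketch below the depth range and the number of planes are hypothetical values chosen so that the interval is a round number.

```python
import numpy as np

def uniform_depth_hypotheses(d_min, d_max, num_planes):
    """d = d_min + m * (d_max - d_min) / M for m = 0, ..., M - 1."""
    m = np.arange(num_planes)
    return d_min + m * (d_max - d_min) / num_planes

# Hypothetical depth range and plane count.
depth_hypos = uniform_depth_hypotheses(400.0, 880.0, 48)
print(depth_hypos[:3], depth_hypos[-1])
```

Note that with $m \le M-1$ the last sampled plane lies one interval before $d_{\max}$ rather than exactly on it.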

Similar to MVSNet, the differentiable homography $\mathbf{H}_{i}(d)$ between the $i$-th source view and the reference view at depth $d$ is defined as
$\mathbf{H}_{i}(d)=\mathbf{K}_{i}^{L} \mathbf{R}_{i}\left(\mathbf{I}-\frac{\left(\mathbf{t}_{0}-\mathbf{t}_{i}\right) \mathbf{n}_{0}^{T}}{d}\right) \mathbf{R}_{0}^{-1}\left(\mathbf{K}_{0}^{L}\right)^{-1},$
where $\mathbf{I}$ is the identity matrix, and $\mathbf{K}_{i}^{L}$ and $\mathbf{K}_{0}^{L}$ are the scaled intrinsic matrices of $\mathbf{K}_{i}$ and $\mathbf{K}_{0}$ at level $L$ .

Each homography $\mathbf{H}_{i}(d)$ suggests a possible pixel correspondence between $\tilde{\mathbf{x}}_{i}$ in source view $i$ and a pixel $\mathbf{x}$ in the reference view. This correspondence is defined as $\lambda_{i} \tilde{\mathbf{x}}_{i}=\mathbf{H}_{i}(d) \mathbf{x}$, where $\lambda_{i}$ denotes the depth of $\tilde{\mathbf{x}}_{i}$ in source view $i$.

Given $\tilde{\mathbf{x}}_{i}$ and $\left\{\mathbf{f}_{i}^{L}\right\}_{i=1}^{N}$, differentiable bilinear interpolation is used to reconstruct the feature maps warped to the reference view, $\left\{\tilde{\mathbf{f}}_{i, d}^{L}\right\}_{i=1}^{N}$. The cost for all pixels at depth $d$ is defined as its variance of features from $N + 1$ views,
$\mathbf{C}_{d}^{L}=\frac{1}{(N+1)} \sum_{i=0}^{N}\left(\tilde{\mathbf{f}}_{i, d}^{L}-\overline{\mathbf{f}}_{d}^{L}\right)^{2}$
where $\tilde{\mathbf{f}}_{0, d}^{L}=\mathbf{f}_{0}^{L}$ is the feature map of the reference view and $\overline{\mathbf{f}}_{d}^{L}$ is the average of the feature volumes across all views $\left(\left\{\tilde{\mathbf{f}}_{i, d}^{L}\right\}_{i=1}^{N} \cup \mathbf{f}_{0}^{L}\right)$ for each pixel. This metric encourages the correct depth of each pixel to have the smallest feature variance, which corresponds to the photometric consistency constraint. We compute the cost map for each depth hypothesis and concatenate those cost maps to a single cost volume $\mathbf{C}^{L} \in \mathbb{R}^{W / 2^{L} \times H / 2^{L} \times M \times F}$.

A key parameter for good depth estimation accuracy is the depth sampling resolution $M$. Sect. 3.3 of the paper shows how to determine the interval for depth sampling and coarse depth estimation.

Code for the homography warping:

import torch
import torch.nn.functional as F

def homo_warping(src_feature, ref_in, src_in, ref_ex, src_ex, depth_hypos):
    # Apply homography warping on one src feature map from src to ref view.

    batch, channels = src_feature.shape[0], src_feature.shape[1]
    num_depth = depth_hypos.shape[1]
    height, width = src_feature.shape[2], src_feature.shape[3]

    with torch.no_grad():
        src_proj = torch.matmul(src_in,src_ex[:,0:3,:])
        ref_proj = torch.matmul(ref_in,ref_ex[:,0:3,:])
        last = torch.tensor([[[0,0,0,1.0]]]).repeat(len(src_in),1,1).cuda()
        src_proj = torch.cat((src_proj,last),1)
        ref_proj = torch.cat((ref_proj,last),1)

        proj = torch.matmul(src_proj, torch.inverse(ref_proj))
        rot = proj[:, :3, :3]  # [B,3,3]
        trans = proj[:, :3, 3:4]  # [B,3,1]

        y, x = torch.meshgrid([torch.arange(0, height, dtype=torch.float32, device=src_feature.device),
                               torch.arange(0, width, dtype=torch.float32, device=src_feature.device)])
        y, x = y.contiguous(), x.contiguous()
        y, x = y.view(height * width), x.view(height * width)
        xyz = torch.stack((x, y, torch.ones_like(x)))  # [3, H*W]
        xyz = torch.unsqueeze(xyz, 0).repeat(batch, 1, 1)  # [B, 3, H*W]
        rot_xyz = torch.matmul(rot, xyz)  # [B, 3, H*W]
        rot_depth_xyz = rot_xyz.unsqueeze(2).repeat(1, 1, num_depth, 1) * depth_hypos.view(batch, 1, num_depth,1)  # [B, 3, Ndepth, H*W]
        proj_xyz = rot_depth_xyz + trans.view(batch, 3, 1, 1)  # [B, 3, Ndepth, H*W]
        proj_xy = proj_xyz[:, :2, :, :] / proj_xyz[:, 2:3, :, :]  # [B, 2, Ndepth, H*W]
        proj_x_normalized = proj_xy[:, 0, :, :] / ((width - 1) / 2) - 1
        proj_y_normalized = proj_xy[:, 1, :, :] / ((height - 1) / 2) - 1
        proj_xy = torch.stack((proj_x_normalized, proj_y_normalized), dim=3)  # [B, Ndepth, H*W, 2]
        grid = proj_xy

    warped_src_fea = F.grid_sample(src_feature, grid.view(batch, num_depth * height, width, 2), mode='bilinear',
                                   padding_mode='zeros')
    warped_src_fea = warped_src_fea.view(batch, channels, num_depth, height, width)

    return warped_src_fea

Computing the cost volume:

ref_volume = ref_feature_pyramid[-1].unsqueeze(2).repeat(1, 1, len(depth_hypos[0]), 1, 1)

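volume_sum = ref_volume
# NOTE: pow_ is in-place and returns the same tensor, so after this line volume_sum,
# volume_sq_sum and ref_volume all alias one tensor that holds the squared features;
# an out-of-place `ref_volume ** 2` would keep the reference term of volume_sum unsquared.
volume_sq_sum = ref_volume.pow_(2)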
if self.args.mode == "test":
    del ref_volume
for src_idx in range(self.args.nsrc):
    # warpped features
    warped_volume = homo_warping(src_feature_pyramids[src_idx][-1], ref_in_multiscales[:,-1], src_in_multiscales[:,src_idx,-1,:,:], ref_ex, src_ex[:,src_idx], depth_hypos)

    if self.args.mode == "train":
        volume_sum = volume_sum + warped_volume
        volume_sq_sum = volume_sq_sum + warped_volume ** 2
    elif self.args.mode == "test":
        volume_sum = volume_sum + warped_volume
        volume_sq_sum = volume_sq_sum + warped_volume ** 2
        del warped_volume
    else: 
        print("Wrong!")
        pdb.set_trace()

# Aggregate multiple feature volumes by variance
cost_volume = volume_sq_sum.div_(self.args.nsrc+1).sub_(volume_sum.div_(self.args.nsrc+1).pow_(2))
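
The last line of the snippet relies on the identity $\operatorname{Var}(x)=\mathbb{E}[x^{2}]-(\mathbb{E}[x])^{2}$, which allows the variance to be accumulated from two running sums without keeping all warped volumes in memory. The NumPy sketch below checks this identity on random data; the array shapes are hypothetical, and the accumulators are fresh arrays so that no aliasing is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
# Reference volume plus three warped source volumes, layout [views, F, D, H, W] (assumed).
views = rng.normal(size=(4, 16, 8, 6, 7))

volume_sum = np.zeros_like(views[0])
volume_sq_sum = np.zeros_like(views[0])
for volume in views:                 # one view at a time, as in the loop above
    volume_sum += volume
    volume_sq_sum += volume ** 2

num_views = len(views)
cost_running = volume_sq_sum / num_views - (volume_sum / num_views) ** 2
cost_direct = views.var(axis=0)      # variance computed from all views at once

print(np.allclose(cost_running, cost_direct))
```

Only two accumulators of the size of one volume are needed, regardless of the number of source views.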

Cost Volume for Multi-scale Depth Residual Inference.

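The ultimate goal is to obtain $\mathbf{D}=\mathbf{D}^{0}$ for $\mathbf{I}_{0}$. Starting from $\mathbf{D}^{l+1}$, the depth estimate at level $(l + 1)$, the refined depth map $\mathbf{D}^{l}$ of the next level is obtained iteratively until the bottom level is reached.

$\mathbf{D}^{l+1}$ is first upsampled to the next level via bicubic interpolation, giving $\mathbf{D}_{\uparrow}^{l+1}$, and a partial cost volume is then built to regress the residual depth map $\Delta \mathbf{D}^{l}$, which yields the refined depth map $\mathbf{D}^{l}=\mathbf{D}_{\uparrow}^{l+1}+\Delta \mathbf{D}^{l}$ at level $l$.

Although both this paper and Point-based MVSNet predict depth residuals iteratively, the latter performs convolution on a point cloud, whereas this paper builds a regular 3D cost volume on the depth residual followed by multi-scale 3D convolution, which leads to more compact, faster, and more accurate depth inference. The motivation is that the depth displacements of neighboring pixels are correlated, which suggests that regular multi-scale 3D convolution provides useful contextual information for depth residual estimation. The depth displacement hypotheses are therefore arranged in a regular 3D space, and the cost volume is computed on them.

Given the camera parameters $\left\{\mathbf{K}_{i}^{l}, \mathbf{R}_{i}, \mathbf{t}_{i}\right\}_{i=0}^{N}$ of all views and the upsampled depth estimate $\mathbf{D}_{\uparrow}^{l+1}$, the current depth estimate of each pixel $\mathbf{p}=(u, v)$ is $d_{\mathbf{p}}=\mathbf{D}_{\uparrow}^{l+1}(u, v)$. Let each depth residual hypothesis interval be $\Delta d_{\mathbf{p}}=s_{\mathbf{p}} / M$, where $s_{\mathbf{p}}$ is the depth search range at $\mathbf{p}$ and $M$ is the number of sampled depth residuals. The projection in view $i$ of the hypothesized 3D point with depth $\left(\mathbf{D}_{\uparrow}^{l+1}(u, v)+m \Delta d_{\mathbf{p}}\right)$ is

$\lambda_{i} \mathbf{x}_{i}^{\prime}=\mathbf{K}_{i}^{l}\left(\mathbf{R}_{i} \mathbf{R}_{0}^{-1}\left(\left(\mathbf{K}_{0}^{l}\right)^{-1}(u, v, 1)^{T}\left(d_{\mathbf{p}}+m \Delta d_{\mathbf{p}}\right)-\mathbf{t}_{0}\right)+\mathbf{t}_{i}\right),$
where $\lambda_{i}$ denotes the depth of the corresponding pixel in view $i$, and $m \in\{-M / 2, \cdots, M / 2-1\}$ (Fig. 3).

Then, the cost for that pixel at each depth residual hypothesis is defined in the same way as the variance cost above (Eq. 2 in the paper), which leads to the partial cost volume $\mathbf{C}^{l} \in \mathbb{R}^{H / 2^{l} \times W / 2^{l} \times M \times F}$.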
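
The two ingredients of the partial cost volume, per-pixel residual hypotheses and their projection into a source view, can be sketched in NumPy as follows. The function names, the camera values, and the constant search range are hypothetical; an even $M$ is assumed.

```python
import numpy as np

def residual_depth_hypotheses(depth_up, search_range, num_hypos):
    """d_p + m * (s_p / M) for m = -M/2, ..., M/2 - 1.

    depth_up: [H, W] upsampled depth, search_range: scalar or [H, W] range s_p,
    num_hypos: M (assumed even). Returns [M, H, W].
    """
    interval = np.asarray(search_range, dtype=np.float64) / num_hypos
    m = np.arange(-(num_hypos // 2), num_hypos // 2)
    return depth_up[None] + m[:, None, None] * interval

def project_to_source(u, v, depth, K_ref, R_ref, t_ref, K_src, R_src, t_src):
    """lambda * x' = K_i (R_i R_0^{-1} (K_0^{-1} (u, v, 1)^T d - t_0) + t_i)."""
    ray = np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    point_world = np.linalg.inv(R_ref) @ (ray * depth - t_ref)
    x = K_src @ (R_src @ point_world + t_src)
    return x[:2] / x[2], x[2]        # pixel coordinates and depth in the source view

depth_up = np.full((2, 3), 5.0)                       # hypothetical upsampled depth map
hypos = residual_depth_hypotheses(depth_up, 0.8, 8)   # interval 0.8 / 8 = 0.1

# Hypothetical camera; using it for both views must reproduce the input pixel and depth.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 0.3])
pixel, depth_src = project_to_source(100.0, 50.0, hypos[0, 0, 0], K, R, t, K, R, t)

print(hypos[:, 0, 0])
print(pixel, depth_src)
```

Projecting into a view with the same camera parameters returns the original pixel and depth, which is a convenient consistency check for the formula.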

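The next section describes how to determine the depth search interval and range $s_{\mathbf{p}}$ for all pixels, which is essential for accurate depth estimates.

Depth Map Inference

This section first details how to perform depth sampling at the coarsest image resolution and how to discretize the local depth search range at higher image resolutions to build the cost volumes. A depth map estimator on the cost volumes is then introduced to achieve depth map inference.

Depth Sampling for Cost Volume Pyramid

The depth sampling of virtual depth planes is related to the image resolution. As shown in Fig. 4, it is unnecessary to sample depth planes densely, because the projections of those sampled 3D points in the image would be too close to provide extra information for depth inference.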

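To determine the number of virtual planes, the mean depth sampling interval corresponding to a distance of 0.5 pixel in the image is computed.

To determine the local search range of the depth residual around the current depth estimate of each pixel, its 3D point is first projected into the source views, the points two pixels away from its projection along the epipolar line are found in both directions (see "2 pixel length" in Fig. 3), and those two points are back-projected into 3D rays. The intersections of these two rays with the visual ray in the reference view determine the search range for depth refinement at the current level.

Supplementary derivation (my own):

Let $\mathbf{X}_{i,k}$ be a point in world coordinates, $\mathbf{\tilde{x}}$ its homogeneous pixel coordinates in the reference view, and $\mathbf{x}_{i,k}$ its homogeneous pixel coordinates in source view $i$. Consistent with the projection formula of the paper, we have:
$d_{i,k} \mathbf{\tilde{x}} = \mathbf{K}_{0}^{l} \left( \mathbf{R}_{0} \mathbf{X}_{i,k} + \mathbf{t}_{0} \right) \tag{1}$

$\lambda_{i,k} \mathbf{x}_{i,k} = \mathbf{K}_{i}^{l} \left( \mathbf{R}_{i} \mathbf{X}_{i,k} + \mathbf{t}_{i} \right) \tag{2}$

where $d_{i,k}$ is the depth of $\mathbf{X}_{i,k}$ in the reference view and $\lambda_{i,k}$ is its depth in the source view. Solving (1) for the 3D point gives:
$\Rightarrow \mathbf{X}_{i,k} = \mathbf{R}_{0}^{-1} \left( (\mathbf{K}_{0}^{l})^{-1} d_{i,k} \mathbf{\tilde{x}} - \mathbf{t}_{0} \right) \tag{3}$
Substituting (3) into (2):
$\lambda_{i,k} \mathbf{x}_{i,k} = \mathbf{K}_{i}^{l} \left( \mathbf{R}_{i} \mathbf{R}_{0}^{-1} \left( (\mathbf{K}_{0}^{l})^{-1} d_{i,k} \mathbf{\tilde{x}} - \mathbf{t}_{0} \right) + \mathbf{t}_{i} \right) \tag{4}$
Let $\mathbf{A}^{-1}=\mathbf{K}_{i}^{l} \mathbf{R}_{i} (\mathbf{K}_{0}^{l} \mathbf{R}_{0})^{-1}$, so that:
$\mathbf{A}=\mathbf{K}_{0}^{l} \mathbf{R}_{0} (\mathbf{K}_{i}^{l} \mathbf{R}_{i})^{-1}$
and let $\mathbf{b}=\mathbf{K}_{i}^{l}\left(\mathbf{t}_{i}-\mathbf{R}_{i} \mathbf{R}_{0}^{-1} \mathbf{t}_{0}\right)$, which does not depend on the 3D point. Then (4) becomes:

$\lambda_{i,k} \mathbf{x}_{i,k} = \mathbf{A}^{-1} d_{i,k} \mathbf{\tilde{x}} + \mathbf{b} \quad \Leftrightarrow \quad d_{i,k} \mathbf{\tilde{x}} = \lambda_{i,k} \mathbf{A} \mathbf{x}_{i,k} - \mathbf{A} \mathbf{b}$

Suppose a pixel $\mathbf{p}^{r}$ in the reference view is back-projected at depths $d^r_1$ and $d^r_2$ to two 3D points $\mathbf{X}_1$ and $\mathbf{X}_2$. Projecting these two points into the source view gives two points on the same epipolar line, $\mathbf{p}^s_1$ and $\mathbf{p}^s_2$, with depths $d^s_1$ and $d^s_2$. The following relations hold:
$\begin{aligned} d^r_1 \mathbf{p}^{r} &= d^s_1 \mathbf{A} \mathbf{p}^s_1 - \mathbf{A} \mathbf{b} \\ d^r_2 \mathbf{p}^{r} &= d^s_2 \mathbf{A} \mathbf{p}^s_2 - \mathbf{A} \mathbf{b} \end{aligned}$
The quantity we want is $\Delta d = d^r_1 - d^r_2$. Subtracting the two relations cancels the constant term $\mathbf{A} \mathbf{b}$:
$\begin{aligned} d^r_1 \mathbf{p}^{r} - d^r_2 \mathbf{p}^{r} &= d^s_1 \mathbf{A} \mathbf{p}^s_1 - d^s_2 \mathbf{A} \mathbf{p}^s_2 \\ \Rightarrow \Delta d \, \mathbf{p}^{r} &= d^s_1 \mathbf{A} \mathbf{p}^s_1 - d^s_2 \mathbf{A} \mathbf{p}^s_2 \end{aligned}$
The known quantities are $\mathbf{p}^{r}$, $\mathbf{A}$, $\mathbf{p}^s_1$, $\mathbf{p}^s_2$, and $d^s_1$; the unknowns are $\Delta d$ and $d^s_2$. Rearranging gives a linear system of three equations in two unknowns. Keeping two of the three rows (denoted by the subscript $0:2$) makes it a square system that can be inverted:
$\begin{aligned} & \Delta d \, \mathbf{p}^{r} + d^s_2 \mathbf{A} \mathbf{p}^s_2 = d^s_1 \mathbf{A} \mathbf{p}^s_1 \\ \Rightarrow & \begin{bmatrix} \mathbf{p}^{r} & \mathbf{A} \mathbf{p}^s_2 \end{bmatrix} \begin{bmatrix} \Delta d \\ d^s_2 \end{bmatrix} = d^s_1 \mathbf{A} \mathbf{p}^s_1 \\ \Rightarrow & \begin{bmatrix} \Delta d \\ d^s_2 \end{bmatrix} =\begin{bmatrix} (\mathbf{p}^{r})_{0:2} & (\mathbf{A} \mathbf{p}^s_2)_{0:2} \end{bmatrix}^{-1} d^s_1 (\mathbf{A} \mathbf{p}^s_1)_{0:2} \end{aligned}$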
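
The derivation can be verified numerically. The NumPy sketch below uses hypothetical cameras, back-projects one reference pixel at two depths, projects both points into the source view, and then recovers $\Delta d$ from the linear system. Instead of dropping one row, it solves the full 3-by-2 system in the least-squares sense with `np.linalg.lstsq`, which is a substitution made here for simplicity; the system is consistent, so the result is the same.

```python
import numpy as np

def backproject(pixel, depth, K, R, t):
    """World point whose projection K (R X + t) equals depth * pixel."""
    return np.linalg.inv(R) @ (np.linalg.inv(K) @ pixel * depth - t)

def project(point, K, R, t):
    """Homogeneous pixel (last entry 1) and depth of a world point."""
    x = K @ (R @ point + t)
    return x / x[2], x[2]

# Hypothetical reference and source cameras.
K_ref = K_src = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
R_ref, t_ref = np.eye(3), np.zeros(3)
angle = np.deg2rad(8.0)
R_src = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
t_src = np.array([-0.4, 0.05, 0.1])

p_ref = np.array([350.0, 260.0, 1.0])
d1_ref, d2_ref = 5.4, 5.0                      # true reference depths, delta_d = 0.4

p1_src, d1_src = project(backproject(p_ref, d1_ref, K_ref, R_ref, t_ref), K_src, R_src, t_src)
p2_src, d2_src = project(backproject(p_ref, d2_ref, K_ref, R_ref, t_ref), K_src, R_src, t_src)

A = K_ref @ R_ref @ np.linalg.inv(K_src @ R_src)

# [p_ref, A p2_src] [delta_d, d2_src]^T = d1_src * A p1_src
system = np.stack([p_ref, A @ p2_src], axis=1)   # shape (3, 2)
rhs = d1_src * (A @ p1_src)
solution = np.linalg.lstsq(system, rhs, rcond=None)[0]
delta_d, d2_src_estimated = solution

print("recovered delta_d:", delta_d, "expected:", d1_ref - d2_ref)
```

The recovered $\Delta d$ matches $d^r_1-d^r_2$ even though the translations never enter the system, which confirms that the constant term cancels in the subtraction.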

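Code that computes the depth hypotheses of the current level from an existing depth map:

# Core idea: from the existing depth D, compute a depth sampling interval so that the 2*d hypotheses
# span [D - d*interval, D + (d-1)*interval], where d is the number of fronto-parallel planes on each side of D.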
def calDepthHypo(netArgs, ref_depths,ref_intrinsics, src_intrinsics,ref_extrinsics, src_extrinsics, depth_min, depth_max, level):
    ## Calculate depth hypothesis maps for refine steps

    nhypothesis_init = 48
    d = 4
    pixel_interval = 1

    nBatch = ref_depths.shape[0]
    height = ref_depths.shape[1]
    width = ref_depths.shape[2]

    if netArgs.mode == "train":
        depth_interval = torch.tensor([6.8085]*nBatch).cuda() # Hard code the interval for training on DTU with 1 level of refinement.
        depth_hypos = ref_depths.unsqueeze(1).repeat(1,d*2,1,1)
        for depth_level in range(-d,d):
            depth_hypos[:,depth_level+d,:,:] += (depth_level)*depth_interval[0]
        return depth_hypos

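    # Below: project the 3D point of the current depth into the source image, find the epipolar line,
    # pick two points 1 pixel apart on it, and map them back onto the reference ray to compute the interval.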
    with torch.no_grad():
        ref_depths = ref_depths
        ref_intrinsics = ref_intrinsics.double()
        src_intrinsics = src_intrinsics.squeeze(1).double()
        ref_extrinsics = ref_extrinsics.double()
        src_extrinsics = src_extrinsics.squeeze(1).double()

        interval_maps = []
        depth_hypos = ref_depths.unsqueeze(1).repeat(1,d*2,1,1)
        for batch in range(nBatch):
            xx, yy = torch.meshgrid([torch.arange(0,width).cuda(),torch.arange(0,height).cuda()])
            xxx = xx.reshape([-1]).double()
            yyy = yy.reshape([-1]).double()

            X = torch.stack([xxx, yyy, torch.ones_like(xxx)],dim=0) # reference pixel coordinates (homogeneous)

            D1 = torch.transpose(ref_depths[batch,:,:],0,1).reshape([-1]) # Transpose before reshape to produce identical results to numpy and matlab version.
            D2 = D1+1

            X1 = X*D1
            X2 = X*D2
            ray1 = torch.matmul(torch.inverse(ref_intrinsics[batch]),X1)
            ray2 = torch.matmul(torch.inverse(ref_intrinsics[batch]),X2)

            X1 = torch.cat([ray1, torch.ones_like(xxx).unsqueeze(0).double()],dim=0)
            X1 = torch.matmul(torch.inverse(ref_extrinsics[batch]),X1)
            X2 = torch.cat([ray2, torch.ones_like(xxx).unsqueeze(0).double()],dim=0)
            X2 = torch.matmul(torch.inverse(ref_extrinsics[batch]),X2)

            X1 = torch.matmul(src_extrinsics[batch][0], X1)
            X2 = torch.matmul(src_extrinsics[batch][0], X2)

            X1 = X1[:3]
            X1 = torch.matmul(src_intrinsics[batch][0],X1)
            X1_d = X1[2].clone() # depth of the 3D point in the source view
            X1 /= X1_d # source-image pixel coordinates X1 of the 3D point at the current depth

            X2 = X2[:3]
            X2 = torch.matmul(src_intrinsics[batch][0],X2)
            X2_d = X2[2].clone()
            X2 /= X2_d # source-image pixel coordinates X2 of the 3D point at (current depth + 1)

            # Slope, intercept and angle of the epipolar line
            k = (X2[1]-X1[1])/(X2[0]-X1[0])
            b = X1[1]-k*X1[0]
            theta = torch.atan(k)
            # Another point on the epipolar line, 1 pixel away from X1
            X3 = X1 + torch.stack([ torch.cos(theta)*pixel_interval, torch.sin(theta)*pixel_interval, torch.zeros_like(X1[2,:]) ],dim=0)

            # Transformation matrix (rotation only, translation excluded) that maps source pixel coordinates to reference pixel coordinates
            A = torch.matmul(ref_intrinsics[batch],ref_extrinsics[batch][:3,:3])
            tmp = torch.matmul(src_intrinsics[batch][0],src_extrinsics[batch][0,:3,:3])
            A = torch.matmul(A,torch.inverse(tmp)) 

            tmp1 = X1_d*torch.matmul(A,X1) # map the point to reference pixel coordinates (scaled by depth)
            tmp2 = torch.matmul(A,X3) # map the point to reference pixel coordinates (not scaled by depth)

            M1 = torch.cat([X.t().unsqueeze(2),tmp2.t().unsqueeze(2)],axis=2)[:,1:,:]
            M2 = tmp1.t()[:,1:]
            ans = torch.matmul(torch.inverse(M1),M2.unsqueeze(2))
            delta_d = ans[:,0,0] # depth difference

            interval_maps = torch.abs(delta_d).mean().repeat(ref_depths.shape[2],ref_depths.shape[1]).t()

            for depth_level in range(-d,d):
                depth_hypos[batch,depth_level+d,:,:] += depth_level*interval_maps

        # print("Calculated:")
        # print(interval_maps[0,0])

        # pdb.set_trace()

        return depth_hypos.float() # Return the depth hypothesis map from statistical interval setting.
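
Once the interval is known, building the hypotheses is just an offset loop. Here is a minimal sketch for a single pixel, assuming a scalar depth and interval (`depth_hypotheses` is a hypothetical name, and 600.0 is an arbitrary depth). It mirrors the `range(-d, d)` loop in `calDepthHypo`, so the 2*d hypotheses are not symmetric around the current depth: there are d planes in front of it and d-1 behind it.

```python
def depth_hypotheses(depth, interval, d=4):
    """Return the 2*d depth hypotheses around `depth`, spaced by `interval`."""
    return [depth + m * interval for m in range(-d, d)]


# 6.8085 is the hard-coded DTU training interval from the code above.
hypos = depth_hypotheses(600.0, 6.8085)
print(len(hypos), hypos[0], hypos[-1])
```

The current depth itself is one of the hypotheses (offset `m = 0`), which lets the network keep the upsampled estimate unchanged where it is already good.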

Depth Map Estimator

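As in MVSNet, 3D convolutions are applied to the constructed cost volume pyramid $\left\{\mathbf{C}^{l}\right\}_{l=0}^{L}$ to aggregate context information and output the probability volumes $\left\{\mathbf{P}^{l}\right\}_{l=0}^{L}$, where $\mathbf{P}^{l} \in \mathbb{R}^{H / 2^{l} \times W / 2^{l} \times M}$. The detailed design of the 3D convolution network is given in the paper's appendix.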

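Note that $\mathbf{P}^{L}$ and $\left\{\mathbf{P}^{l}\right\}_{l=0}^{L-1}$ are generated from absolute depth and residual depth, respectively. Soft-argmax is therefore first applied to $\mathbf{P}^{L}$ to obtain the coarse depth map. That depth map is then refined iteratively by applying soft-argmax to $\left\{\mathbf{P}^{l}\right\}_{l=0}^{L-1}$, which yields the depth residuals at higher resolutions.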

At level $L$ the sampled depths are $d=d_{\min }+m\left(d_{\max }-d_{\min }\right) / M$, $m \in\{0,1,2, \cdots, M-1\}$. The depth estimate for each pixel $\mathbf{p}$ is therefore:
$\mathbf{D}^{L}(\mathbf{p})=\sum_{m=0}^{M-1} d \mathbf{P}_{\mathbf{p}}^{L}(d)$
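To further improve the current estimate, i.e. the coarse depth map or the refined depth from level $(l + 1)$, we estimate a residual depth. Let $r_{\mathbf{p}}=m \cdot \Delta d_{\mathbf{p}}^{l}$ denote the depth residual hypothesis; the updated depth at the next level is then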
$\mathbf{D}^{l}(\mathbf{p})=\mathbf{D}_{\uparrow}^{l+1}(\mathbf{p})+\sum_{m=-M / 2}^{(M-2) / 2} r_{\mathbf{p}} \mathbf{P}_{\mathbf{p}}^{l}\left(r_{\mathbf{p}}\right)$
where $l \in\{L-1, L-2, \cdots, 0\}$.
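
The two soft-argmax steps can be written out for a single pixel. This is a dependency-free sketch under stated assumptions, not the network code: the logits stand in for one pixel's column of the regularized cost volume, `coarse_depth` and `refine_depth` are hypothetical names, and the depth range 425 to 935 is only an illustrative choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_depth(logits, d_min, d_max):
    """Soft-argmax over M uniformly sampled absolute depths (level L)."""
    M = len(logits)
    probs = softmax(logits)
    depths = [d_min + m * (d_max - d_min) / M for m in range(M)]
    return sum(d * p for d, p in zip(depths, probs))


def refine_depth(upsampled_depth, logits, delta_d):
    """Add the expected residual r = m * delta_d for m in [-M/2, (M-2)/2]."""
    M = len(logits)
    probs = softmax(logits)
    residuals = [m * delta_d for m in range(-M // 2, M // 2)]
    return upsampled_depth + sum(r * p for r, p in zip(residuals, probs))


# Uniform logits make the expected value easy to verify by hand.
d_coarse = coarse_depth([0.0] * 48, d_min=425.0, d_max=935.0)
d_refined = refine_depth(d_coarse, [0.0] * 8, delta_d=6.8085)
print(d_coarse, d_refined)
```

With uniform probabilities the coarse estimate is the mean of the sampled depths, and the residual is $-0.5\,\Delta d$ because the offsets run from $-M/2$ to $(M-2)/2$ and are therefore not centred on zero.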

According to the paper's experiments, no further depth map refinement is needed after the pyramidal depth estimation to obtain good results.

Depth map inference code:

class CostRegNet(nn.Module):
    def __init__(self):
        super(CostRegNet, self).__init__()

        self.conv0 = ConvBnReLU3D(16, 16, kernel_size=3, pad=1)
        self.conv0a = ConvBnReLU3D(16, 16, kernel_size=3, pad=1)

        self.conv1 = ConvBnReLU3D(16, 32,stride=2, kernel_size=3, pad=1)
        self.conv2 = ConvBnReLU3D(32, 32, kernel_size=3, pad=1)
        self.conv2a = ConvBnReLU3D(32, 32, kernel_size=3, pad=1)
        self.conv3 = ConvBnReLU3D(32, 64, kernel_size=3, pad=1)
        self.conv4 = ConvBnReLU3D(64, 64, kernel_size=3, pad=1)
        self.conv4a = ConvBnReLU3D(64, 64, kernel_size=3, pad=1)

        self.conv5 = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=3, padding=1, output_padding=0, stride=1, bias=False),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True))

        self.conv6 = nn.Sequential(
            nn.ConvTranspose3d(32, 16, kernel_size=3, padding=1, output_padding=1, stride=2, bias=False),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True))

        self.prob0 = nn.Conv3d(16, 1, 3, stride=1, padding=1)

    def forward(self, x):

        conv0 = self.conv0a(self.conv0(x))
        conv2 = self.conv2a(self.conv2(self.conv1(conv0)))
        conv4 = self.conv4a(self.conv4(self.conv3(conv2)))

        conv5 = conv2+self.conv5(conv4)

        conv6 = conv0+self.conv6(conv5)
        prob = self.prob0(conv6).squeeze(1)

        return prob
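
One practical consequence of the stride-2 `conv1` and the stride-2 transposed `conv6` is that the skip connection `conv0+self.conv6(conv5)` only works if every dimension of the volume is restored to its original length. The sketch below applies the standard output-size formulas for convolution and transposed convolution (dilation 1) along one dimension; `conv_out`, `deconv_out` and `roundtrip` are hypothetical helper names, and the sizes tried are just examples.

```python
def conv_out(n, kernel=3, stride=1, pad=1):
    """Output length of a convolution along one dimension (dilation 1)."""
    return (n + 2 * pad - kernel) // stride + 1


def deconv_out(n, kernel=3, stride=1, pad=1, output_padding=0):
    """Output length of a transposed convolution along one dimension."""
    return (n - 1) * stride - 2 * pad + kernel + output_padding


def roundtrip(n):
    """Length after conv1 (stride 2) followed by conv6 (stride 2, output_padding 1)."""
    half = conv_out(n, stride=2)
    return deconv_out(half, stride=2, output_padding=1)


for n in (8, 48, 7):
    print(n, conv_out(n, stride=2), roundtrip(n))
```

Even lengths such as 8 or 48 come back unchanged, while an odd length such as 7 comes back as 8, so with this layer configuration the depth, height and width of the cost volume should be even for the residual addition to line up.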

Computing the depth:

def depth_regression(p, depth_values):
    depth_values = depth_values.view(*depth_values.shape, 1, 1)
    depth = torch.sum(p * depth_values, 1)
    return depth

# Regularize cost volume
cost_reg = self.cost_reg_refine(cost_volume)

prob_volume = F.softmax(cost_reg, dim=1)
depth = depth_regression(prob_volume, depth_values=depth_hypos)

Loss Function

We adopt a supervised learning strategy and construct the pyramid for ground truth depth $\left\{\mathbf{D}_{\mathrm{GT}}^{l}\right\}_{l=0}^{L}$ as the supervisory signal. Similar to the existing MVSNet framework [42], we make use of the $l_{1}$ norm measuring the absolute difference between the ground truth and the estimated depth. For each training sample, our loss is
$\text { Loss }=\sum_{l=0}^{L} \sum_{\mathbf{p} \in \Omega}\left\|\mathbf{D}_{\mathrm{GT}}^{l}(\mathbf{p})-\mathbf{D}^{l}(\mathbf{p})\right\|_{1},$
where $\Omega$ is the set of valid pixels with ground truth measurements.
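
The loss can be sketched in a few lines. This is a simplified, dependency-free illustration and not the training code: depth maps are flattened to plain lists, invalid ground-truth pixels are assumed to be stored as 0 (so $\Omega$ is taken to be the pixels with ground truth greater than 0), and `masked_l1` and `pyramid_loss` are hypothetical names.

```python
def masked_l1(pred, gt):
    """Sum |gt - pred| over pixels whose ground truth is valid (gt > 0)."""
    return sum(abs(g - p) for p, g in zip(pred, gt) if g > 0)


def pyramid_loss(preds, gts):
    """Sum the masked L1 loss over all pyramid levels."""
    return sum(masked_l1(p, g) for p, g in zip(preds, gts))


# Two pyramid levels with made-up depths; the 0.0 marks a pixel without ground truth.
preds = [[600.0, 610.0, 620.0, 630.0], [605.0]]
gts = [[601.0, 0.0, 618.0, 630.0], [604.0]]
loss = pyramid_loss(preds, gts)
print(loss)
```

The pixel without ground truth contributes nothing, so a prediction there is neither rewarded nor penalized.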