The Edge of Depth: Explicit Constraints between Segmentation and Depth

Article link: https://blog.csdn.net/qq_38162944/article/details/115022471


Abstract

In this work we study the mutual benefits of two common computer vision tasks, self-supervised depth estimation and semantic segmentation from images.
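The paper studies how two common computer vision tasks, self-supervised depth estimation and semantic segmentation from images, can benefit each other. (Note: self-supervised learning uses one part of the data to predict another part, so the data itself provides the supervisory signal.)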
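For example, to help unsupervised monocular depth estimation, constraints from semantic segmentation have been explored implicitly, such as sharing and transforming features.
That is, earlier work on unsupervised monocular depth estimation used segmentation constraints only implicitly, for example through shared or transformed features.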
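In contrast, we propose to explicitly measure the border consistency between segmentation and depth and minimize it in a greedy manner by iteratively supervising the network towards a locally optimal solution.
By contrast, the authors measure the border consistency between segmentation and depth explicitly, and reduce the inconsistency greedily by iteratively supervising the network toward a locally optimal solution.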
Partially this is motivated by our observation that semantic segmentation, even trained with limited ground truth (200 images of KITTI), can offer more accurate borders than any (monocular or stereo) image-based depth estimation. Through extensive experiments, our proposed approach advances the state of the art on unsupervised monocular depth estimation on KITTI.
Part of the motivation is the observation that a segmentation network trained on only 200 labelled KITTI images still produces more accurate borders than any image-based depth estimator, monocular or stereo. Extensive experiments show that the approach improves the state of the art for unsupervised monocular depth estimation on KITTI.

1. Introduction

Estimating depth is a fundamental problem in computer vision with notable applications in self-driving [1] and virtual/augmented reality. To solve the challenge, a diverse set of sensors has been utilized, ranging from monocular cameras [12] and multi-view cameras [4] to depth completion from LiDAR [18].
Depth estimation is a basic computer vision problem with important applications in autonomous driving and virtual/augmented reality. The sensors used to tackle it range from monocular cameras and multi-view cameras to LiDAR-based depth completion.
Although the monocular system is the least expensive, it is the most challenging due to scale ambiguity. The current highest performing monocular methods [9,14,22,25,39] are reliant on supervised training, thus consuming large amounts of labelled depth data.
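A monocular system is the cheapest option but also the hardest, because of scale ambiguity. The best-performing monocular methods today depend on supervised training and therefore consume large amounts of labelled depth data.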
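Recently, self-supervised methods with photometric supervision have made significant progress by leveraging unlabeled stereo images [10,12] or monocular videos [35,42,45] to approach the performance of supervised methods.
In recent years, self-supervised methods trained with a photometric signal have advanced considerably: using unlabeled stereo pairs or monocular video, they come close to the performance of supervised methods.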
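To make "photometric supervision" from stereo pairs concrete, here is a minimal NumPy sketch, assuming rectified grayscale images and the convention that left pixel x matches right pixel x - d(x). The function names `warp_right_to_left` and `photometric_l1` are hypothetical and chosen for illustration. Published methods typically combine such an L1 term with other terms (for example a structural similarity term), so this shows the idea and is not the paper's loss.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x).

    Assumes rectified grayscale images of shape (H, W) and a non-negative
    left-view disparity map of the same shape (an illustrative convention).
    """
    h, w = disparity.shape
    # Source column in the right image for every left-image pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)
    # Linear interpolation between the two nearest integer columns.
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


def photometric_l1(left, right, disparity):
    """Mean absolute difference between the left image and its reconstruction."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))
```

A correct disparity makes the reconstruction match the left image, so the loss drops. This is the only training signal, and no depth labels are involved.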
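Yet, self-supervised depth inference techniques suffer from high ambiguity and sensitivity in low-texture regions, reflective surfaces, and the presence of occlusion, likely leading to a sub-optimal solution. To reduce these effects, many works seek to incorporate constraints from external modalities.
However, self-supervised depth inference is highly ambiguous and sensitive in low-texture regions, on reflective surfaces, and under occlusion, which can lead to sub-optimal solutions. To reduce these effects, many works bring in constraints from external modalities.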
For example, prior works have explored leveraging diverse modalities such as optical flow [42], surface normal [40], and semantic segmentation [3,27,36,44].
For example, earlier work has drawn on several modalities, including optical flow, surface normals, and semantic segmentation.
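Optical flow can be naturally linked to depth via ego-motion and object motion, while surface normal can be re-defined as direction of the depth gradient in 3D. Comparatively, semantic segmentation is unique in that, though highly relevant, it is difficult to form a definite relationship with depth.
Optical flow connects to depth naturally through ego-motion and object motion, and a surface normal can be redefined as the direction of the depth gradient in 3D. Semantic segmentation is different: it is highly relevant to depth, yet hard to tie to depth through a definite relationship.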
In response, prior works tend to model the relation of semantic segmentation and depth implicitly [3,27,36,44]. For instance, [3,36] show that jointly training a shared network with semantic segmentation and depth is helpful to both. [44] learns a transformation between semantic segmentation and depth feature spaces. Despite empirically positive results, such techniques lack clear and detailed explanation for their improvement. Moreover, prior work has yet to explore the relationship from one of the most obvious aspects: the shared borders between segmentation and depth.
Earlier work therefore tends to model the relation between semantic segmentation and depth implicitly [3,27,36,44]. For instance, [3,36] show that jointly training a shared network on both tasks helps both, and [44] learns a transformation between the segmentation and depth feature spaces. The empirical results are positive, but these techniques do not explain clearly and in detail where the improvement comes from. Earlier work also has not examined one of the most obvious links: the borders that segmentation and depth share.
Hence, we aim to explicitly constrain monocular self-supervised depth estimation to be more consistent and aligned to its segmentation counterpart.
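The goal is therefore to constrain monocular self-supervised depth estimation explicitly, so that it is more consistent and better aligned with the corresponding segmentation.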
We validate the intuition of segmentation being stronger than depth estimation for estimating object boundaries, even compared to depth from multi-view camera systems [41], thus demonstrating the importance of leveraging this strength (Tab. 3).
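The authors verify the intuition that segmentation is better than depth estimation at locating object boundaries, even when compared with depth from multi-view camera systems [41], which shows why this strength is worth exploiting (Tab. 3).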
We use the distance between segmentation and depth’s edges as a measurement of their consistency. Since this measurement is not differentiable, we cannot directly optimize it as a loss.
The distance between segmentation edges and depth edges serves as the consistency measure. Because this measure is not differentiable, it cannot be optimized directly as a loss.
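The following NumPy sketch illustrates one way such an edge-distance measure could look. It is an assumption for illustration, not the paper's exact definition: edges come from simple forward differences, and the score is the mean distance from each segmentation edge pixel to its nearest depth edge pixel. The names `segmentation_edges`, `depth_edges`, `edge_consistency`, and the `threshold` parameter are hypothetical. The thresholding and the nearest-neighbour minimum are what make a measure of this kind non-differentiable with respect to the predicted disparity.

```python
import numpy as np


def segmentation_edges(labels):
    """Mark pixels whose right or bottom neighbour carries a different label."""
    edges = np.zeros(labels.shape, dtype=bool)
    edges[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    edges[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return edges


def depth_edges(disparity, threshold):
    """Mark pixels whose forward disparity difference exceeds `threshold`."""
    edges = np.zeros(disparity.shape, dtype=bool)
    edges[:, :-1] |= np.abs(np.diff(disparity, axis=1)) > threshold
    edges[:-1, :] |= np.abs(np.diff(disparity, axis=0)) > threshold
    return edges


def edge_consistency(seg_edges, dep_edges):
    """Mean distance (in pixels) from each segmentation edge pixel to the
    nearest depth edge pixel. Lower means better aligned.

    Brute force over all pairs, so only suitable for small maps.
    """
    p = np.argwhere(seg_edges).astype(np.float64)
    q = np.argwhere(dep_edges).astype(np.float64)
    if len(p) == 0 or len(q) == 0:
        return float("inf")
    dists = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())
```

A depth map whose discontinuity sits two pixels away from the label boundary scores 2.0 under this sketch, and a perfectly aligned one scores 0.0.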
Rather, it is optimized as a “greedy search”, such that we iteratively construct a locally optimal augmented disparity map under the proposed measurement and penalize its discrepancy with the original prediction. The construction of the augmented depth map is done via a modified Beier–Neely morphing algorithm [34]. In this way, the estimated depth map gradually becomes more consistent with the segmentation edges within the scene, as demonstrated in Fig. 1.
Instead, the measure is optimized as a "greedy search": at each iteration an augmented disparity map that is locally optimal under the measure is constructed, and its discrepancy with the original prediction is penalized. The augmented depth map is built with a modified Beier–Neely morphing algorithm [34]. The estimated depth map thus becomes gradually more consistent with the segmentation edges in the scene, as shown in Fig. 1.

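Figure 1: The depth borders are explicitly regularized to be consistent with the segmentation borders. A "better" depth $I^*$ is created by morphing according to distilled point pairs (p, q). Penalizing its discrepancy with the original prediction $I$ at every training step gradually yields more consistent borders. The morph is applied over every distilled pair; only one pair is shown because of limited space.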
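The greedy step described above can be sketched on a single disparity row. This is a toy stand-in and not the modified Beier–Neely algorithm from the paper: it resamples a 1D row piecewise-linearly so that a depth edge at position `depth_edge` lands on the segmentation edge at `seg_edge`, then measures an L1 discrepancy against the original prediction. The names `morph_row_toward_edge` and `morph_penalty` and the sub-pixel edge positions are assumptions for illustration.

```python
import numpy as np


def morph_row_toward_edge(row, depth_edge, seg_edge):
    """Build an augmented disparity row whose edge sits at `seg_edge`.

    Toy 1D substitute for image morphing. Target columns [0, seg_edge] are
    stretched over source columns [0, depth_edge], and the remainder of the
    row is mapped likewise. Both edge positions must lie strictly inside
    (0, len(row) - 1).
    """
    w = len(row)
    x = np.arange(w, dtype=np.float64)
    # Source coordinate to sample for each target column.
    src = np.interp(x, [0.0, seg_edge, w - 1.0], [0.0, depth_edge, w - 1.0])
    return np.interp(src, x, row)


def morph_penalty(prediction, augmented):
    """L1 discrepancy between the prediction and the augmented target.

    In a training framework the augmented map would be treated as a constant
    (no gradient flows through it); NumPy has no gradients, so this is only
    the forward value.
    """
    return float(np.mean(np.abs(prediction - augmented)))
```

Repeating the two steps (build the target from the current prediction, then pull the prediction toward it) is what turns a non-differentiable measure into a sequence of ordinary regression targets.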

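Since we use predicted semantic labels [46], noise is inevitably inherited. To combat this, we develop several techniques to stabilize training as well as improve performance.
Because the semantic labels are themselves predicted [46], their noise is inevitably inherited. Several techniques are developed to counter this, stabilizing training and improving performance.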
We also notice recent stereo-based self-supervised methods ubiquitously possess “bleeding artifacts”, which are fading borders around two sides of objects.
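The authors also observe that recent stereo-based self-supervised methods all show "bleeding artifacts": borders that fade out on both sides of objects.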
We trace its cause to occlusions in stereo cameras near object boundaries and resolve it by integrating a novel stereo occlusion mask into the loss, further enabling quality edges and subsequently facilitating our morphing technique.
The cause is traced to stereo occlusions near object boundaries. Integrating a new stereo occlusion mask into the loss resolves the problem, which yields higher-quality edges and in turn helps the morphing technique.
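For intuition about why a stereo occlusion mask helps, here is a minimal NumPy sketch of the geometry. It is an assumption for illustration and not necessarily the paper's formulation. With left pixel x projecting to column x - d(x) in the right image, a left pixel is hidden in the right view when some pixel further right in the same row projects to the same column or to a smaller one. The function name `stereo_occlusion_mask` is hypothetical.

```python
import numpy as np


def stereo_occlusion_mask(disparity):
    """Return True where a left-image pixel is occluded in the right view.

    `disparity` is a left-view disparity map of shape (H, W). Pixels that
    project outside the right image are not handled by this sketch.
    """
    h, w = disparity.shape
    # Column each left pixel lands on in the right image.
    target = np.arange(w, dtype=np.float64)[None, :] - disparity
    # Running minimum of `target`, scanning each row from right to left.
    rev_min = np.minimum.accumulate(target[:, ::-1], axis=1)[:, ::-1]
    # Shift by one so each pixel only sees columns strictly to its right.
    min_right = np.full((h, w), np.inf)
    min_right[:, :-1] = rev_min[:, 1:]
    return min_right <= target
```

Excluding the masked pixels from a photometric loss removes locations where no correct match exists in the other view, which is where a network would otherwise be pushed to blur the border.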
Our contributions can be summarized as follows:
In short, the paper claims the following contributions.
We explicitly define and utilize the border constraint between semantic segmentation and depth estimation, resulting in depth more consistent with segmentation.
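The border constraint between semantic segmentation and depth estimation is defined and used explicitly, which makes the depth more consistent with the segmentation.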
We alleviate the bleeding artifacts in prior depth methods [3,12,13,29] via proposed stereo occlusion mask, furthering the depth quality near objec