MyDLNote-360Camera: panoramic depth estimation combining the equirectangular and cubemap projections (CVPR 2020, BiFuse)

BiFuse: Monocular 360◦ Depth Estimation via Bi-Projection Fusion

[paper] [github] [project]

 

Fig. 1. Our BiFuse network estimates the 360◦ depth from a monocular image using both equirectangular and cubemap projections. A bi-projection fusion component is proposed to leverage both projections inspired by both peripheral and foveal vision of the human eye. Given the estimated 360◦ depth, a complete 3D point cloud surrounding the camera can be generated to serve downstream applications.

Contents

BiFuse: Monocular 360◦ Depth Estimation via Bi-Projection Fusion

Abstract

Introduction

E2C and C2E

Spherical Padding

BiFuse Network

Overview

Bi-Projection Fusion

Loss Function


Abstract

Depth estimation from a monocular 360◦ image is an emerging problem that gains popularity due to the availability of consumer-level 360◦ cameras and the complete surrounding sensing capability.

While the standard of 360◦ imaging is under rapid development, we propose to predict the depth map of a monocular 360◦ image by mimicking both peripheral and foveal vision of the human eye.

To this end, we adopt a two-branch neural network leveraging two common projections: equirectangular and cubemap projections.

In particular, equirectangular projection incorporates a complete field-of-view but introduces distortion, whereas cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Thus we propose a bi-projection fusion scheme along with learnable masks to balance the feature map from the two projections. Moreover, for the cubemap projection, we propose a spherical padding procedure which mitigates discontinuity at the boundary of each face.

We apply our method to four panorama datasets and show favorable results against the existing state-of-the-art methods.

Sentence 1: background and significance.

Sentence 2, motivation: by mimicking the peripheral and foveal vision of the human eye, the paper proposes a method for depth prediction from a monocular 360◦ image.

Sentence 3, strategy: depth prediction is realized by leveraging two common projections, equirectangular and cubemap.

Sentences 4-6, method details: the equirectangular projection covers the full field of view but introduces distortion, while the cubemap projection avoids distortion but introduces discontinuities at the cube boundaries. The paper therefore proposes a bi-projection fusion scheme with learnable masks to balance the feature maps from the two projections. For the cubemap projection, a spherical padding procedure is proposed to mitigate the discontinuity at the boundary of each face. (Algorithmic contributions: bi-projection fusion and spherical padding.)

Sentence 7, experimental results.

 

Introduction

Inferring 3D structure from 2D images has been widely studied due to numerous practical applications. For instance, it is crucial for autonomous systems like self-driving cars and indoor robots to sense the 3D environment since they need to navigate safely in 3D. Among several techniques for 3D reconstruction, significant improvement has been achieved in monocular depth estimation due to the advance of deep learning and availability of large-scale 3D training data. For example, FCRN [16] achieves monocular depth estimation by their proposed up-projection module. However, most of the existing methods are designed for a camera with normal field-of-view (FoV). As 360◦ camera becomes more and more popular in recent years, the ability to infer the 3D structure of a camera’s complete surrounding has motivated the study of monocular 360◦ depth estimation.

Background: worth a careful read, as this background is written quite well. It starts from the practical demand for 3D applications, presents depth estimation as an important approach to 3D reconstruction, and finally makes the point that monocular 360◦ depth estimation matters. The line of thought is very clear.

 

In this paper, we propose an end-to-end trainable neural network leveraging two common projections – equirectangular and cubemap projection – as inputs to predict the depth map of a monocular 360◦ image. Our main motivation is to combine the capability from both peripheral and foveal vision like the human eye (see Fig. 1 for the illustration). Note that, equirectangular projection provides a wide field-of-view mimicking a peripheral vision, whereas cubemap projection provides a smaller but non-distorted field-of-view mimicking the foveal vision.

On the one hand, equirectangular projection allows all surrounding information to be observed from a single 2D image but introduces distortion. On the other hand, cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Considering that the two projections have complementary properties, we refer to our method as BiFuse.

The paper uses the equirectangular and cubemap projections as inputs to predict the depth map of a monocular 360◦ image. The main motivation is to combine peripheral and foveal vision like the human eye (see Fig. 1). The equirectangular projection provides a full field of view, mimicking peripheral vision, while the cubemap projection provides a smaller, undistorted view, mimicking foveal vision. (This is the core idea of the whole paper.)

On the one hand, the equirectangular projection lets all surrounding information be observed from a single 2D image, but it introduces distortion. On the other hand, the cubemap projection avoids distortion but introduces discontinuities at the cube boundaries. Since the two projections complement each other, the paper proposes BiFuse.

 

However, the FoV of the foveal vision could be too small, which degrades the effectiveness of our fusion scheme (Fig. 2). To tackle this issue, cube padding (CP) methods [26, 4] have been proposed to expand field-of-view from neighboring faces on the cube. Nevertheless, using cube padding may result in geometric inconsistency at the boundary that introduces non-negligible distortion effect. Therefore, we propose spherical padding (SP) which pads the boundary by considering the spherical geometry and reduces the boundary inconsistency. Finally, instead of naively combining features of both branches (e.g., [31]), we propose a bi-projection fusion procedure with learnable masks to balance the information shared between two projections. The source code and pretrained models are available to the public.

Fig.2. Field-of-view (FoV) comparison.

Equirectangular projection has the largest FoV compared to each face on the cubemap projection with (solid-line) or without (dash-line) the proposed spherical padding.

 

As a further refinement, the paper proposes spherical padding to alleviate the discontinuity that cube padding leaves at the boundaries of the cubemap faces.

For fusing the equirectangular and cubemap features, a mask is learned so that the two kinds of features are fused selectively.

 

We apply our method to four panorama datasets: Matterport3D [3], PanoSUNCG [26], 360D [38] and Stanford2D3D [1]. Our experimental results show that the proposed method performs favorably against the current state-of-the-art (SOTA) methods. In addition, we present extensive ablation study for each of the proposed modules, including the spherical padding and fusion schemes.

Datasets and experimental results. The datasets matter here. An earlier post on this blog describes how to download Matterport3D [3], PanoSUNCG [26] and Stanford2D3D [1]; see: Panoramic (360◦ camera) image datasets: downloading the 3D60 Dataset, step by step.

[3] Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV), 2017.

[26] Self-supervised learning of depth and camera motion from 360◦ videos. In Asian Conference on Computer Vision (ACCV), 2018.

[38] OmniDepth: Dense depth estimation for indoors spherical panoramas. In European Conference on Computer Vision (ECCV), 2018.

[1] Joint 2D-3D-semantic data for indoor scene understanding. CoRR, 2017.

 

E2C and C2E

For a cubemap representation with sides of equal length w, we denote its six faces as f_i, i ∈ {B, D, F, L, R, U}, corresponding to the ones on the back, down, front, left, right and up, respectively. Each face can be treated as the image plane of an independent camera with focal length w/2, in which all these cameras share the same center of projection (i.e., the center of the cube) but with different poses. When we set the origin of the world coordinate system to the center of the cube, the extrinsic matrix of each camera coordinate system can be simply defined by a rotation matrix R_{f_i} and zero translation. Given a pixel p_i on the image plane f_i with its coordinate (x, y, z) on the corresponding camera system, where 0 ≤ x, y ≤ w-1 and z = w/2, we can transform it into the equirectangular representation by a simple mapping:

where θ_{f_i} and φ_{f_i} are longitude and latitude in equirectangular projection; and q_{x_i}, q_{y_i}, q_{z_i} are the x, y, z components of q_i respectively. As this mapping is reversible, we are able to easily perform both equirectangular-to-cube and cube-to-equirectangular transformations, which are denoted as E2C and C2E, respectively. A more detailed illustration is shown in the supplementary material.

Only the transformation from a single face image to the equirectangular image is given here; the inverse formula is not. The paper says it is in the supplementary material, but neither CVF Open Access nor the project page provides that supplementary material.
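To make the mapping concrete, here is a minimal NumPy sketch of taking a pixel on cube face f_i to equirectangular longitude/latitude. The face rotation matrices and the arctan/arcsin convention are my own assumptions based on standard spherical geometry, since the paper's exact equation (1) and its supplementary derivation are not available; treat this as a sketch rather than the authors' implementation.

```python
import numpy as np

# Hypothetical rotation matrices for each cube face (identity = front face).
# The exact pose convention is an assumption; the paper only states that each
# face camera is defined by a rotation R_{f_i} and zero translation.
R_FACES = {
    'F': np.eye(3),
    'R': np.array([[ 0, 0, 1], [0, 1, 0], [-1, 0, 0]], dtype=float),  # yaw +90°
    'B': np.array([[-1, 0, 0], [0, 1, 0], [ 0, 0,-1]], dtype=float),  # yaw 180°
    'L': np.array([[ 0, 0,-1], [0, 1, 0], [ 1, 0, 0]], dtype=float),  # yaw -90°
    'U': np.array([[ 1, 0, 0], [0, 0,-1], [ 0, 1, 0]], dtype=float),  # pitch up
    'D': np.array([[ 1, 0, 0], [0, 0, 1], [ 0,-1, 0]], dtype=float),  # pitch down
}

def face_pixel_to_lonlat(x, y, w, face):
    """Map pixel (x, y) on cube face `face` (side length w, focal length w/2)
    to (longitude, latitude) on the sphere."""
    # Camera-space ray of the pixel: centre the image plane and set z = w/2.
    p = np.array([x - (w - 1) / 2.0, y - (w - 1) / 2.0, w / 2.0])
    q = R_FACES[face] @ p                      # rotate into the shared world frame
    lon = np.arctan2(q[0], q[2])               # theta: longitude
    lat = np.arcsin(q[1] / np.linalg.norm(q))  # phi: latitude
    return lon, lat
```

Reversing these steps (longitude/latitude back to a face pixel) gives the C2E direction, which is why the text describes the mapping as reversible.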

 

Spherical Padding

Due to the distortion in the equirectangular projection, directly learning a typical convolutional neural network to perform monocular depth estimation on equirectangular images would lead to unstable training process and unsatisfying prediction [4]. In contrast, the cubemap representation suffers less from distortion but instead produces large errors due to the discontinuity across the boundaries of each face [4, 26]. In order to resolve this issue for cubemap projection, Cheng et al. [4] propose the cube padding (CP) approach to utilize the connectivity between faces on the cube for image padding. However, solely padding the feature map of a face by using the features from its neighboring faces does not follow the characteristic of perspective projection. Therefore, here we propose the spherical padding (SP) method, which pads the feature according to spherical projection. As such, we can connect each face with the geometric relationship. A comparison between the cube padding [4] and our proposed spherical padding is illustrated in Fig. 3.

Figure 3. Spherical padding v.s. cube padding. Cube padding directly pads the feature of the connected faces. In addition to obvious inconsistency at the boundary, the values of four corners are undefined. In [4], the values are only chosen by the closest side. In our proposed spherical padding, the padding area is calculated with spherical projection. As a result, both the missing corner and inconsistency at the boundary can be addressed.

This paragraph explains why SP is proposed and gives the intuition behind it. To make the boundary of each cubemap face continuous, CP directly extends part of each neighboring face, but it ignores the geometry: the four corner regions remain empty and inconsistent. SP fills in those corners as well. How is this done, and what does it buy us? The following paragraphs explain.

The most straightforward way to apply spherical padding for a cubemap is to first transform all the faces into a unified equirectangular image by C2E. Then, we extend the original FoV σ = 90◦ to σ′, and map it back to the cubemap by E2C. As a result, we can pad each face completely, without missing parts (i.e., the undefined areas in cube padding of Fig. 3) and with consistent geometry. Specifically, given a cubemap with side length w and FoV σ = 90◦, the C2E transformation is identical to the inverse calculation of (1). When we apply spherical padding with padding size γ, which is determined by the padding size in the convolution layer (e.g., γ = 1 for a 3×3 convolution layer), we update the side length of a cube face to w′ = w + 2γ, and the corresponding FoV becomes σ′ = 2·arctan((w/2 + γ)/(w/2)) after padding, as illustrated in Fig. 5. Hence, for mapping from the equirectangular image back to the padded cubemap, we should use both w′ and σ′ to derive the correct E2C transformation for spherical padding.

Figure 5. The cubemap with length w and padding size γ. We keep the focal length the same (0.5w) and calculate a new FoV σ ′ .
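As a quick sanity check of this relation, a few lines of Python compute w′ and σ′; the face size of 256 below is just an assumed example value.

```python
import math

def padded_fov_deg(w, gamma):
    """FoV of a cube face of side w after adding `gamma` pixels of padding on
    each side, keeping the focal length fixed at w/2: 2*atan((w/2+gamma)/(w/2))."""
    return 2 * math.degrees(math.atan((w / 2 + gamma) / (w / 2)))

w, gamma = 256, 1
print(w + 2 * gamma, padded_fov_deg(w, gamma))  # side w' = 258, FoV ≈ 90.45°
```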

Efficient Transformation. We have described the overall concept of our spherical padding. However, the above procedure consists of both C2E and E2C transformations, which could require heavy computational cost. Therefore, we simplify this procedure by deriving a direct mapping function between two cube faces. Given two cube faces f_i and f_j, we first denote the geometric transformation between their camera coordinate systems as a rotation matrix R_{f_i→f_j}. Then the mapping from a pixel p_i in f_i to f_j can be established upon the typical projection model of pinhole cameras:

 

where (x, y) represents the 2D location of p_i after being mapped onto the image plane of f_j. Since this mapping only needs to be computed once for all the pixels on the padding region, the computational cost of applying spherical padding is comparable with cube padding, without any E2C or C2E transformation included.

First, the paper describes the most straightforward SP approach: convert the cubemap into an equirectangular image, then convert it back to a cubemap with a slightly enlarged FoV σ′, as in Fig. 5. How much σ′ grows depends on the padding size of the convolution kernel, and the paper gives the formula for this angle. The problem is that this requires both the C2E and E2C transformations, which is wasteful. The authors therefore derive a direct mapping that needs neither C2E nor E2C. The details are not spelled out here (no need to dig into them until they are actually used).
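Still, the idea behind the direct mapping is short enough to sketch: rotate the pixel ray of face f_i into the camera frame of face f_j and reproject it with the shared pinhole model. The centring and focal-length conventions below are my assumptions (consistent with the E2C sketch above), not the paper's exact equation (2).

```python
import numpy as np

def map_pixel_between_faces(x, y, w, R_i_to_j):
    """Project pixel (x, y) of face f_i onto the image plane of face f_j.
    R_i_to_j is the rotation between the two face cameras; the centring and
    the focal length w/2 follow the same convention as the E2C sketch above."""
    p_i = np.array([x - (w - 1) / 2.0, y - (w - 1) / 2.0, w / 2.0])
    p_j = R_i_to_j @ p_i                  # the same ray, expressed in f_j's frame
    p_j = p_j * (w / 2.0) / p_j[2]        # pinhole projection onto the plane z = w/2
    return p_j[0] + (w - 1) / 2.0, p_j[1] + (w - 1) / 2.0
```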

 

BiFuse Network

Overview

Overall, our model consists of two encoder-decoder branches which take the equirectangular image and cubemap as input, respectively, where we denote the equirectangular branch as Be and the cubemap one as Bc. As mentioned in Sec. 1, each branch has its benefit but also suffers from some limitations. To jointly learn a better model while sharing both advantages, we utilize a bi-projection fusion block that bridges the information across two branches, which will be described in the following. To generate the final prediction, we first convert the prediction of cubemap to the equirectangular view and adopt a convolution module to combine both predictions.

Figure 4. The proposed BiFuse Network. Our network consists of two branches Be and Bc. The input of Be is an RGB equirectangular image, while Bc takes the corresponding cubemap as input. We replace the first convolution layer in Be with a Pre-Block [38, 23]. For the decoder, we adopt up-projection [16] modules. For each convolution and up-projection layer in Bc, we apply our spherical padding to connect feature maps of six faces. Most importantly, between feature maps from Be and Bc, we use the proposed bi-projection fusion module to share information between two feature representations. Finally, we add a Conv module [24] to unify two depth predictions from Be and Bc.

Overall, the BiFuse network consists of two branches: one estimates depth from the equirectangular image and the other from the cubemap. Each has its own strengths and weaknesses. To let the strengths of one branch benefit the other, a Bi-Projection Fusion module passes information between the corresponding encoder-decoder blocks of the two branches. Finally, the depth map estimated by the cubemap branch is converted to the equirectangular domain, concatenated with the depth map from the equirectangular branch, and passed through a convolution to produce the final output.
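Schematically, the forward pass looks roughly like the following. Be, Bc, C2E and final_conv are placeholders for the two encoder-decoder branches, the cube-to-equirectangular op and the final Conv module; none of these names are taken from the released code, so read this as a sketch of the data flow only.

```python
import torch

def bifuse_forward(equi_img, cube_img, Be, Bc, C2E, final_conv):
    """Schematic BiFuse forward pass with placeholder callables (assumption,
    not the authors' implementation). Tensors are assumed to be NCHW."""
    depth_e = Be(equi_img)                 # depth from the equirectangular branch
    depth_c = Bc(cube_img)                 # depth from the cubemap branch (6 faces)
    depth_c_equi = C2E(depth_c)            # bring the cubemap prediction to the equi view
    # concatenate both predictions and unify them with one last conv module
    final = final_conv(torch.cat([depth_e, depth_c_equi], dim=1))
    return final, depth_e, depth_c
```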

Bi-Projection Fusion is therefore the key part of this network.

 

Bi-Projection Fusion

How do we fuse features from two different domains? The simplest option is of course a concatenate operation, but that is not optimal, or at least does not let the network reach its full efficiency, so there is room for improvement. The fusion module proposed here is quite clever and the idea is worth borrowing.

To encourage information sharing across the two branches, we empirically find that directly combining feature maps [31] from Be and Bc would result in unstable gradients and an unstable training procedure, and thus it is key to develop a fusion scheme to balance the two branches. Inspired by the recent works in multi-tasking [5, 36], we focus on balancing the feature maps from the two different representations. To achieve this goal, we propose a bi-projection fusion module H: given feature maps h_e and h_c from Be and Bc in each layer respectively, we estimate the corresponding feature maps h′_e = H_e(h_e) and h′_c = H_c(C2E(h_c)), where H_e and H_c each indicate a convolution layer.

To produce feature maps that benefit both branches, we first concatenate h′_e and h′_c, and then pass the result to a convolution layer with the sigmoid activation to estimate a mask M to balance the fusion procedure. Finally, we generate feature maps h̄_e and h̄_c as the input to the next layer as:

Note that we use C2E and E2C operations in the fusion procedure to ensure that features and the mask M are in the same projection space.

First, when fusing features from two domains, they have to be brought into one common domain, either an intermediate one or one of the two, because fusion is only effective once the domains agree. Here, the cubemap features are converted into the equirectangular domain.

Then the two features, now in the same domain, can legitimately be concatenated.

Next, the concatenated features go through a convolution layer and a sigmoid to learn a mask M. This is clearly an attention-style mechanism.

Finally, the attention formula (3) produces a new, enhanced feature for each domain. Note that the newly learned feature component lives in the equirectangular domain; for the cubemap branch, this new feature has to be transformed back into the cubemap domain.
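A minimal PyTorch-style sketch of this mask-based fusion is given below. It assumes that C2E/E2C are available as differentiable resampling ops and that the fused features are added back residually; the exact combination in the paper's equation (3) may differ from this.

```python
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    """Sketch of the fusion block: learn a mask M from the concatenated
    features and use it to exchange information between the two branches.
    `c2e` / `e2c` are assumed to be differentiable projection ops."""
    def __init__(self, channels, c2e, e2c):
        super().__init__()
        self.conv_e = nn.Conv2d(channels, channels, 3, padding=1)   # H_e
        self.conv_c = nn.Conv2d(channels, channels, 3, padding=1)   # H_c
        self.mask_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.c2e, self.e2c = c2e, e2c

    def forward(self, h_e, h_c):
        h_e_p = self.conv_e(h_e)                 # h'_e
        h_c_p = self.conv_c(self.c2e(h_c))       # h'_c, now in the equi domain
        m = torch.sigmoid(self.mask_conv(torch.cat([h_e_p, h_c_p], dim=1)))
        h_e_new = h_e + m * h_c_p                # enhanced equirectangular feature
        h_c_new = h_c + self.e2c(m * h_e_p)      # enhanced cubemap feature (back to cube)
        return h_e_new, h_c_new
```

The sigmoid mask plays the role of the attention weights discussed above, deciding how much of the other branch's feature flows into each branch.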

[5] SegFlow: Joint learning for video object segmentation and optical flow. In IEEE International Conference on Computer Vision (ICCV), 2017. [github]

[36] Joint task-recursive learning for semantic segmentation and depth estimation. In European Conference on Computer Vision (ECCV), 2018.

 

Loss Function

We adopt the reverse Huber loss [16] as the objective function for optimizing predictions from both Be and Bc:

The overall objective function is then written as:

where D_e and D_c are the predictions produced by Be and Bc respectively; D_GT is the ground-truth depth in the equirectangular representation; and P indicates all pixels with a valid depth value in the ground-truth map. We note that the C2E operation is required to convert D_c into the equirectangular form before computing the loss.

This part is easy to follow: the reverse Huber loss B is defined first, and then the losses of the two branches are computed with respect to B. B is not described further here; see the references:

[16] Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), 2016.

[36] Joint task-recursive learning for semantic segmentation and depth estimation. In European Conference on Computer Vision (ECCV), 2018.
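For reference, here is a sketch of the standard reverse Huber (berHu) loss as defined in [16]: L1 below a threshold c and a scaled L2 above it. The choice c = 0.2·max|error| follows the usual convention from [16]; whether BiFuse uses exactly this threshold and how the two branch losses are weighted is an assumption here, not something stated in the excerpt above.

```python
import torch

def berhu_loss(pred, target, valid_mask):
    """Reverse Huber (berHu) loss as in [16]: L1 below the threshold c,
    scaled L2 above it. c = 0.2 * max|error| is the conventional choice;
    the exact threshold used by BiFuse is an assumption."""
    diff = (pred - target).abs()[valid_mask]       # only pixels with valid GT depth (P)
    c = 0.2 * diff.max().detach()
    l1 = diff[diff <= c]                           # linear part
    l2 = diff[diff > c]                            # quadratic part
    return torch.cat([l1, (l2 ** 2 + c ** 2) / (2 * c)]).mean()
```

The overall objective would then apply this loss to both D_e and C2E(D_c) against D_GT over the valid pixels P, as described above.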

 

 
