Convolutions for 360° Cameras (Panoramic Images), Part 2: SphereNet: Spherical Representations

Convolutions for 360° Cameras (Panoramic Images), Part 1: Equirectangular Convolutions

Convolutions for 360° Cameras (Panoramic Images), Part 2: SphereNet: Spherical Representations

Convolutions for 360° Cameras (Panoramic Images), Part 3: Spherical Convolution

 

 

SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images


[ECCV 2018 paper]  [PyTorch]

Abstract

Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of view is essential, such as in virtual reality applications or in autonomous robots.

Unfortunately, standard convolutional neural networks are not well suited for this scenario as the natural projection surface is a sphere which cannot be unwrapped to a plane without introducing significant distortions, particularly in the polar regions.

In this work, we present SphereNet, a novel deep learning framework which encodes invariance against such distortions explicitly into convolutional neural networks. Towards this goal, SphereNet adapts the sampling locations of the convolutional filters, effectively reversing distortions, and wraps the filters around the sphere. By building on regular convolutions, SphereNet enables the transfer of existing perspective convolutional neural network models to the omnidirectional case.

We demonstrate the effectiveness of our method on the tasks of image classification and object detection, exploiting two newly created semi-synthetic and real-world omnidirectional datasets.

The abstract in four parts:

Background: Compared with classical cameras, omnidirectional cameras offer clear advantages wherever a wide field of view is essential, for example in virtual reality applications or autonomous robots.

Problem: Standard convolutional neural networks are not well suited to omnidirectional images, because the natural projection surface is a sphere, which cannot be unwrapped onto a plane without introducing significant distortions, especially in the polar regions.

This work: SphereNet is proposed. It adapts the sampling locations of the convolutional filters, effectively reversing the distortions, and wraps the filters around the sphere. By building on regular convolutions, SphereNet enables the transfer of existing perspective CNN models to the omnidirectional case.

Validation: The method's effectiveness is demonstrated on image classification and object detection, using two newly created semi-synthetic and real-world omnidirectional datasets.

 

Contributions

• We introduce SphereNet, a framework for learning spherical image representations by encoding distortion invariance into convolutional filters. SphereNet retains the original spherical image connectivity and, by building on regular convolutions, enables the transfer of perspective CNN models to omnidirectional inputs.

• We improve the computational efficiency of SphereNet using an approximately uniform sampling of the sphere which avoids oversampling in the polar regions.

• We create two novel semi-synthetic and real-world datasets for object detection in omnidirectional images.

• We demonstrate improved performance as well as SphereNet’s transfer learning capabilities on the tasks of image classification and object detection and compare our results to several state-of-the-art baselines.

Four contributions: 1) introduce SphereNet; 2) improve the computational efficiency of SphereNet; 3) create datasets; 4) demonstrate performance.

 

Related Work

1. Khasanova et al. [14] propose a graph-based approach for omnidirectional image classification.

However, graph convolutional networks are limited to small graphs and image resolutions (50 × 50 pixels in [15]).

2. Cohen et al. [3] propose to use spherical CNNs for classification and encode rotation equivariance into the network.

However, often full rotation invariance is not desirable: similar to regular images, 360° images are mostly captured in one dominant orientation.

3. Su et al. [30] propose to process equirectangular images with regular convolutions by increasing the kernel size towards the polar regions.

However, this adaptation of the convolutional filters is a simplistic approximation of distortions in the equirectangular representation and implies that weights can only be shared along each row, resulting in a significant increase in model parameters.

4. Cube map projections are considered in [19, 22]. However, this approach does not remove distortions but only minimizes their effect.

5. Geometric transformations: Jaderberg et al. [11] introduce a separate network which learns to predict the parameters of a spatial transformation of an input feature map.

6. Geometric transformations: Scattering convolution networks [1,25] use predefined wavelet filters to encode stable geometric invariants into networks while other recent works encode invariances into learned convolutional filters [4, 9, 29, 31].

7. Several recent works also consider adapting the sampling locations of convolutional networks, either dynamically [5] or statically [12, 18].

Items 1-4 above are earlier lines of work and are not discussed further here.

The most recent directions are of two kinds: methods based on geometric transformations and methods based on adapted local sampling.

 

Method

 

 

Kernel Sampling Pattern

The central idea of SphereNet is to lift local CNN operations (e.g. convolution, pooling) from the regular image domain to the sphere surface where fisheye or omnidirectional images can be represented without distortions. This is achieved by representing the kernel as a small patch tangent to the sphere as illustrated in Fig. 1d. Our model focuses on distortion invariance and not rotation invariance, as in practice 360° images are mostly captured in one dominant orientation. Thus, we consider upright patches which are aligned with the great circles of the sphere.

Fig. 1(d): Our SphereNet kernel exploits projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs which are invariant to latitudinal rotations.

Three core ideas:

1. The central idea of SphereNet is to lift local CNN operations (e.g., convolution and pooling) from the regular image domain onto the sphere surface;

2. The kernel is represented as a small patch tangent to the sphere;

3. The focus is on distortion invariance, not rotation invariance.

 

More formally, let S be the unit sphere with S^2 its surface. Every point s = (\varphi, \theta) \in S^2 is uniquely defined by its latitude \varphi \in [-\pi/2, \pi/2] and longitude \theta \in [-\pi, \pi]. Let further \Pi denote the tangent plane located at s_{\Pi} = (\varphi_{\Pi}, \theta_{\Pi}). We denote a point on \Pi by its coordinates x \in R^2. The local coordinate system of \Pi is hereby centered at s_{\Pi} and oriented upright. Let \Pi_0 denote the tangent plane located at s_{\Pi} = (0, 0). A point s on the sphere is related to its tangent plane coordinates x via a gnomonic projection [20].

This gives the coordinate representations for the sphere S and for the tangent plane \Pi. Spherical coordinates and tangent-plane coordinates are related through the gnomonic projection [20].

 

While the proposed approach is compatible with convolutions of all sizes, in the following we consider a 3 × 3 kernel, which is most common in state-of-the-art architectures [8, 26]. We assume that the input image is provided in equirectangular format which is the de facto standard representation for omnidirectional cameras of all form factors (e.g. catadioptric, dioptric or polydioptric). In Section 3.2 we consider a more efficient representation that improves the computational efficiency of our method.

Although the proposed approach is compatible with convolutions of all sizes, in the following a 3 × 3 kernel is considered. The input image is assumed to be in equirectangular format, which is the de facto standard representation for omnidirectional cameras of all form factors (e.g. catadioptric, dioptric, or polydioptric). Section 3.2 considers a more efficient representation that improves the computational efficiency of the method.

The kernel shape is defined so that its sampling locations s(j,k), with j, k \in \{-1, 0, 1\} for a 3 × 3 kernel, align with the step sizes ∆θ and ∆φ of the equirectangular image at the equator. This ensures that the image can be sampled at \Pi_0 without interpolation:

s(0,0) = (0, 0)    (1)
s(±1,0) = (±∆φ, 0)    (2)
s(0,±1) = (0, ±∆θ)    (3)
s(±1,±1) = (±∆φ, ±∆θ)    (4)

The position of these filter locations on the tangent plane \Pi_0 can be calculated via the gnomonic projection [20].
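For reference, the gnomonic projection [20] mapping a sphere point s = (\varphi, \theta) to coordinates (x, y) on the tangent plane at s_{\Pi} = (\varphi_{\Pi}, \theta_{\Pi}) takes the standard form (Eqs. (5)-(6) in the paper):

x = \frac{\cos \varphi \sin(\theta - \theta_{\Pi})}{\sin \varphi_{\Pi} \sin \varphi + \cos \varphi_{\Pi} \cos \varphi \cos(\theta - \theta_{\Pi})}    (5)

y = \frac{\cos \varphi_{\Pi} \sin \varphi - \sin \varphi_{\Pi} \cos \varphi \cos(\theta - \theta_{\Pi})}{\sin \varphi_{\Pi} \sin \varphi + \cos \varphi_{\Pi} \cos \varphi \cos(\theta - \theta_{\Pi})}    (6)

At \Pi_0 (\varphi_{\Pi} = \theta_{\Pi} = 0) this reduces to x = \tan \theta and y = \tan \varphi / \cos \theta.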

For the sampling pattern s(j,k), this yields the following kernel pattern x(j,k) on \Pi_0:

x(0,0) = (0, 0)    (7)
x(±1,0) = (0, ±tan ∆φ)    (8)
x(0,±1) = (±tan ∆θ, 0)    (9)
x(±1,±1) = (±tan ∆θ, ±tan ∆φ / cos ∆θ)    (10)

We keep the kernel shape on the tangent plane fixed. When applying the filter at a different location s_{\Pi} = (\varphi_{\Pi}, \theta_{\Pi}) of the sphere, the inverse gnomonic projection is applied.
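The inverse gnomonic projection used here is likewise standard [20] (Eq. (11) in the paper):

\varphi(x, y) = \arcsin\left( \cos \nu \sin \varphi_{\Pi} + \frac{y \sin \nu \cos \varphi_{\Pi}}{\rho} \right), \quad \theta(x, y) = \theta_{\Pi} + \arctan\left( \frac{x \sin \nu}{\rho \cos \varphi_{\Pi} \cos \nu - y \sin \varphi_{\Pi} \sin \nu} \right)    (11)

where \rho = \sqrt{x^2 + y^2} and \nu = \arctan \rho.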

Given the kernel coordinates on the sphere S (Eqs. (1)-(4)), the corresponding kernel coordinates projected onto the tangent plane are given by Eqs. (7)-(10).

Equations (5)-(6) give the gnomonic projection, i.e. the mapping from the sphere to the tangent plane; Eq. (11) gives the inverse gnomonic projection, from the tangent plane back to the sphere.

 

The sampling grid locations of the convolutional kernels thus get distorted in the same way as objects on a tangent plane of the sphere get distorted when projected from different elevations to an equirectangular image representation. Fig. 2 demonstrates this concept by visualizing the sampling pattern at two different elevations φ.

Thus the sampling grid locations of the convolutional kernels are distorted in exactly the way objects on a tangent plane of the sphere are distorted when projected, from different elevations, into the equirectangular representation. Fig. 2 illustrates this by visualizing the sampling pattern at two different elevations φ.

Fig. 2: Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b) representation. Note the distortion of the kernel at φ = 1.2 in (b).
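As a concrete illustration, here is a minimal NumPy sketch (not the authors' released code; function names and the grid convention are ours) that computes the nine sampling locations shown in Fig. 2 by pushing the tangent-plane pattern of Eqs. (7)-(10) through the inverse projection of Eq. (11):

```python
import numpy as np

def kernel_sampling_pattern(phi_c, theta_c, d_phi, d_theta):
    """Spherical sampling locations (phi, theta) of a 3x3 SphereNet kernel
    centered at (phi_c, theta_c), via the inverse gnomonic projection."""
    # Kernel pattern on the tangent plane Pi_0, Eqs. (7)-(10).
    j, k = np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij")
    x = np.tan(k * d_theta)
    y = np.tan(j * d_phi) / np.cos(k * d_theta)
    # Inverse gnomonic projection back onto the sphere, Eq. (11).
    rho = np.sqrt(x**2 + y**2)
    nu = np.arctan(rho)
    safe_rho = np.where(rho > 0, rho, 1.0)  # avoid 0/0 at the kernel center
    phi = np.arcsin(np.cos(nu) * np.sin(phi_c)
                    + y * np.sin(nu) * np.cos(phi_c) / safe_rho)
    theta = theta_c + np.arctan2(
        x * np.sin(nu),
        rho * np.cos(phi_c) * np.cos(nu) - y * np.sin(phi_c) * np.sin(nu))
    return phi, theta

# Fig. 2 setup: a regular 3x3 grid at the equator (phi = 0), but a pattern
# that spreads out in longitude at elevation phi = 1.2.
phi_eq, theta_eq = kernel_sampling_pattern(0.0, 0.0, np.pi / 64, np.pi / 64)
phi_po, theta_po = kernel_sampling_pattern(1.2, 0.0, np.pi / 64, np.pi / 64)
```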

 

Besides encoding distortion invariance into the filters of convolutional neural networks, SphereNet additionally enables the network to wrap its sampling locations around the sphere. As SphereNet uses custom sampling locations for sampling inputs or intermediary feature maps, it is straightforward to allow a filter to sample data across the image boundary. This eliminates any discontinuities which are present when processing omnidirectional images with a regular convolutional neural network and improves recognition of objects which are split at the sides of an equirectangular image representation or which are positioned very close to the poles, see Fig. 3.

Besides encoding distortion invariance into the filters of the convolutional neural network, SphereNet also enables the network to wrap its sampling locations around the sphere. Since SphereNet uses custom sampling locations for inputs and intermediate feature maps, it is straightforward to let a filter sample data across the image boundary. This removes the discontinuities that arise when omnidirectional images are processed with a regular CNN, and improves recognition of objects that are split at the sides of the equirectangular representation or lie very close to the poles; see Fig. 3.

 

Fig. 3: Sampling Locations. This figure compares the sampling locations of SphereNet (red) to the sampling locations of a regular CNN (blue) at the boundaries of the equirectangular image. Note how the SphereNet kernel automatically wraps at the left image boundary (a) while correctly representing the discontinuities and distortions at the pole (b). SphereNet thereby retains the original spherical image connectivity which is discarded in a regular convolutional neural network that utilizes zero-padding along the image boundaries.
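The boundary wrapping of Fig. 3a comes almost for free once the sampling locations live on the sphere: mapping (φ, θ) back to equirectangular pixel coordinates only needs a modulo in the longitude. A sketch of such a helper (our own naming and pixel-grid convention, assuming an H × W equirectangular image):

```python
import numpy as np

def sphere_to_pixel(phi, theta, height, width):
    """Map spherical coordinates (phi in [-pi/2, pi/2], theta in [-pi, pi])
    to continuous (row, col) pixel coordinates in an equirectangular image."""
    row = (0.5 - phi / np.pi) * height - 0.5
    col = ((theta + np.pi) / (2.0 * np.pi)) * width - 0.5
    col = np.mod(col, width)  # wrap across the left/right boundary (Fig. 3a)
    # No wrapping is needed for rows: phi from the inverse gnomonic projection
    # always lies in [-pi/2, pi/2]; a kernel crossing a pole instead shows up
    # as theta values shifted by roughly pi (Fig. 3b).
    return row, col
```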

 

By changing the sampling locations of the convolutional kernels while keeping their size unchanged, our model additionally enables the transfer of CNN models between different image representations. In our experimental evaluation, we demonstrate how an object detector trained on perspective images can be successfully applied to the omnidirectional case. Note that our method can be used for adapting almost any existing deep learning architecture from perspective images to the omnidirectional setup. In general, our SphereNet framework can be applied as long as the image can be mapped to the unit sphere. This is true for many imaging models, ranging from perspective over fisheye to omnidirectional models. Thus, SphereNet can be seen as a generalization of regular CNNs which encodes the camera geometry into the network architecture.

By changing the sampling locations of the convolutional kernels while keeping their size unchanged, the model also enables transferring CNN models between different image representations.

Note: the method can adapt almost any existing deep architecture from perspective images to the omnidirectional setup. In general, the SphereNet framework applies whenever the image can be mapped onto the unit sphere, which holds for many imaging models, from perspective over fisheye to omnidirectional. SphereNet can thus be seen as a generalization of regular CNNs that encodes the camera geometry into the network architecture.

 

  • Implementation:

As the sampling locations are fixed according to the geometry of the spherical image representation, they can be precomputed for each kernel location at every layer of the network. Further, their relative positioning is constant in each image row. Therefore, it is sufficient to calculate and store the sampling locations once per row and then translate them. We store the sampling locations in look-up tables. These look-up tables are used in a customized convolution operation which is based on highly optimized general matrix multiply (GEMM) functions [13]. As the sampling locations are real-valued, interpolation of the input feature maps is required. In our experiments, we compare nearest neighbor interpolation to bilinear interpolation. For an arbitrary sampling location (p_x, p_y) in a feature map f, interpolation is defined as:

f(p_x, p_y) = \sum_m \sum_n f(m, n) \, g(p_x, m) \, g(p_y, n)    (12)

with a bilinear interpolation kernel:

g(a, b) = \max(0, 1 - |a - b|)    (13)

or a nearest neighbor kernel:

g(a, b) = \delta(\lfloor a + 0.5 \rfloor - b)    (14)

where δ(·) is the Kronecker delta function.

Because the sampling locations are fixed by the geometry of the spherical image representation, they can be precomputed for every kernel location at every layer of the network. Moreover, their relative positions are constant within each image row, so it suffices to compute and store them once per row and then translate them; the sampling locations are stored in look-up tables.

These look-up tables drive a customized convolution operation built on highly optimized general matrix multiply (GEMM) functions [13]. Since the sampling locations are real-valued, the input feature maps must be interpolated; the paper compares nearest neighbor and bilinear interpolation. For an arbitrary sampling location (p_x, p_y) in a feature map f, interpolation is defined by Eq. (12).
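Putting the pieces together, the per-row precomputation and the bilinear variant of Eq. (12) could be sketched as follows (illustrative code only, reusing kernel_sampling_pattern and sphere_to_pixel from the sketches above; the actual implementation is a custom GEMM-based convolution [13]):

```python
import numpy as np

H, W = 256, 512                            # equirectangular feature map size
d_phi, d_theta = np.pi / H, 2 * np.pi / W  # angular step sizes

# Look-up table: sampling locations are computed once per row (at column 0)
# and merely translated in longitude for all remaining columns.
table = {}
for r in range(H):
    phi_c = np.pi / 2 - (r + 0.5) * d_phi   # latitude of row r
    theta_c = -np.pi + 0.5 * d_theta        # longitude of column 0
    phi, theta = kernel_sampling_pattern(phi_c, theta_c, d_phi, d_theta)
    table[r] = sphere_to_pixel(phi, theta, H, W)

def bilinear_sample(f, py, px):
    """Interpolate feature map f at real-valued (py, px): Eq. (12) with the
    bilinear kernel of Eq. (13)."""
    h, w = f.shape
    y0, x0 = np.floor(py).astype(int), np.floor(px).astype(int)
    wy, wx = py - y0, px - x0
    y0c, y1c = np.clip(y0, 0, h - 1), np.clip(y0 + 1, 0, h - 1)
    x0w, x1w = np.mod(x0, w), np.mod(x0 + 1, w)  # longitude wraps, latitude clamps
    return ((1 - wy) * (1 - wx) * f[y0c, x0w] + (1 - wy) * wx * f[y0c, x1w]
            + wy * (1 - wx) * f[y1c, x0w] + wy * wx * f[y1c, x1w])

def gather_patch(f, r, c):
    """Collect the 3x3 input patch feeding output location (r, c)."""
    py, px = table[r]
    return bilinear_sample(f, py, np.mod(px + c, W))
```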

 

Summary

The difference between Equirectangular Convolutions and SphereNet:

The former unrolls the standard convolution defined in spherical coordinates onto the equirectangular plane;

The latter unrolls it onto the tangent plane, so it can be applied to coordinates obtained from any surface transformation of the sphere, because every such surface admits a tangent plane.

 

 

 
