PoseNet Paper Summary

A summary of the paper "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization".

Paper: https://download.csdn.net/download/weixin_44682965/12719622
Code (PyTorch implementation): https://download.csdn.net/download/weixin_44682965/12719620
Dataset: http://mi.eng.cam.ac.uk/projects/relocalisation/

1. Introduction

A few examples:
top: the original image
middle: the image rendered from the predicted camera pose
bottom: the original and rendered images overlaid
[Figure 1]

The algorithm is simple in the fact that it consists of a convolutional neural 
network (convnet) trained end-to-end to regress the camera’s orientation and position.

Deep convolutional camera pose regressor: the algorithm trains a CNN whose input is a 224x224 RGB image captured by the authors and whose output is the camera's pose.

Main contribution 1:

 We leverage transfer learning from recognition to relocalization with very large 
 scale classification datasets. 

Transfer learning from recognition to relocalization: a classifier is pretrained on very large scale classification datasets and then trained into a regressor.

Additionally we use structure from motion to automatically generate training labels
(camera poses) from a video of the scene. This reduces the human labor in creating
labeled video datasets to just recording the video.

SfM is used to automatically generate training labels (camera poses) from video, greatly reducing the effort of labeling data.

Main contribution 2:

Our second main contribution is towards understanding the representations that this 
convnet generates. We show that the system learns to compute feature vectors which
are easily mapped to pose, and which also generalize to unseen scenes with a few 
additional training samples

Understanding the representations the convnet generates: the system learns to compute feature vectors that are easily mapped to pose and that also generalize to unseen scenes with a few additional training samples.

 We introduce a new framework for localization which removes several issues faced by
 typical SLAM pipelines, such as the need to store densely spaced keyframes, the 
 need to maintain separate mechanisms for appearance-based localization and 
 landmark-based pose estimation, and a need to establish frame-to-frame feature 
 correspondence. 

The framework proposed in the paper removes several issues faced by typical SLAM pipelines: the need to store densely spaced keyframes, the need to maintain separate mechanisms for appearance-based localization and landmark-based pose estimation, and the need to establish frame-to-frame feature correspondences.

We do this by mapping monocular images to a high-dimensional representation that is 
robust to nuisance variables. We empirically show that this representation is a 
smoothly varying injective (one-to-one) function of pose, allowing us to regress 
pose directly from the image without need of tracking.

The framework achieves this by mapping monocular images to a high-dimensional representation that is robust to nuisance variables. This representation is a smoothly varying injective (one-to-one) function of pose, so pose can be regressed directly from the image without tracking.

2. Related work

Metric SLAM:

 Metric SLAM localizes a mobile robot by focusing on creating a sparse [13, 11] or 
 dense[16, 7] map of the environment. Metric SLAM estimates the camera’s continuous 
 pose, given a good initial pose estimate. 

Metric SLAM localizes by building a sparse or dense map of the environment and estimates the camera's continuous pose, given a good initial pose estimate.

Appearance-based localization provides this coarse estimate by classifying the scene 
among a limited number of discrete locations. Scalable appearance-based localizers 
have been proposed such as [4] which uses SIFT features [15] in a bag of words 
approach to probabilistically recognize previously viewed scenery. Convnets have 
also been used to classify a scene into one of several location labels 

Appearance-based localization provides a coarse estimate by classifying the scene among a limited number of discrete locations. Scalable appearance-based localizers have been proposed, e.g. using SIFT features in a bag-of-words approach to probabilistically recognize previously viewed scenery.

Convnets have also been used to classify a scene into one of several location labels.

Our approach combines the strengths of these approaches: it does not need an initial 
pose estimate, and produces a continuous pose. Note we do not build a map, rather we 
train a neural network, whose size, unlike a map, does not require memory linearly 
proportional to the size of the scene (see fig. 13).

The paper's approach combines the strengths of both: it produces a continuous pose without needing an initial pose estimate. It also builds no map; instead it trains a neural network whose size, unlike a map's, does not require memory linearly proportional to the size of the scene.

Our work most closely follows from the Scene Coordinate Regression Forests for 
relocalization proposed in [20]. This algorithm uses depth images to create scene 
coordinate labels which map each pixel from camera coordinates to global scene 
coordinates. This was then used to train a regression forest to regress these labels 
and localize the camera. However, unlike our approach, this algorithm is limited to 
RGB-D images to generate the scene coordinate label, in practice constraining its 
use to indoor scenes. 

The paper's method most closely follows the Scene Coordinate Regression Forests for relocalization. That algorithm uses depth images to create scene coordinate labels that map each pixel from camera coordinates to global scene coordinates, then trains a regression forest to regress these labels and localize the camera. Unlike the paper's approach, however, it relies on RGB-D images to generate the scene coordinate labels, in practice constraining its use to indoor scenes.

3. Model for deep regression of camera pose

Output: the pose p, composed of position x and orientation q (a quaternion):
p = [x, q]

3.1. Simultaneously learning location and orientation

To regress pose, the paper trains the convnet with stochastic gradient descent on a Euclidean loss:

$$\operatorname{loss}(I) = \lVert \hat{x} - x \rVert_2 + \beta \left\lVert \hat{q} - \frac{q}{\lVert q \rVert} \right\rVert_2$$

where β is a scale factor chosen to keep the expected position and orientation errors approximately equal.

We found that training individual networks to regress position and orientation separately performed 
poorly compared to when they were trained with full 6-DOF pose labels (fig. 2). With just position, or 
just orientation information, the convnet was not as effectively able to determine the function 
representing camera pose. We also experimented with branching the network lower down into two separate 
components to regress position and orientation. However, we found that it too was less effective, for 
similar reasons: separating into distinct position and orientation regressors denies each the information 
necessary to factor out orientation from position, or vice versa.

Training individual networks to regress position and orientation separately performed poorly compared to training with full 6-DOF pose labels (fig. 2): with only position or only orientation information, the convnet could not determine the function representing camera pose as effectively. Branching the network lower down into two separate components to regress position and orientation was also tried, but it too was less effective, for similar reasons: separating into distinct position and orientation regressors denies each the information needed to factor out orientation from position, or vice versa.
[Figure 2: Relative performance of position and orientation regression on a single convnet with a range of scale factors for an indoor scene, Chess. Learning with the optimum scale factor leads the convnet to uncover a more accurate pose function.]

β balances the orientation and translation errors. The optimal β is given by the ratio of orientation to translation error at the end of training, not the ratio expected at the start. For outdoor scenes β is larger, since translation errors tend to be relatively larger. Accordingly, β was fine-tuned with grid search: between 120 and 750 for indoor scenes, and between 250 and 2000 for outdoor scenes.
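As a concrete illustration, here is a minimal sketch of this loss in PyTorch (the paper's implementation used Caffe; the function name and the default β value are illustrative):

```python
import torch

def pose_loss(x_pred, q_pred, x_gt, q_gt, beta=500.0):
    """loss(I) = ||x_hat - x||_2 + beta * ||q_hat - q/||q|| ||_2"""
    position_err = torch.norm(x_pred - x_gt, dim=1)
    # normalize the ground-truth quaternion to unit length
    q_gt_unit = q_gt / torch.norm(q_gt, dim=1, keepdim=True)
    orientation_err = torch.norm(q_pred - q_gt_unit, dim=1)
    return (position_err + beta * orientation_err).mean()
```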

We found it was important to randomly initialize the final position regressor layer so that the norm of 
the weights corresponding to each position dimension was proportional to that dimension’s spatial extent. 

The paper found it important to randomly initialize the final position regressor layer so that the norm of the weights corresponding to each position dimension is proportional to that dimension's spatial extent.
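The paper gives no code for this step; a hedged sketch of one way it could be done in PyTorch, with hypothetical scene extents, might be:

```python
import torch
import torch.nn as nn

def init_position_regressor(layer: nn.Linear, extents):
    """Scale each output row so its weight norm matches that dimension's spatial extent."""
    with torch.no_grad():
        for dim, extent in enumerate(extents):
            row = layer.weight[dim]
            row.mul_(extent / row.norm())  # row `dim` ends up with norm == `extent`

regressor = nn.Linear(2048, 3)                            # final position regressor
init_position_regressor(regressor, (100.0, 80.0, 10.0))   # hypothetical x/y/z extents in metres
```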

Classification problems have a training example for every category. This is not possible for regression as the output is continuous and infinite. Furthermore, other convnets that have been used for regression operate off very large datasets [25, 19]. For localization regression to work off limited data we leverage the powerful representations learned off these large classification datasets by pretraining the weights on these datasets.

Classification problems have a training example for every category; this is impossible for regression, since the output is continuous and infinite. For localization regression to work with limited data, the weights are pretrained on large classification datasets so as to leverage the powerful representations learned there.

3.2. Architecture

For the experiments in this paper we use a state of the art deep neural network architecture for 
classification, GoogLeNet [24], as a basis for developing our pose regression network. GoogLeNet is a 22 
layer convolutional network with six ‘inception modules’ and two additional intermediate classifiers 
which are discarded at test time. Our model is a slightly modified version of GoogLeNet with 23 layers
(counting only the layers with trainable parameters).
We modified GoogLeNet as follows:
• Replace all three softmax classifiers with affine regressors. The softmax layers were removed and each 
final fully connected layer was modified to output a pose vector of 7-dimensions representing position (3)
and orientation (4).
• Insert another fully connected layer before the final regressor of feature size 2048. This was to form 
a localization feature vector which may then be explored for generalisation.
• At test time we also normalize the quaternion orientation vector to unit length.

The paper bases its pose regression network on GoogLeNet, a 22-layer convolutional network with six 'inception modules' and two additional intermediate classifiers (discarded at test time). The model is a slightly modified version of GoogLeNet with 23 layers (counting only layers with trainable parameters).
GoogLeNet is modified as follows (see the sketch after this list):

  1. Replace all three softmax classifiers with affine regressors (the matrix product computed in a network's forward pass is known in geometry as an affine transformation). The softmax layers are removed, and each final fully connected layer is modified to output a 7-dimensional pose vector: position (3 dimensions) plus orientation (4 dimensions).
  2. Insert another fully connected layer of feature size 2048 before the final regressor, forming a localization feature vector that can then be explored for generalization.
  3. At test time, normalize the quaternion orientation vector to unit length.
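A minimal PyTorch sketch of these modifications, assuming torchvision's GoogLeNet as the backbone (the original used Caffe; this sketch models only the final head and omits converting the two intermediate classifiers):

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

class PoseNet(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()
        backbone = googlenet(weights="DEFAULT" if pretrained else None,
                             aux_logits=False)   # aux classifiers are stripped
        backbone.fc = nn.Identity()              # drop the 1000-way softmax classifier
        self.backbone = backbone                 # yields a 1024-d feature
        self.localizer = nn.Linear(1024, 2048)   # inserted FC layer: localization feature
        self.regressor = nn.Linear(2048, 7)      # 3-d position + 4-d quaternion

    def forward(self, x):
        feat = torch.relu(self.localizer(self.backbone(x)))
        p = self.regressor(feat)
        x_pred, q_pred = p[:, :3], p[:, 3:]
        if not self.training:                    # test time: unit-length quaternion
            q_pred = q_pred / q_pred.norm(dim=1, keepdim=True)
        return x_pred, q_pred
```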
We rescaled the input image so that the smallest dimension was 256 pixels before cropping to the 224x224 
pixel input to the GoogLeNet convnet. The convnet was trained on random crops (which do not affect the 
camera pose). At test time we evaluate it with both a single center crop and also densely with 128 
uniformly spaced crops of the input image, averaging the resulting pose vectors. With parallel GPU 
processing, this results in a computational time increase from 5ms to 95ms per image.

Input images are rescaled so that the smallest side is 256 pixels, then cropped to the 224x224 input of the GoogLeNet convnet. The convnet is trained on random crops (which do not affect the camera pose). At test time it is evaluated both with a single center crop and densely with 128 uniformly spaced crops of the input image, averaging the resulting pose vectors; with parallel GPU processing this raises computation from 5ms to 95ms per image.
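A sketch of this preprocessing using torchvision transforms (the 128-crop dense evaluation and the per-scene mean subtraction mentioned later are omitted):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),        # smallest side -> 256 px
    transforms.RandomCrop(224),    # random crops do not change the camera pose
    transforms.ToTensor(),
])

test_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),    # the single-center-crop variant
    transforms.ToTensor(),
])
```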

We experimented with rescaling the original image to different sizes before cropping for training and 
testing. Scaling up the input is equivalent to cropping the input before downsampling to 256 pixels on one 
side. This increases the spatial resolution of the input pixels. We found that this does not increase the 
localization performance, indicating that context and field of view is more important than resolution for 
relocalization.

Experiments rescaled the original image to different sizes before cropping for training and testing. Scaling up the input is equivalent to cropping it before downsampling to 256 pixels on one side, which increases the spatial resolution of the input pixels. This did not improve localization performance, indicating that context and field of view matter more than resolution for relocalization.

The PoseNet model was implemented using the Caffe library [10]. It was trained using stochastic gradient 
descent with a base learning rate of 10⁻⁵, reduced by 90% every 80 epochs and with momentum of 0.9. 
Using one half of a dual-GPU card (NVidia Titan Black), training took an hour using a batch size of 75. 
For reasons of time, we did not explore multi-GPU training, although it is reasonable to expect better 
results from using double the throughput and memory. We subtracted a separate image mean for each scene 
as we found this to improve experimental performance.

The network is trained with stochastic gradient descent: base learning rate 10⁻⁵, reduced by 90% every 80 epochs, momentum 0.9, batch size 75 (about an hour on half of a dual-GPU NVidia Titan Black card). A separate image mean is subtracted for each scene, which was found to improve experimental performance.
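Putting the reported hyperparameters together, a hypothetical PyTorch training loop might look like this (it reuses the `PoseNet` and `pose_loss` sketches above; the epoch count and the data loader are assumptions):

```python
import torch

def train_posenet(model, loader, epochs=160, beta=500.0):
    """loader is assumed to yield (images, x_gt, q_gt) batches of size 75."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
    # reduce the learning rate by 90% every 80 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)
    for _ in range(epochs):
        for images, x_gt, q_gt in loader:
            optimizer.zero_grad()
            x_pred, q_pred = model(images)
            loss = pose_loss(x_pred, q_pred, x_gt, q_gt, beta)
            loss.backward()
            optimizer.step()
        scheduler.step()
```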

4. Dataset

The paper uses SfM to automatically generate scene labels, reducing the effort of building the dataset.
Outdoor data: Cambridge Landmarks, an outdoor urban localization dataset with 5 scenes.

As shown in the table:

[Table: Cambridge Landmarks dataset statistics]

Significant urban clutter such as pedestrians and vehicles appears in the dataset, and the data were collected at many different points in time, representing different lighting and weather conditions. Training and test images are taken from distinct walking paths rather than sampled from the same trajectory, which makes the regression challenging, as shown in fig. 3.
[Figure 3]

The dataset was generated using structure from motion techniques [28] which we use as ground truth 
measurements for this paper. A Google LG Nexus 5 smartphone was used by a pedestrian to take high 
definition video around each scene. This video was subsampled in time at 2Hz to generate images to input 
to the SfM pipeline. There is a spacing of about 1m between each camera position.
To test on indoor scenes we use the publicly available 7 Scenes dataset [20], with scenes shown in 
fig. 5. This dataset contains significant variation in camera height and was designed for RGB-D 
relocalization. It is extremely challenging for purely visual relocalization using SIFT-like features, as 
it contains many ambiguous textureless features.

The dataset was generated with structure from motion, which serves as the paper's ground truth. A pedestrian used a Google LG Nexus 5 smartphone to take high-definition video around each scene; the video was subsampled in time at 2Hz to generate images for the SfM pipeline, with a spacing of about 1m between camera positions.
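For illustration only (this is not part of the paper's pipeline code), temporal subsampling at 2Hz could be done with OpenCV as follows; the video file name is hypothetical:

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("scene_walk.mp4")   # hypothetical input video
fps = cap.get(cv2.CAP_PROP_FPS)
step = max(1, int(round(fps / 2.0)))       # keep roughly 2 frames per second
idx = kept = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:                    # every `step`-th frame goes to SfM
        cv2.imwrite(f"frames/{kept:06d}.png", frame)
        kept += 1
    idx += 1
cap.release()
```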

Overhead views:
[Figure 4]

[Figure]

For indoor tests the publicly available 7 Scenes dataset is used, with scenes shown in fig. 5. This dataset contains significant variation in camera height and was designed for RGB-D relocalization; it is extremely challenging for purely visual relocalization using SIFT-like features, as it contains many ambiguous textureless features.

5. Experiments

To validate that the convnet is regressing pose beyond that of the training examples we show the 
performance for finding the nearest neighbour representation in the training data from the feature vector 
produced by the localization convnet. As our performance exceeds this we conclude that the convnet is 
successfully able to regress pose beyond training examples.

Since the convnet's performance exceeds that of nearest-neighbour matching against the training data, the authors conclude that it successfully regresses pose beyond the training examples.

We also compare our algorithm to the RGB-D SCoRe Forest algorithm [20]. Fig. 7 shows cumulative 
histograms of localization error for two indoor and two outdoor scenes. We note that although the SCoRe
forest is generally more accurate, it requires depth information, and uses higher-resolution imagery. The 
indoor dataset contains many ambiguous and textureless features which make relocalization without this 
depth modality extremely difficult. We note our method often localizes the most difficult testing frames, 
above the 95th percentile, more accurately than SCoRe across all the scenes. We also observe that dense 
cropping only gives a modest improvement in performance. It is most important in scenes with significant 
clutter like pedestrians and cars, for example King’s College, Shop Façade and St Mary’s Church.

The algorithm is also compared with the RGB-D SCoRe Forest algorithm. Fig. 7 shows cumulative histograms of localization error for two indoor and two outdoor scenes. Although SCoRe Forest is generally more accurate, it requires depth information and uses higher-resolution imagery; the indoor dataset contains many ambiguous and textureless features that make relocalization without the depth modality extremely difficult. Across all scenes, the paper's method often localizes the most difficult test frames (above the 95th percentile) more accurately than SCoRe. Dense cropping gives only a modest improvement in performance; it matters most in scenes with significant clutter such as pedestrians and cars, e.g. King's College, Shop Façade and St Mary's Church.

[Figure 7]

We explored the robustness of this method beyond what was tested in the dataset with additional images 
from dusk, rain, fog, night and with motion blur and different cameras with unknown intrinsics. Fig. 8 
shows the convnet generally handles these challenges well. SfM with SIFT fails in all these cases so we 
were not able to generate a ground truth camera pose, however we infer the accuracy by viewing the 3D 
reconstruction from the predicted camera pose, and overlaying this onto the input image.

Robustness was explored beyond the test set with additional images taken at dusk, in rain, in fog, at night, with motion blur, and with different cameras of unknown intrinsics. Fig. 8 shows the convnet generally handles these challenges well. SfM with SIFT fails in all these cases, so no ground-truth pose could be generated; accuracy is instead inferred by rendering the 3D reconstruction from the predicted camera pose and overlaying it on the input image.
[Figure 8]

5.1. Robustness against training image spacing

We demonstrate in fig. 9 that, for an outdoor scale scene, we gain little by spacing the training images 
more closely than 4m. The system is robust to very large spatial separation between training images, 
achieving reasonable performance even with only a few dozen training samples. The pose accuracy 
deteriorates gracefully with increased training image spacing, whereas SIFT-based SfM sharply fails after 
a certain threshold as it requires a small baseline.

Fig. 9 shows that, for an outdoor-scale scene, spacing the training images more closely than 4m gains little. The system is robust to very large spatial separation between training images, achieving reasonable performance even with only a few dozen training samples. Pose accuracy degrades gracefully as training image spacing increases, whereas SIFT-based SfM fails sharply past a certain threshold because it requires a small baseline.

[Figure 9]

5.2. Importance of transfer learning

The convnet's need for large amounts of training data is avoided by pretraining on large datasets such as ImageNet and Places. Fig. 10 shows how transfer learning can be used effectively between classification and the complex regression task.

here we demonstrate transfer learning from classification to the qualitatively different task of pose 
regression. It is not immediately obvious that a network trained to output pose-invariant classification 
labels would be suitable as a starting point for a pose regressor. We find, however, that this is not a 
problem in practice. A possible explanation is that, in order for its output to be invariant to pose, the 
classifier network must keep track of pose, to better factor its effects away from identity cues. This 
would agree with our own findings that a network trained to output position and orientation outperforms a 
network trained to output only position. By preserving orientation information in the intermediate 
representations, it is better able to factor the effects of orientation out of the final position 
estimation. Transfer learning gives not only a large improvement in training speed, but also end 
performance.

This demonstrates transfer learning from classification to the qualitatively different task of pose regression. It is not immediately obvious that a network trained to output pose-invariant classification labels would be a suitable starting point for a pose regressor, but in practice this is not a problem. A possible explanation is that, for its output to be invariant to pose, the classifier network must keep track of pose in order to factor its effects away from identity cues. This agrees with the finding that a network trained to output position and orientation outperforms one trained to output position only: by preserving orientation information in the intermediate representations, it can better factor orientation out of the final position estimate. Transfer learning improves not only training speed but also end performance.

[Figure 10]

5.3. Visualising features relevant to pose

Saliency map: the magnitude of the gradient of the loss function with respect to the pixel intensities. The sensitivity of pose to each pixel is used as an indicator of how important the convnet considers different parts of the image to be.
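A minimal sketch of computing such a saliency map in PyTorch, reusing the hypothetical `PoseNet` and `pose_loss` sketches from earlier (not the authors' code):

```python
import torch

def saliency_map(model, image, x_gt, q_gt, beta=500.0):
    """Per-pixel magnitude of the pose-loss gradient for a single (C, H, W) image."""
    model.eval()
    image = image.clone().requires_grad_(True)   # track gradients w.r.t. pixels
    x_pred, q_pred = model(image.unsqueeze(0))
    loss = pose_loss(x_pred, q_pred, x_gt.unsqueeze(0), q_gt.unsqueeze(0), beta)
    loss.backward()
    return image.grad.abs().max(dim=0).values    # max over channels -> (H, W) map
```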

These results show that the strongest response is observed from higher-level features such as windows and 
spires. However a more surprising result is that PoseNet is also very sensitive to large textureless 
patches such as road, grass and sky. These textureless patches may be more informative than the highest 
responding points because the effect of a group of pixels on the pose variable is the sum of the saliency 
map values over that group of pixels. This evidence points to the net being able to localize off 
information from these textureless surfaces, something which interest-point based features such as SIFT 
or SURF fail to do. The last observation is that PoseNet has an attenuated response to people and other
noisy objects, effectively masking them. These objects are dynamic, and the convnet has identified them 
as not appropriate for localization.

The strongest responses are observed from higher-level features (such as windows and spires). More surprisingly, PoseNet is also very sensitive to large textureless patches such as road, grass and sky. These patches may carry more information than the highest-responding points, because the effect of a group of pixels on the pose variable is the sum of the saliency map values over that group. This suggests the network can localize using information from textureless surfaces, which interest-point features such as SIFT or SURF cannot do. A final observation is that PoseNet's response to people and other noisy objects is attenuated, effectively masking them: these objects are dynamic, and the convnet has identified them as unsuitable for localization.

[Figure]

5.4. Viewing the internal representation

t-SNE: an algorithm that embeds high-dimensional data into a low-dimensional space while approximately preserving the distance structure; it is commonly used to visualize high-dimensional feature vectors in two dimensions.
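For example, with scikit-learn's TSNE (the feature array here is a random placeholder standing in for real 2048-d convnet features):

```python
import numpy as np
from sklearn.manifold import TSNE

feats = np.random.randn(500, 2048)                    # placeholder feature vectors
embedded = TSNE(n_components=2).fit_transform(feats)  # (500, 2) points to plot
```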

we apply t-SNE to the feature vectors computed from a sequence of video frames taken by a pedestrian. As 
these figures show, the feature vectors are a function that smoothly varies with, and is largely 
one-to-one with, pose. This ‘pose manifold’ can be observed not only on networks trained on other scenes, 
but also networks trained on classification image sets without pose labels. This further suggests that 
classification convnets preserve pose information up to the final layer, regardless of whether it’s 
expressed in the output. However, the mapping from feature vector to pose becomes more complicated for 
networks not trained on pose data. Furthermore, as this manifold exists on scenes that the convnet was not
trained on, the convnet must learn some generic representation of the relationship between landmarks, 
geometry and camera motion. This demonstrates that the feature vector that is produced from regression is
able to generalize to other tasks in the same way as classification convnets.

t-SNE is applied to feature vectors computed from a sequence of video frames taken by a pedestrian. The feature vectors turn out to be a smoothly varying and largely one-to-one function of pose. This 'pose manifold' can be observed not only in networks trained on other scenes, but also in networks trained on classification image sets without pose labels, further suggesting that classification convnets preserve pose information up to the final layer regardless of whether it is expressed in the output. For networks not trained on pose data, however, the mapping from feature vector to pose becomes more complicated. Moreover, since this manifold exists for scenes the convnet was never trained on, the convnet must learn some generic representation of the relationship between landmarks, geometry and camera motion. This demonstrates that the feature vectors produced by regression generalize to other tasks in the same way as those of classification convnets.

[Figure]

5.5. System efficiency

Our network is very scalable, as it only takes 50 MB to store the weights, and 5ms to compute each pose, 
compared to the gigabytes and minutes for metric localization with SIFT. These values are independent of 
the number of training samples in the system while metric localization scales O(n²) with training data 
size [28]. For comparison, matching to the convnet nearest neighbour is also shown. This requires storing 
feature vectors for each training frame, then perform a linear search to find the nearest neighbour for a
 given test frame.

The PoseNet network is highly scalable: it takes only 50 MB to store the weights and 5ms to compute each pose, compared to gigabytes and minutes for metric localization with SIFT, and these values are independent of the number of training samples, whereas metric localization scales O(n²) with training data size. For comparison, matching to the convnet's nearest neighbour is also shown; this requires storing a feature vector for each training frame and performing a linear search to find the nearest neighbour of a given test frame.
[Figure 13]
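A sketch of that nearest-neighbour baseline as described (a plain linear search; the names are illustrative):

```python
import numpy as np

def nearest_neighbour_pose(test_feat, train_feats, train_poses):
    """train_feats: (N, D) stored features; train_poses: (N, 7) [x, q] labels."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)  # linear search, O(N)
    return train_poses[np.argmin(dists)]
```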

6. Conclusions

We present, to our knowledge, the first application of deep convolutional neural networks to end-to-end 
6-DOF camera pose localization. We have demonstrated that one can sidestep the need for millions of 
training images by use of transfer learning from networks trained as classifiers. We showed that such 
networks preserve ample pose information in their feature vectors, despite being trained to produce 
pose-invariant outputs. Our method tolerates large baselines that cause SIFT-based localizers to fail 
sharply.

It has been demonstrated that transfer learning from networks trained as classifiers sidesteps the need for millions of training images, and that such networks preserve ample pose information in their feature vectors despite being trained to produce pose-invariant outputs.
The method tolerates large baselines that cause SIFT-based localizers to fail sharply.
Future work aims to further exploit multi-view geometry as a source of training data for deep pose regressors, and to explore probabilistic extensions of the algorithm.

Code run results:
[Figure]
