[Localization Paper Reading Series] InLoc: Indoor Visual Localization with Dense Matching and View Synthesis (Part 2)

Paragraph 2 (matching performance of the proposed method)

We first find geometrically consistent sets of correspondences using the coarser conv5 layer containing high-level information.

Then we refine the correspondences by searching for additional matches on the conv3 layer.

Examples in figure 4 demonstrate that our dense CNN matching (4th column) obtains better matches in indoor environments when compared to matching standard local features (3rd column), even for less-textured areas.
[Figure 4: Qualitative comparison of different localization methods (columns). From top to bottom: query image, best-matching database image, view synthesized from the estimated pose (without / with additional interpolation), error map between the query and the synthesized view, and localization error (meters, degrees). Green points are inlier matches obtained by P3P-LO-RANSAC. Methods using the proposed dense pose estimation (DensePE) and dense pose verification (DensePV) are shown in bold. The queries in columns 2, 4, and 6 are localized within 1.0 meter and 5.0 degrees, whereas columns 1, 3, and 5 are localized incorrectly.]

Notice that dense-feature extraction and description requires no additional computation at query time, as the intermediate convolutional layers are already computed when extracting the NetVLAD descriptors as described in section 4.1.

As will also be demonstrated in section 5, memory requirements and computational speed of feature matching can be addressed by binarizing the convolutional features without loss in matching performance.

As perspective images in our database have depth values, and hence associated 3D points, the query camera pose can be estimated by finding pixel-to-pixel correspondences between the query and the matching database image followed by P3P-RANSAC [25].

4.3. Pose verification with view synthesis

We propose here to collect both positive and negative evidence to determine what is and is not matched.

This is achieved by harnessing the power of the high-quality RGBD image database that provides a dense and accurate 3D structure of the indoor environment.

This structure is used to render a virtual view that shows how the scene would look from the estimated query pose.

The rendered image enables us to gather, in a pixel-wise manner, both positive and negative evidence by counting which regions are and are not consistent between the query image and the underlying 3D structure.

To gain invariance to illumination changes and small misalignments, we evaluate image similarity by comparing local patch descriptors (DenseRootSIFT [7, 41]) at corresponding pixel locations.

The final similarity is computed as the median of descriptor distances across the entire image, ignoring areas with missing 3D structure.
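
To make the verification step concrete, here is a minimal sketch of such a DensePV-style similarity score. It assumes grayscale `query_gray` and `render_gray` images of identical size plus a boolean `valid_mask` marking pixels with underlying 3D structure; OpenCV's SIFT evaluated on a fixed grid stands in for the VLFeat DenseSIFT/RootSIFT extractor used in the paper.

```python
import cv2
import numpy as np

def rootsift_on_grid(gray, step=8, size=16.0):
    """RootSIFT descriptors on a regular pixel grid (DenseSIFT stand-in)."""
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(step, h - step, step)
           for x in range(step, w - step, step)]
    kps, desc = cv2.SIFT_create().compute(gray, kps)  # gray must be uint8
    # RootSIFT: L1-normalize, then take the element-wise square root.
    desc /= desc.sum(axis=1, keepdims=True) + 1e-12
    return kps, np.sqrt(desc)

def densepv_score(query_gray, render_gray, valid_mask, step=8):
    """Median descriptor distance between the query and the rendered view,
    ignoring grid locations without 3D structure. Lower = more consistent."""
    kps_q, dq = rootsift_on_grid(query_gray, step)
    kps_r, dr = rootsift_on_grid(render_gray, step)
    assert len(kps_q) == len(kps_r)  # same grid on equally sized images
    keep = np.array([valid_mask[int(k.pt[1]), int(k.pt[0])] for k in kps_q])
    dists = np.linalg.norm(dq[keep] - dr[keep], axis=1)
    return float(np.median(dists))
```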

5. Experiments

We first describe the experimental setup for evaluating visual localization performance using our dataset (Section 5.1).
The proposed method, termed "InLoc", is compared with state-of-the-art methods (Section 5.2) and we show the benefits of each component in detail (Section 5.3).

5.1. Implementation details

In the candidate pose retrieval step, we retrieve 100 candidate database images using NetVLAD.

We use the implementation provided by the authors and the pre-trained Pitts30K [6] VGG-16 [61] model to generate 4,096-dimensional NetVLAD descriptor vectors.
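
As an illustration, retrieving the top-100 candidates then reduces to a nearest-neighbour search over the 4,096-dimensional descriptors. A minimal sketch (the `.npy` file names are hypothetical; NetVLAD descriptors are L2-normalized, so the inner product equals cosine similarity):

```python
import numpy as np

db = np.load("db_netvlad.npy")    # (N, 4096) database descriptors, L2-normalized
q = np.load("query_netvlad.npy")  # (4096,) query descriptor, L2-normalized

sims = db @ q                     # cosine similarity to every database image
top100 = np.argsort(-sims)[:100]  # indices of the 100 candidate database images
```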

In the second pose estimation step, we obtain tentative correspondences by matching densely extracted convolutional features in a coarse-to-fine manner: we first find mutually nearest matches among the conv5 features and then find matches in the finer conv3 features restricted by the coarse conv5 correspondences.
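
The core of this step is mutual nearest-neighbour matching between dense descriptor grids. The sketch below shows the coarse conv5 stage with random stand-in features (the shapes are illustrative; the fine conv3 stage, which restricts the search to matched conv5 cells, is only indicated in comments):

```python
import numpy as np

def mutual_nn(desc_a, desc_b):
    """Pairs (i, j) such that j is the nearest neighbour of i and vice versa."""
    sim = desc_a @ desc_b.T                   # assumes L2-normalized rows
    ab = sim.argmax(axis=1)                   # best b for each a
    ba = sim.argmax(axis=0)                   # best a for each b
    keep = ba[ab] == np.arange(len(desc_a))   # mutual-consistency check
    return np.stack([np.nonzero(keep)[0], ab[keep]], axis=1)

# Stand-in conv5 grids flattened to (cells, channels) and L2-normalized.
rng = np.random.default_rng(0)
f_q = rng.standard_normal((14 * 14, 512))
f_q /= np.linalg.norm(f_q, axis=1, keepdims=True)
f_db = rng.standard_normal((14 * 14, 512))
f_db /= np.linalg.norm(f_db, axis=1, keepdims=True)

coarse = mutual_nn(f_q, f_db)  # coarse conv5 correspondences
# Fine stage (not shown): repeat mutual_nn on conv3 descriptors, but only
# within the conv3 cells covered by each matched pair of conv5 cells.
```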

The tentative matches are geometrically verified by estimating up to two homographies using RANSAC [25]. We re-rank the 100 candidates using the number of RANSAC inliers and keep the top-10 database images.
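
A sketch of this geometric verification, assuming `pts_q` and `pts_db` are float32 arrays of tentative match coordinates: up to two homographies are fitted greedily with OpenCV's RANSAC, and the total inlier count serves as the re-ranking score.

```python
import cv2
import numpy as np

def homography_inlier_count(pts_q, pts_db, thresh=3.0):
    """Total RANSAC inliers of up to two homographies fitted sequentially."""
    total = 0
    for _ in range(2):                # at most two planar models
        if len(pts_q) < 4:            # a homography needs 4 correspondences
            break
        H, mask = cv2.findHomography(pts_q, pts_db, cv2.RANSAC, thresh)
        if H is None:
            break
        inl = mask.ravel().astype(bool)
        total += int(inl.sum())
        pts_q, pts_db = pts_q[~inl], pts_db[~inl]  # fit next plane on the rest
    return total

# Re-ranking: sort the 100 candidates by this count and keep the 10 best.
```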

For each of the 10 images, the 6DoF query pose is computed by P3P-LO-RANSAC [37] (referred to as DensePE), assuming a known focal length, e.g., from EXIF data, using the inlier matches and depth (i.e. the 3D structure) associated with each database image.
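
For the pose itself, OpenCV's `solvePnPRansac` with the P3P solver is a reasonable stand-in for P3P-LO-RANSAC [37] (the local-optimization step is omitted in this sketch). `f` is the focal length assumed known, e.g. from EXIF, and the principal point is placed at the image center:

```python
import cv2
import numpy as np

def estimate_pose(pts3d, pts2d, f, width, height):
    """6DoF pose from 2D-3D matches; pts3d are the database 3D points of the
    inlier matches, pts2d the corresponding query pixels."""
    K = np.array([[f, 0.0, width / 2.0],
                  [0.0, f, height / 2.0],
                  [0.0, 0.0, 1.0]])
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        flags=cv2.SOLVEPNP_P3P, reprojectionError=8.0)
    return (rvec, tvec, inliers) if ok else None
```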

In the final pose verification step, we generate synthesized views by rendering colored 3D points while taking care of self-occlusions.

For computing the scores that measure the similarities of the query image and the image rendered from the estimated pose, we use the DenseSIFT extractor and its RootSIFT descriptor [7, 41] from VLFeat [71].

Finally, we localize the query image by the best pose among its top-10 candidates.

Evaluation metrics. We evaluate the localization accuracy as the consistency of the estimated poses with our reference poses.

We measure positional and angular differences in meters and degrees between the estimated poses and the manually verified reference poses.
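
These two errors can be computed directly from the pose matrices. A small helper, assuming `R` is the camera rotation matrix and `c` the camera center in world coordinates:

```python
import numpy as np

def pose_errors(R_est, c_est, R_ref, c_ref):
    """Positional error (m) and angular error (deg) w.r.t. the reference pose."""
    pos_err = np.linalg.norm(c_est - c_ref)          # meters
    cos = (np.trace(R_est.T @ R_ref) - 1.0) / 2.0    # angle of relative rotation
    ang_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return pos_err, ang_err
```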

5.2. Comparison with state-of-the-art methods

Direct 2D-3D matching [53, 55]. We first compare with a variation of a state-of-the-art 3D structure-based image localization approach [53].

We compute affine covariant RootSIFT features for all the database images and associate them with 3D coordinates via the known scene geometry.

Features extracted from a query image are then matched to the database 3D descriptors [46].

We select at most five database images receiving the largest numbers of matches and use all these matches together for pose estimation.

Similar to [53], we did not apply Lowe's ratio test [42] as it lowered the performance. The 6DoF query pose is finally computed by P3P-LO-RANSAC [37].

As shown in table 2, InLoc outperforms direct 2D-3D matching by a large margin (40.7% at the localization accuracy of 0.5m).
[Table 2: Comparison with state-of-the-art localization methods on the InLoc dataset. We show the rate (%) of correctly localized queries within a given distance (m) threshold and within a 10° angular-error threshold.]

We believe that this is because our large-scale indoor dataset involves many distractors and large viewpoint changes that present a major challenge for 3D structure-based methods.

Disloc [9] + sparse pose estimation (SparsePE) [51]. We next compare with the state-of-the-art image retrieval-based localization method.

Disloc represents images using bag-of-visual-words with Hamming embedding [31] while also taking local descriptor space density into account.

We use a publicly available implementation [54] of Disloc with a 200K vocabulary trained on affine covariant features [45], described by RootSIFT [7], extracted from the database images of our indoor dataset.

The top-100 candidate images shortlisted by Disloc are re-ranked by spatial verification [51] using (sparse) affine covariant features [45].

The ratio test [42] was not applied here as it was removing too many features that need to be retained in the indoor scenario.

Using the inliers, the 6DoF query pose is computed with P3P-LO-RANSAC [37].

To make a fair comparison, we use exactly the same features and P3P-LO-RANSAC for pose estimation as the direct 2D-3D matching method described above.

As shown in table 2, Disloc [9]+SparsePE [51] results in a 13.7% performance gain compared to Direct 2D-3D matching [55].

This can be attributed to the image retrieval step that discounts bursts of repetitive features.

However, the results are still significantly worse compared to our InLoc approach.

NetVLAD [6] + sparse pose estimation (SparsePE) [51].

We also evaluate a variation of the above image retrieval-based localization method.

Here the candidate shortlist is obtained by NetVLAD [6], which is then re-ranked using SparsePE [51], followed by pose estimation using P3P-LO-RANSAC [37].

This is a strong baseline building on the state-of-the-art place recognition results obtained by [6]. Interestingly, as shown in table 2, there is no significant difference between NetVLAD+SparsePE and DisLoc+SparsePE, which is in line with results reported in outdoor settings [57].

Yet, NetVLAD outperforms DisLoc (5.8% at the localization accuracy of 0.5m) before re-ranking via SparsePE (cf. figure 5) in this indoor setting (see also figure 4).

Overall, both methods, even though they represent the state-of-the-art in outdoor localization, still perform significantly worse than our proposed approach based on dense feature matching and view synthesis.

5.3. Evaluation of each component

Next, we demonstrate the benefits of the individual components of our approach.

Benefits of pose estimation using dense matching. Using the NetVLAD retrieval as the base retrieval method (figure 5 (a)), our pose estimation with dense matching (NetVLAD [6]+DensePE, blue line) consistently improves the localization rate by about 15% when compared to the state-of-the-art sparse local feature matching (NetVLAD [6]+SparsePE, green line).
[Figure 5: Influence of the individual components. Shown is the impact of dense matching (DensePE) and dense pose verification (DensePV) on the quality of the pose estimates for (a) pose candidates retrieved by NetVLAD and (b) the state-of-the-art baselines. The plots show the fraction of queries (y-axis) correctly localized within a given distance (x-axis) with a rotation error of at most 10°.]

This result supports our conclusion that dense feature matching and verification is superior to sparse feature matching for often weakly textured indoor scenes.

This effect is also clearly demonstrated in qualitative results in figure 4 (cf. columns 3 and 4).

Benefits of pose verification with view synthesis. We apply our pose verification step (DensePV) to the top-10 pose estimates obtained by different spatial re-ranking methods.

Results are shown in figure 5 and demonstrate significant and consistent improvements obtained by our pose verification approach (compare "-•-" to "—" in figure 5).

Improvements are most pronounced for position accuracy within 1.5 meters (13% or more).

Binarized representation. A binary representation (instead of floats) of features in the intermediate CNN layers significantly reduces memory requirements.

We use feature binarization that follows the standard Hamming embedding approach [31] but without dimensionality reduction.

Matching is then performed by computing Hamming distances.

This simple binarization scheme results in a negligible performance loss (less than 1% at 0.5 meters) compared to the original descriptors, which is in line with results reported for object recognition [4].

At the same time, binarization reduces the memory requirements by a factor of 32, compressing 428GB of original descriptors to just 13.4GB.
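
A minimal sketch of such a binarization, under the assumption that the per-dimension thresholds are medians computed over (a sample of) the database features: one bit per dimension, packed eight bits per byte, with matching by Hamming distance. A 512-dimensional float32 descriptor (2048 bytes) becomes 64 bytes, the 32x compression reported above.

```python
import numpy as np

def binarize(desc, thresholds):
    """One bit per dimension (no dimensionality reduction), packed to uint8."""
    return np.packbits(desc > thresholds, axis=1)

def hamming(a, b):
    """Hamming distance between packed binary codes."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 512)).astype(np.float32)  # stand-in features
th = np.median(feats, axis=0)      # per-dimension thresholds from the database
codes = binarize(feats, th)        # (1000, 64) uint8: 32x smaller than float32
d = hamming(codes[0], codes[1])
```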

Comparison with learning-based localization methods.

We have attempted a comparison with DSAC [11], which is a state-of-the-art pose estimator for indoor scenes.

Despite our best efforts, training DSAC on our indoor dataset failed to converge.

We believe this is because the RGBD scans in our database are sparsely distributed [76] and each scan has only a small overlap with neighboring scans.

Training on such a dataset is challenging for methods designed for densely captured RGBD sequences [26].

We believe this would also be the case for PoseNet [34], another method for CNN-based pose regression.

We do, however, provide a comparison with DSAC and PoseNet on much smaller datasets next.

5.4. Evaluation on other datasets

We also evaluate InLoc on two existing indoor datasets [17, 59] to confirm the relevance of our results.

The Matterport3D [17] dataset consists of RGBD scans of 90 buildings.

Each RGBD scan contains 18 images that capture the scene around the scan position with known camera poses.

We created a test set by randomly choosing 10% of the scan positions and selected their horizontal views.

This resulted in 58,074 database images and a query set of 6,726 images.
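
For reproducibility, this kind of split is straightforward to script. A hedged sketch with hypothetical scan identifiers:

```python
import random

scan_ids = [f"scan_{i:05d}" for i in range(3600)]  # hypothetical scan positions
random.seed(0)
query_scans = set(random.sample(scan_ids, len(scan_ids) // 10))  # 10% as queries
db_scans = [s for s in scan_ids if s not in query_scans]
# Each scan contributes its 18 views; the query scans keep only the
# horizontal ones, as described above.
```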

Results are shown in table 3. Our approach (InLoc) outperforms the baselines, which is in line with results on the InLoc dataset.
[Table 3: Comparison on Matterport3D [17]. Numbers show the median positional (m) and angular (degree) errors.]

We also tested PoseNet [34] and DSAC [11] on a single (the largest) building.

The test set is created in the same manner as above and contains 1,884 database images and 210 query images.

Even in this much easier case, DSAC fails to converge.

PoseNet produces large localization errors (24.8 meters and 80.0 degrees) in comparison with InLoc (0.26 meters and 2.78 degrees).

We also report results on the 7 Scenes dataset [26, 59] which is, while relatively small, a standard benchmark for indoor localization.

The 7 Scenes dataset [59] consists of geometrically-registered video frames representing seven scenes, together with associated depth images and camera poses.

Table 4 shows localization results for our approach (NetVLAD+DensePE) compared with state-of-the-art methods [11, 34, 55].
[Table 4: Evaluation on the 7 Scenes dataset [26, 59]. Numbers show the median positional (cm) and angular (degree) errors.]

Note that our approach performs comparably to these methods on this relatively small and densely captured data, while it does not need any scene-specific training (which is needed by [11, 34]).

6. Conclusion

We have presented InLoc, a new approach for large-scale indoor visual localization that estimates the 6DoF camera pose of a query image with respect to a large indoor 3D map.

To overcome the difficulties of indoor camera pose estimation, we have developed new pose estimation and verification methods that use dense feature extraction and matching in a sequence of progressively stricter verification steps.

The localization performance is evaluated on a new large indoor dataset with realistic and challenging query images captured by mobile phones.

Our results demonstrate significant improvements compared to state-of-the-art localization methods.

To encourage further progress on high-accuracy large-scale indoor localization, we make our dataset publicly available [1].

[Note] The original paper also contains an appendix, which is not translated here; please refer to the paper itself.
