[Localization Paper Reading Series] InLoc: Indoor Visual Localization with Dense Matching and View Synthesis (Part 2)


We first find geometrically consistent sets of correspondences using the coarser conv5 layer containing high-level information.

Then we refine the correspondences by searching for additional matches on the conv3 layer.

Examples in figure 4 demonstrate that our dense CNN matching (4th column) obtains better matches in indoor environments when compared to matching standard local features (3rd column), even for less-textured areas.

Notice that dense feature extraction and description require no additional computation at query time, as the intermediate convolutional layers are already computed when extracting the NetVLAD descriptors as described in section 4.1.

As will also be demonstrated in section 5, memory requirements and computational speed of feature matching can be addressed by binarizing the convolutional features without loss in matching performance.

As perspective images in our database have depth values, and hence associated 3D points, the query camera pose can be estimated by finding pixel-to-pixel correspondences between the query and the matching database image, followed by P3P-RANSAC [25].
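
The step from "depth values" to "associated 3D points" can be sketched as a standard pinhole back-projection. This is only an illustration: the intrinsic matrix `K`, the function name `backproject`, and the assumption that depth is measured along the optical axis (z-depth) are not taken from the paper.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Lift pixels (N, 2) with z-depths (N,) to 3D camera-frame points (N, 3)."""
    pixels = np.asarray(pixels, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Homogeneous pixel coordinates [u, v, 1].
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    # Viewing rays K^{-1} [u, v, 1]^T, each with z = 1.
    rays = homog @ np.linalg.inv(K).T
    # Scale each ray so that its z-coordinate equals the measured depth.
    return rays * depths[:, None]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
points = backproject([[320, 240], [420, 240]], [2.0, 4.0], K)
```

Each matched database pixel then contributes one 2D-3D correspondence (query pixel, database 3D point) to the P3P-RANSAC pose solver.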

4.3. Pose verification with view synthesis

We propose here to collect both positive and negative evidence to determine what is and is not matched.

This is achieved by harnessing the power of the high-quality RGBD image database that provides a dense and accurate 3D structure of the indoor environment.

This structure is used to render a virtual view that shows how the scene would look from the estimated query pose.

The rendered image enables us to count, in a pixel-wise manner, both positive and negative evidence by counting which regions are and are not consistent between the query image and the underlying 3D structure.

To gain invariance to illumination changes and small misalignments, we evaluate image similarity by comparing local patch descriptors (DenseRootSIFT [7, 41]) at corresponding pixel locations.

The final similarity is computed as the median of descriptor distances across the entire image while ignoring areas with missing 3D structure.
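
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the array layout `(H, W, D)`, the use of Euclidean distance, and the boolean `valid_mask` marking pixels with rendered 3D structure are all assumptions.

```python
import numpy as np

def pose_verification_score(query_desc, synth_desc, valid_mask):
    """Median per-pixel descriptor distance over pixels with valid 3D structure.

    query_desc, synth_desc: (H, W, D) dense descriptors of the query image
    and of the view rendered from the estimated pose.
    valid_mask: (H, W) boolean, False where the rendering has no 3D data.
    A lower score means the two images agree better.
    """
    dist = np.linalg.norm(query_desc - synth_desc, axis=-1)
    if not valid_mask.any():
        return float("inf")  # nothing to compare, treat as worst case
    return float(np.median(dist[valid_mask]))

# Toy 2x2 example: one pixel has no 3D structure and a huge distance.
query = np.zeros((2, 2, 2))
synth = np.array([[[1.0, 0.0], [2.0, 0.0]],
                  [[3.0, 0.0], [100.0, 0.0]]])
mask = np.array([[True, True], [True, False]])
score = pose_verification_score(query, synth, mask)
```

Using the median rather than the mean keeps a few badly mismatched regions (for example, moved furniture) from dominating the score.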

5. Experiments

We first describe the experimental setup for evaluating visual localization performance using our dataset (Section 5.1).
The proposed method, termed “InLoc”, is compared with state-of-the-art methods (Section 5.2) and we show the benefits of each component in detail (Section 5.3).

5.1. Implementation details

In the candidate pose retrieval step, we retrieve 100 candidate database images using NetVLAD.

We use the implementation provided by the authors and the pre-trained Pitts30K [6] VGG-16 [61] model to generate 4,096-dimensional NetVLAD descriptor vectors.

In the second pose estimation step, we obtain tentative correspondences by matching densely extracted convolutional features in a coarse-to-fine manner: we first find mutually nearest matches among the conv5 features and then find matches in the finer conv3 features restricted by the coarse conv5 correspondences.
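
The "mutually nearest matches" test can be sketched as below. This covers only the mutual-nearest-neighbor check on one layer; the restriction of conv3 candidates to the neighborhood of a conv5 match is not shown. Cosine similarity and the function name are assumptions made for illustration.

```python
import numpy as np

def mutual_nearest_neighbors(feats_a, feats_b):
    """Return index pairs (i, j) where a_i and b_j are each other's nearest neighbor."""
    # L2-normalize so that the dot product is the cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)  # best match in b for each feature of a
    nn_ba = sim.argmax(axis=0)  # best match in a for each feature of b
    idx_a = np.arange(len(a))
    # Keep (i, j) only if j's best match points back to i.
    keep = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

feats_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
feats_b = np.array([[0.0, 2.0], [3.0, 0.0]])
matches = mutual_nearest_neighbors(feats_a, feats_b)
```

In the toy example the third feature of `feats_a` is dropped because neither feature in `feats_b` selects it as its own nearest neighbor.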

The tentative matches are geometrically verified by estimating up to two homographies using RANSAC [25]. We re-rank the 100 candidates using the number of RANSAC inliers and keep the top-10 database images.
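
The re-ranking step amounts to sorting the shortlist by inlier count. A minimal sketch, with hypothetical image identifiers and inlier counts:

```python
def rerank_by_inliers(candidates, top_k=10):
    """candidates: list of (image_id, inlier_count) pairs.

    Returns the image ids of the top_k candidates, most inliers first.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [image_id for image_id, _ in ranked[:top_k]]

# Hypothetical shortlist of three database images.
shortlist = rerank_by_inliers(
    [("db_017", 42), ("db_003", 118), ("db_250", 7)], top_k=2
)
```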

For each of the 10 images, the 6DoF query pose is computed by P3P-LO-RANSAC [37] (referred to as DensePE), assuming a known focal length, e.g., from EXIF data, using the inlier matches and depth (i.e. the 3D structure) associated to each database image.

In the final pose verification step, we generate synthesized views by rendering colored 3D points while taking care of self-occlusions.

To compute the scores that measure the similarity between the query image and the image rendered from the estimated pose, we use the DenseSIFT extractor and its RootSIFT descriptor [7, 41] from VLFeat [71].
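
For reference, the RootSIFT transform itself is simple: L1-normalize each SIFT descriptor and take the element-wise square root. The sketch below uses plain NumPy rather than the VLFeat API, and the small `eps` guard against all-zero descriptors is an added assumption.

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """Convert SIFT descriptors (N, D), non-negative entries, to RootSIFT."""
    desc = np.asarray(descriptors, dtype=float)
    # L1-normalize each descriptor, then take the element-wise square root.
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(desc)

rs = root_sift([[1.0, 3.0], [0.0, 4.0]])
```

After this transform each descriptor has unit L2 norm, and Euclidean distance between RootSIFT vectors corresponds to comparing the original descriptors with the Hellinger kernel.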

Finally, we localize the query image using the best pose among its top-10 candidates.

Evaluation metrics. We evaluate the localization accuracy as the consistency of the estimated poses with our reference poses.

We measure positional and angular differences in meters and degrees between the estimated poses and the manually verified reference poses.
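
A common way to compute these two errors is sketched below. The conventions are assumptions, not taken from the paper: `t_est` and `t_ref` are camera positions in a shared world frame (in meters), and the angular error is the angle of the relative rotation between the two orientation matrices.

```python
import numpy as np

def pose_errors(R_est, t_est, R_ref, t_ref):
    """Return (position error, angular error in degrees) between two poses."""
    pos_err = float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_ref)))
    # Relative rotation; its angle follows from trace(R) = 1 + 2 cos(theta).
    R_rel = np.asarray(R_est).T @ np.asarray(R_ref)
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return pos_err, float(np.degrees(np.arccos(cos_angle)))

# Example: estimate is rotated 10 degrees about z and offset by 0.5 m.
theta = np.radians(10.0)
R_est = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
pos_err, ang_err = pose_errors(R_est, [0.3, 0.4, 0.0], np.eye(3), [0.0, 0.0, 0.0])
```

The `np.clip` call guards against values slightly outside [-1, 1] caused by floating-point rounding.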

5.2. Comparison with the state-of-the-art methods

Direct 2D-3D matching [53, 55]. We first compare with a variation of a state-of-the-art 3D structure-based image localization approach [53].

We compute affine covariant RootSIFT features for all the database images and associate them with 3D coordinates via the known scene geometry.

Features extracted from a query image are then matched to the database 3D descriptors [46].

We select at most five database images receiving the largest numbers of matches and use all these matches together for pose estimation.

Similar to [53], we did not apply Lowe’s ratio test [42] as it lowered the performance. The 6DoF query pose is finally computed by P3P-LO-RANSAC [37].

As shown in table 2, InLoc outperforms direct 2D-3D matching by a large margin (40.7% at the localization accuracy of 0.5 m).

We believe that this is because our large-scale indoor dataset involves many distractors and large viewpoint changes that present a major challenge for 3D structure-based methods.

DisLoc [9] + sparse pose estimation (SparsePE) [51]. We next compare with the state-of-the-art image retrieval-based localization method.

DisLoc represents images using bag-of-visual-words with Hamming embedding [31] while also taking local descriptor space density into account.

We use a publicly available implementation [54] of DisLoc with a 200K vocabulary trained on affine covariant features [45], described by RootSIFT [7], extracted from the database images of our indoor dataset.

The top-100 candidate images shortlisted by DisLoc are re-ranked by spatial verification [51] using (sparse) affine covariant features [45].

The ratio test [42] was not applied here as it was removing too many features that need to be retained in the indoor scenario.

Using the inliers, the 6DoF query pose is computed with P3P-LO-RANSAC [37].

To make a fair comparison, we use exactly the same features and P3P-LO-RANSAC for pose estimation as the direct 2D-3D matching method described above.

As shown in table 2, DisLoc [9]+SparsePE [51] results in a 13.7% performance gain compared to direct 2D-3D matching [55].

This can be attributed to the image retrieval step, which discounts bursts of repetitive features.

However, the results are still significantly worse compared to our InLoc approach.

NetVLAD [6] + sparse pose estimation (SparsePE) [51].

We also evaluate a variation of the above image retrieval-based localization method.

Here the candidate shortlist is obtained by NetVLAD [6], which is then re-ranked using SparsePE [51], followed by pose estimation using P3P-LO-RANSAC [37].

This is a strong baseline building on the state-of-the-art place recognition results obtained by [6]. Interestingly, as shown in table 2, there is no significant difference between NetVLAD+SparsePE and DisLoc+SparsePE, which is in line with results reported in outdoor settings [57].

Yet, NetVLAD outperforms DisLoc (5.8% at the localization accuracy of 0.5 m) before re-ranking via SparsePE (cf. figure 5) in this indoor setting (see also figure 4).

Overall, both methods, even though they represent the state-of-the-art in outdoor localization, still perform significantly worse than our proposed approach based on dense feature matching and view synthesis.

5.3. Evaluation of each component

Next, we demonstrate the benefits of the individual components of our approach.

Benefits of pose estimation using dense matching. Using the NetVLAD retrieval as the base retrieval method (Figure 5 (a)), our pose estimation with dense matching (NetVLAD [6]+DensePE (blue line)) consistently improves the localization rate by about 15% when compared to the state-of-the-art sparse local feature matching (NetVLAD [6]+SparsePE (green line)).

Figure 5: Impact of different components. The graphs show the impact of dense matching (DensePE) and dense pose verification (DensePV) on pose estimation quality for (a) pose candidates retrieved by NetVLAD and (b) state-of-the-art baselines. Plots show the fraction of correctly localized queries (y-axis) within a certain distance (x-axis) whose rotation error is at most 10°.
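
The quantity plotted in these curves, the fraction of queries localized within a distance threshold and with rotation error of at most 10°, can be computed as in the sketch below. The error values are made-up illustrative numbers, not results from the paper.

```python
def localization_rate(errors, max_dist_m, max_angle_deg=10.0):
    """Fraction of queries within both the distance and the angle threshold.

    errors: list of (position_error_m, angle_error_deg), one pair per query.
    """
    if not errors:
        return 0.0
    hits = sum(1 for d, a in errors if d <= max_dist_m and a <= max_angle_deg)
    return hits / len(errors)

# Hypothetical per-query errors (meters, degrees).
errors = [(0.2, 3.0), (0.4, 12.0), (0.9, 5.0), (3.0, 1.0)]
rate_05 = localization_rate(errors, 0.5)
rate_10 = localization_rate(errors, 1.0)
```

Sweeping `max_dist_m` over a range of thresholds yields one curve of the kind shown in the figure.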

This result supports our conclusion that dense feature matching and verification is superior to sparse feature matching for often weakly textured indoor scenes.

This effect is also clearly demonstrated in qualitative results in figure 4 (cf. columns 3 and 4).

Benefits of pose verification with view synthesis. We apply our pose verification step (DensePV) to the top-10 pose estimates obtained by different spatial re-ranking methods.

Results are shown in figure 5 and demonstrate significant and consistent improvements obtained by our pose verification approach (compare “-•-” to “—” in figure 5).

Improvements are most pronounced for the position accuracy within 1.5 meters (13% or more).

Binarized representation. A binary representation (instead of floats) of features in the intermediate CNN layers significantly reduces memory requirements.

We use feature binarization that follows the standard Hamming embedding approach [31] but without dimensionality reduction.

Matching is then performed by computing Hamming distances.
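
A minimal sketch of this kind of binarization and Hamming matching is given below. Thresholding each dimension at its median over a training set follows the usual Hamming-embedding recipe, but the exact thresholds used by the authors are not stated here, so treat that detail and all function names as assumptions.

```python
import numpy as np

def fit_thresholds(train_desc):
    """Per-dimension median over a training set of descriptors (N, D)."""
    return np.median(train_desc, axis=0)

def binarize(desc, thresholds):
    """One bit per dimension, packed into bytes: (N, D) -> (N, ceil(D / 8))."""
    bits = (desc > thresholds).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distances(codes_a, codes_b):
    """All-pairs Hamming distances between two sets of packed codes."""
    xor = codes_a[:, None, :] ^ codes_b[None, :, :]
    # Count the differing bits of every pair.
    return np.unpackbits(xor, axis=2).sum(axis=2)

train = np.array([[0.0] * 4, [1.0] * 4, [2.0] * 4])
thr = fit_thresholds(train)                              # median is 1.0 per dimension
a = binarize(np.array([[2.0, 2.0, 0.0, 0.0]]), thr)      # bits 1100
b = binarize(np.array([[2.0, 0.0, 0.0, 2.0],             # bits 1001
                       [2.0, 2.0, 0.0, 0.0]]), thr)      # bits 1100
dists = hamming_distances(a, b)
```

The all-pairs XOR builds an array of size `len(a) * len(b) * bytes_per_code`, so a real system would process the descriptors in chunks.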

This simple binarization scheme results in a negligible performance loss (less than 1% at 0.5 meters) compared to the original descriptors, which is in line with results reported for object recognition [4].

At the same time, binarization reduces the memory requirements by a factor of 32, compressing 428GB of original descriptors to just 13.4GB.
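
The factor of 32 is consistent with replacing each 32-bit floating-point value by a single bit (the 32-bit float storage is an assumption; the text only says "floats"). The reported sizes agree with that factor:

```latex
\frac{32\ \text{bits per dimension}}{1\ \text{bit per dimension}} = 32,
\qquad
\frac{428\ \text{GB}}{32} = 13.375\ \text{GB} \approx 13.4\ \text{GB}
```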

Comparison with learning based localization methods.

We have attempted a comparison with DSAC [11], which is a state-of-the-art pose estimator for indoor scenes.

Despite our best efforts, training DSAC on our indoor dataset failed to converge.

We believe this is because the RGBD scans in our database are sparsely distributed [76] and each scan has only a small overlap with neighboring scans.

Training on such a dataset is challenging for methods designed for densely captured RGBD sequences [26].

We believe this would also be the case for PoseNet [34], another method for CNN-based pose regression.

We do provide the comparison with DSAC and PoseNet on much smaller datasets next.

5.4. Evaluation on other datasets

We also evaluate InLoc on two existing indoor datasets [17, 59] to confirm the relevance of our results.

The Matterport3D [17] dataset consists of RGBD scans of 90 buildings.

Each RGBD scan contains 18 images that capture the scene around the scan position with known camera poses.

We created a test set by randomly choosing 10% of the scan positions and selected their horizontal views.

This resulted in 58,074 database images and a query set of 6,726 images.

Results are shown in table 3. Our approach (InLoc) outperforms the baselines, which is in line with results on the InLoc dataset.

We also tested PoseNet [34] and DSAC [11] on a single (the largest) building.

The test set is created in the same manner as above and contains 1,884 database images and 210 query images.

Even in this much easier case, DSAC fails to converge.

PoseNet produces large localization errors (24.8 meters and 80.0 degrees) in comparison with InLoc (0.26 meters and 2.78 degrees).

We also report results on the 7 Scenes dataset [26, 59], which, while relatively small, is a standard benchmark for indoor localization.

The 7 Scenes dataset [59] consists of geometrically registered video frames representing seven scenes, together with associated depth images and camera poses.

Table 4 shows localization results for our approach (NetVLAD+DensePE) compared with state-of-the-art methods [11, 34, 55].

Table 4: Evaluation on the 7 Scenes dataset [26, 59]. Numbers show the median positional (cm) and angular (degrees) errors.

Note that our approach performs comparably to these methods on this relatively small and densely captured data, while it does not need any scene-specific training (which is needed by [11, 34]).

6. Conclusion

We have presented InLoc – a new approach for large-scale indoor visual localization that estimates the 6DoF camera pose of a query image with respect to a large indoor 3D map.

To overcome the difficulties of indoor camera pose estimation, we have developed new pose estimation and verification methods that use dense feature extraction and matching in a sequence of progressively stricter verification steps.

The localization performance is evaluated on a new large indoor dataset with realistic and challenging query images captured by mobile phones.

Our results demonstrate significant improvements compared to state-of-the-art localization methods.

To encourage further progress on high-accuracy large-scale indoor localization, we make our dataset publicly available [1].

[Note]: The paper continues with an appendix, which is not translated here; please refer to the original paper.
